How I Used Azure AI Vision for OCR in My WhatsApp Kitchen Assistant
Introduction
When building Annapurna, my WhatsApp-based kitchen assistant, I wanted users to simply snap a photo of their grocery bill and have their pantry inventory update automatically. The magic behind this feature? Azure AI Vision’s OCR (Optical Character Recognition) service. In this post, I’ll walk you through exactly how I integrated Azure, the challenges I faced, and how the whole process works—from image to inventory.
Why Azure AI Vision?
There are many OCR solutions out there, but Azure’s Computer Vision API stood out for its:
- Accuracy: Handles messy receipts, various fonts, and even handwritten text.
- Speed: Returns results in seconds, perfect for a chat-based experience.
- Ease of Integration: Simple REST API, great Python SDK, and generous free tier.
How to Get Your Azure Computer Vision API Key
1. Sign in to the Azure Portal.
2. Create a Resource:
- Click “Create a resource.”
- Search for “Computer Vision” and select it.
- Click “Create.”
3. Configure the Resource:
- Choose your subscription and resource group.
- Enter a unique resource name.
- Select a region (choose one close to your users).
- Click “Review + create,” then “Create.”
4. Get Your Keys and Endpoint:
- Once deployment is complete, go to your new Computer Vision resource.
- In the left sidebar, click “Keys and Endpoint.”
- Copy one of the keys (Key 1 or Key 2) and the endpoint URL.
5. Add to Your Environment:
- Set these values in your .env file or Railway environment variables:
VISION_KEY=your_azure_vision_key
VISION_ENDPOINT=your_azure_vision_endpoint
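For reference, here is a minimal sketch of loading those two values and constructing the client in Python. It assumes the azure-cognitiveservices-vision-computervision package plus python-dotenv for local development; the variable names mirror the ones above.

# Minimal sketch: load credentials from the environment and build the client.
import os
from dotenv import load_dotenv
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

load_dotenv()  # reads VISION_KEY and VISION_ENDPOINT from a local .env file

VISION_KEY = os.environ["VISION_KEY"]
VISION_ENDPOINT = os.environ["VISION_ENDPOINT"]

client = ComputerVisionClient(
    endpoint=VISION_ENDPOINT,
    credentials=CognitiveServicesCredentials(VISION_KEY),
)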
The End-to-End Process: From Photo to Pantry
Let’s break down the journey of a grocery bill through Annapurna:
1. User Uploads a Grocery Bill
- The user sends a photo of their grocery bill via WhatsApp.
- Puch AI forwards the image to my MCP server.
2. Image Sent to Azure AI Vision
- The backend receives the image (as a URL or binary data).
- Using the Azure Computer Vision SDK, I send the image to the OCR endpoint:
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

client = ComputerVisionClient(
    endpoint=VISION_ENDPOINT,
    credentials=CognitiveServicesCredentials(VISION_KEY)
)

# The Read API is asynchronous: this call starts the operation and returns
# a raw response whose Operation-Location header points to the result.
read_response = client.read_in_stream(image_stream, raw=True)
3. Extracting Text from the Response
- Azure processes the image and returns a structured response with detected text lines and words.
- I parse the response to extract item names and quantities, handling common receipt quirks (like “2x Apple” or “Banana 1kg”).
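Because the Read API is asynchronous, read_in_stream only kicks off a pending operation; the actual text has to be polled for. Here is a minimal sketch of that step, continuing from the client and read_response in the snippet above (the one-second polling loop is my simplification):

import time
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes

# The Operation-Location header holds the URL to poll; its last segment is the operation ID.
operation_id = read_response.headers["Operation-Location"].split("/")[-1]

# Poll until Azure finishes analyzing the image.
while True:
    read_result = client.get_read_result(operation_id)
    if read_result.status not in (OperationStatusCodes.not_started, OperationStatusCodes.running):
        break
    time.sleep(1)

# Walk the structured response: pages -> lines (each line also exposes its words).
if read_result.status == OperationStatusCodes.succeeded:
    for page in read_result.analyze_result.read_results:
        for line in page.lines:
            print(line.text)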
4. Smart Parsing & Cleaning
- Not all receipts are created equal! I use regular expressions and fuzzy matching (a simplified sketch follows this list) to:
- Ignore prices, totals, and store info.
- Normalize item names (e.g., “TOMATOES” → “tomato”).
- Extract quantities and units.
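The exact rules evolve with the receipts I see, but a simplified sketch of this cleaning pass could look like the following. The regex pattern, unit list, item catalog, and fuzzy threshold here are illustrative stand-ins, not the production rules:

import re
from difflib import get_close_matches  # stand-in for a fuzzy-matching library

KNOWN_ITEMS = ["tomato", "apple", "banana", "milk", "rice"]  # illustrative catalog
SKIP_WORDS = ("total", "subtotal", "gst", "cash", "change", "invoice")

def parse_receipt_line(line):
    """Return (item, quantity, unit) for a receipt line, or None if it should be ignored."""
    text = line.lower().strip()
    if any(word in text for word in SKIP_WORDS):
        return None  # totals, payment lines, store info

    # Handle patterns like "2x Apple" or "Banana 1kg" (simplified).
    match = re.match(r"(?:(\d+)\s*x\s*)?([a-z ]+?)\s*(\d+(?:\.\d+)?)?\s*(kg|g|l|ml|pcs)?\s*$", text)
    if not match:
        return None

    count, name, amount, unit = match.groups()
    quantity = float(amount) if amount else float(count or 1)

    # Normalize the name against known pantry items (e.g. "tomatoes" -> "tomato").
    candidates = get_close_matches(name.strip().rstrip("s"), KNOWN_ITEMS, n=1, cutoff=0.8)
    if not candidates:
        return None
    return candidates[0], quantity, unit or "pcs"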
5. Inventory Update
- Parsed items are matched against the user’s existing inventory in PostgreSQL.
- New items are added, quantities are updated, and the user gets a confirmation message on WhatsApp.
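Conceptually this is an upsert per parsed item. A minimal sketch with psycopg2 follows; the pantry_items table and its columns are hypothetical names for illustration, not Annapurna’s actual schema:

import psycopg2  # assumed driver; any PostgreSQL client works the same way

def update_inventory(conn, user_id, items):
    """Upsert parsed (name, quantity, unit) tuples into a hypothetical pantry table."""
    with conn.cursor() as cur:
        for name, quantity, unit in items:
            cur.execute(
                """
                INSERT INTO pantry_items (user_id, item_name, quantity, unit)
                VALUES (%s, %s, %s, %s)
                ON CONFLICT (user_id, item_name)
                DO UPDATE SET quantity = pantry_items.quantity + EXCLUDED.quantity,
                              unit = EXCLUDED.unit
                """,
                (user_id, name, quantity, unit),
            )
    conn.commit()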
Challenges & Solutions
- Messy Receipts: Some receipts are crumpled, faded, or in odd layouts. Azure’s OCR is robust, but I added post-processing to handle edge cases.
- Multiple Languages: Azure supports many languages, so users can upload bills in their local language.
- Speed: For a snappy user experience, I run OCR and parsing asynchronously and send a “processing” message if needed.
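For the speed point, the idea is to acknowledge the user right away and push the blocking OCR work off the event loop. A rough sketch of that pattern with asyncio, reusing scan_grocery_bill (shown in the next section) and the parse_receipt_line sketch above; send_whatsapp_message is a hypothetical callback:

import asyncio

async def handle_bill_photo(image_stream, send_whatsapp_message):
    """Acknowledge immediately, then run the blocking OCR + parsing in a worker thread."""
    await send_whatsapp_message("Got your bill! Processing it now...")

    # scan_grocery_bill is synchronous (it blocks on the Azure call),
    # so run it in a thread to keep the event loop responsive.
    lines = await asyncio.to_thread(scan_grocery_bill, image_stream)

    items = [parsed for parsed in map(parse_receipt_line, lines) if parsed]
    await send_whatsapp_message(f"Added {len(items)} items to your pantry.")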
Code Snippet: The Core OCR Call
def scan_grocery_bill(image_stream):
    client = ComputerVisionClient(
        endpoint=VISION_ENDPOINT,
        credentials=CognitiveServicesCredentials(VISION_KEY)
    )
    # Kick off the asynchronous Read operation; it does not return results directly.
    read_response = client.read_in_stream(image_stream, raw=True)
    operation_id = read_response.headers["Operation-Location"].split("/")[-1]

    # Poll until Azure finishes analyzing the image.
    while True:
        result = client.get_read_result(operation_id)
        if result.status not in (OperationStatusCodes.not_started, OperationStatusCodes.running):
            break
        time.sleep(1)

    lines = []
    if result.status == OperationStatusCodes.succeeded:
        for read_result in result.analyze_result.read_results:
            for line in read_result.lines:
                lines.append(line.text)
    return lines
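Calling it is just a matter of opening the image and passing the stream (the file name here is only an example):

# Example: run the scan on a saved photo of a bill.
with open("grocery_bill.jpg", "rb") as image_file:
    for text_line in scan_grocery_bill(image_file):
        print(text_line)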
Final Thoughts
Integrating Azure AI Vision’s OCR turned a tedious task—manually logging groceries—into a magical, one-snap experience. Users love the convenience, and I love how easy Azure made it to add real-world intelligence to my bot.
If you’re building anything that needs to “see” and understand images, give Azure’s Computer Vision API a try. It’s the secret sauce behind Annapurna’s smart kitchen magic!
Read the full implementation at the Annapurna-bot blog.