The Problem Worth Solving
The classic "what do I cook tonight" problem. You open the fridge, stare at it for a few seconds, and close it again. Nothing inspiring. You have ingredients, but you can't quite bridge the gap between "things in my fridge" and "an actual meal."
I wanted to build something that closed that gap in the laziest possible way: take a photo of your ingredients, get a recipe back. No typing, no ingredient lists, no searching. Just point your camera and go.
That's FlavorAI. And building it turned out to be a genuinely interesting exercise in multi-modal AI, prompt engineering, and how the AWS Bedrock API handles different input types.
The Architecture
The app is a single Python file with two main stages:
- Ingredient detection — send an image or video to a vision model, get back a structured list of ingredients with quantities.
- Recipe generation — send the ingredient list to a language model, get back a full recipe with nutrition info and steps.
Both stages run on AWS Bedrock, using Amazon's Nova model family. The frontend is a Gradio app with two tabs — one for images, one for videos. The whole thing is containerised with Docker.
Stage 1: Ingredient Detection with Nova Vision
The first call goes to amazon.nova-pro-v1:0 — the most capable model in the
Nova family and the right choice for vision tasks that need detailed extraction.
The system prompt sets the context clearly:
"You are an expert media analyst and a professional in identifying
food ingredients from visual content." The user prompt asks for a structured JSON response — just ingredient names and estimated quantities. Keeping the output schema strict here matters a lot: it makes the second stage more reliable because you're feeding clean, consistent data into the recipe model rather than a blob of free-form text.
One thing that surprised me: Bedrock handles image and video inputs through
the same converse API, but the payload structure differs slightly.
For images you pass an image content block; for video you pass a
video content block. The model is smart enough to extract frame-level
content from videos automatically — you don't need to sample frames yourself.
Handling both input types
For images, the code reads the file, base64-encodes it, and builds an
image/jpeg content block. For videos, it does the same with
video/mp4. Both go through the same downstream pipeline. In practice,
video is useful when a single frame isn't enough — for example, if someone
pans across a pantry shelf.
Stage 2: Recipe Generation with Nova Micro
Once I have the ingredient list, the second call goes to
amazon.nova-micro-v1:0 — a smaller, faster model that's well-suited
for structured text generation.
The prompt asks for a complete recipe in a specific JSON schema:
- Recipe title
- Cooking time and serving size
- Nutritional information (calories, protein, fat, carbohydrates)
- Ingredient list with measurements
- Step-by-step cooking instructions
The model also accepts optional user preferences — cuisine type, dietary restrictions, meal type — which get appended to the prompt if provided.
What I Learned
- Chaining models intentionally pays off. Using a vision model purely for extraction and a language model purely for generation kept each stage focused and made debugging much easier.
- Schema enforcement via prompting is brittle at scale. For a POC this works fine, but in production you'd want Bedrock's structured output features or a validation layer to guarantee the JSON shape.
- Nova's video understanding is genuinely impressive. It correctly identified ingredients from a short pan-across-shelf video without any frame sampling on my end.
- Gradio ships MVPs fast. The entire frontend took maybe two hours.
What's Next
- Add structured output schema to the Bedrock call so the JSON is guaranteed, not just prompted for.
- Store past recipes and ingredient scans — useful for building a personal recipe library over time.
- Deploy to AWS App Runner or ECS so it's always-on.
- Fine-tune on food imagery for better ingredient detection.
Source code: github.com/omishagupta/FlavorAI