Building FlavorAI: From Fridge Photo to Recipe with AWS Bedrock

The Problem Worth Solving

The classic "what do I cook tonight" problem. You open the fridge, stare at it for a few seconds, and close it again. Nothing inspiring. You have ingredients, but you can't quite bridge the gap between "things in my fridge" and "an actual meal."

I wanted to build something that closed that gap in the laziest possible way: take a photo of your ingredients, get a recipe back. No typing, no ingredient lists, no searching. Just point your camera and go.

That's FlavorAI. And building it turned out to be a genuinely interesting exercise in multi-modal AI, prompt engineering, and how the AWS Bedrock API handles different input types.

The Architecture

The app is a single Python file with two main stages:

Ingredient detection — send an image or video to a vision model, get back a structured list of ingredients with quantities.
Recipe generation — send the ingredient list to a language model, get back a full recipe with nutrition info and steps.

Both stages run on AWS Bedrock, using Amazon's Nova model family. The frontend is a Gradio app with two tabs — one for images, one for videos. The whole thing is containerised with Docker.

STACK

Python · AWS Bedrock (Amazon Nova) · Gradio · Docker · boto3

Stage 1: Ingredient Detection with Nova Vision

The first call goes to amazon.nova-pro-v1:0 — the most capable model in the Nova family and the right choice for vision tasks that need detailed extraction.

The system prompt sets the context clearly:

"You are an expert media analyst and a professional in identifying
food ingredients from visual content."

The user prompt asks for a structured JSON response — just ingredient names and estimated quantities. Keeping the output schema strict here matters a lot: it makes the second stage more reliable because you're feeding clean, consistent data into the recipe model rather than a blob of free-form text.

One thing that surprised me: Bedrock handles image and video inputs through the same converse API, but the payload structure differs slightly. For images you pass an image content block; for video you pass a video content block. The model is smart enough to extract frame-level content from videos automatically — you don't need to sample frames yourself.

Handling both input types

For images, the code reads the file, base64-encodes it, and builds an image/jpeg content block. For videos, it does the same with video/mp4. Both go through the same downstream pipeline. In practice, video is useful when a single frame isn't enough — for example, if someone pans across a pantry shelf.

Stage 2: Recipe Generation with Nova Micro

Once I have the ingredient list, the second call goes to amazon.nova-micro-v1:0 — a smaller, faster model that's well-suited for structured text generation.

The prompt asks for a complete recipe in a specific JSON schema:

Recipe title
Cooking time and serving size
Nutritional information (calories, protein, fat, carbohydrates)
Ingredient list with measurements
Step-by-step cooking instructions

The model also accepts optional user preferences — cuisine type, dietary restrictions, meal type — which get appended to the prompt if provided.

PROMPT TIP

Asking the model for JSON output and specifying the exact schema in the prompt dramatically improves consistency. Adding "respond with only the JSON object, no preamble or explanation" eliminates the wrapper text that would otherwise break your parser.

What I Learned

Chaining models intentionally pays off. Using a vision model purely for extraction and a language model purely for generation kept each stage focused and made debugging much easier.
Schema enforcement via prompting is brittle at scale. For a POC this works fine, but in production you'd want Bedrock's structured output features or a validation layer to guarantee the JSON shape.
Nova's video understanding is genuinely impressive. It correctly identified ingredients from a short pan-across-shelf video without any frame sampling on my end.
Gradio ships MVPs fast. The entire frontend took maybe two hours.

What's Next

Add structured output schema to the Bedrock call so the JSON is guaranteed, not just prompted for.
Store past recipes and ingredient scans — useful for building a personal recipe library over time.
Deploy to AWS App Runner or ECS so it's always-on.
Fine-tune on food imagery for better ingredient detection.

Source code: github.com/omishagupta/FlavorAI