website logo
← back to writing
Apr 2025 · 8 min read · AI / AWS

Building FlavorAI: From Fridge Photo to Recipe with AWS Bedrock

How I built a multi-modal AI app that identifies food ingredients from images and videos using Amazon Nova, then generates complete recipes with nutrition info and step-by-step instructions.

The Problem Worth Solving

The classic "what do I cook tonight" problem. You open the fridge, stare at it for a few seconds, and close it again. Nothing inspiring. You have ingredients, but you can't quite bridge the gap between "things in my fridge" and "an actual meal."

I wanted to build something that closed that gap in the laziest possible way: take a photo of your ingredients, get a recipe back. No typing, no ingredient lists, no searching. Just point your camera and go.

That's FlavorAI. And building it turned out to be a genuinely interesting exercise in multi-modal AI, prompt engineering, and how the AWS Bedrock API handles different input types.

The Architecture

The app is a single Python file with two main stages:

  1. Ingredient detection — send an image or video to a vision model, get back a structured list of ingredients with quantities.
  2. Recipe generation — send the ingredient list to a language model, get back a full recipe with nutrition info and steps.

Both stages run on AWS Bedrock, using Amazon's Nova model family. The frontend is a Gradio app with two tabs — one for images, one for videos. The whole thing is containerised with Docker.

STACK
Python · AWS Bedrock (Amazon Nova) · Gradio · Docker · boto3

Stage 1: Ingredient Detection with Nova Vision

The first call goes to amazon.nova-pro-v1:0 — the most capable model in the Nova family and the right choice for vision tasks that need detailed extraction.

The system prompt sets the context clearly:

"You are an expert media analyst and a professional in identifying
food ingredients from visual content."

The user prompt asks for a structured JSON response — just ingredient names and estimated quantities. Keeping the output schema strict here matters a lot: it makes the second stage more reliable because you're feeding clean, consistent data into the recipe model rather than a blob of free-form text.

One thing that surprised me: Bedrock handles image and video inputs through the same converse API, but the payload structure differs slightly. For images you pass an image content block; for video you pass a video content block. The model is smart enough to extract frame-level content from videos automatically — you don't need to sample frames yourself.

Handling both input types

For images, the code reads the file, base64-encodes it, and builds an image/jpeg content block. For videos, it does the same with video/mp4. Both go through the same downstream pipeline. In practice, video is useful when a single frame isn't enough — for example, if someone pans across a pantry shelf.

Stage 2: Recipe Generation with Nova Micro

Once I have the ingredient list, the second call goes to amazon.nova-micro-v1:0 — a smaller, faster model that's well-suited for structured text generation.

The prompt asks for a complete recipe in a specific JSON schema:

The model also accepts optional user preferences — cuisine type, dietary restrictions, meal type — which get appended to the prompt if provided.

PROMPT TIP
Asking the model for JSON output and specifying the exact schema in the prompt dramatically improves consistency. Adding "respond with only the JSON object, no preamble or explanation" eliminates the wrapper text that would otherwise break your parser.

What I Learned

What's Next


Source code: github.com/omishagupta/FlavorAI