Teaching an AI to understand what you ate
On December 7th, four days into the project, I made the commit that would define WhispCal's identity: "Introduce authentication with Apple/Google and Supabase, and integrate Gemini AI for food parsing and recipe generation."
That single commit bundled together auth, a database, and an AI integration. The ambition-to-planning ratio was dangerously high.
The problem with food
Here's the thing about nutrition tracking that most calorie counter apps get wrong: people don't think in grams. Nobody says "I ate 142 grams of chicken breast with 85 grams of steamed broccoli." They say "I had some chicken and broccoli for lunch" or "leftover pasta, maybe two cups" or "a handful of almonds while I was on a call."
I wanted WhispCal to understand that. Not force users into a search-and-weigh workflow, but actually parse natural language descriptions of meals into structured nutritional data.
Gemini enters the chat
I started with Google's Gemini API. The first version was embarrassingly simple — take the user's text input, wrap it in a prompt asking for structured JSON with food items and estimated macros, parse the response. It worked surprisingly well for straightforward inputs.
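That first version can be sketched in a few lines. This is an illustrative reconstruction, not WhispCal's actual code: the prompt wording and JSON field names are assumptions, and the Gemini call itself (via Google's `google-generativeai` SDK) is shown only as a comment so the parsing logic stands alone.

```python
import json

def build_prompt(meal_text: str) -> str:
    # Ask the model for JSON only; field names here are illustrative.
    return (
        "Parse this meal description into JSON with a top-level key "
        '"items", where each item has "name", "quantity", "calories", '
        '"protein_g", "carbs_g", and "fat_g". Estimate portions when '
        "the user is vague. Respond with JSON only.\n\n"
        f"Meal: {meal_text}"
    )

def parse_response(raw: str) -> list[dict]:
    # Models sometimes wrap JSON in markdown fences; strip them first.
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(cleaned)["items"]

# The live call would look roughly like:
#   model = genai.GenerativeModel("gemini-1.5-flash")
#   raw = model.generate_content(build_prompt(user_text)).text
fake_reply = (
    '{"items": [{"name": "chicken breast", "quantity": "1 serving", '
    '"calories": 230, "protein_g": 43, "carbs_g": 0, "fat_g": 5}]}'
)
print(parse_response(fake_reply)[0]["name"])  # chicken breast
```

The fence-stripping line is the kind of small defensive hack that "surprisingly well" usually hides: even when asked for JSON only, models occasionally wrap their answer in markdown.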
The next day I added voice integration. That commit message — "UI/UX changes and voice integration!" — has an exclamation mark, which tells you how excited I was. Speaking your meal into the phone and watching it appear as structured nutritional data felt like magic.
Then came barcode scanning. Scan a product, look up the nutritional data, add it to your log. Between voice, text, and barcode, I had three input methods within a week.
The model-switching saga
What the commits don't show is the experimentation behind each AI decision. On December 15th: "Update generative model in handleParseFoodLog function." The next day: "Upgrade Llama model to llama-3.3-70b-versatile." Then: "change api to gemini for fetching food."
I was bouncing between models like someone trying on shoes. Gemini was good at structured output and understood complex meals, but it was expensive and sometimes over-interpreted simple inputs. Llama 3.3 through Groq was fast and cheap but occasionally hallucinated nutritional values.
The solution I landed on was using versioned model configurations — different models for different tasks. Food parsing needed accuracy, so Gemini. Recipe generation could tolerate more creativity, so I could use lighter models there. It's not elegant architecture. It's pragmatic triage.
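The triage boils down to a routing table. A minimal sketch of the idea, using the model versions named in the commits as examples (the actual config structure in WhispCal is an assumption):

```python
# Route each task to the model whose trade-offs fit it:
# accuracy-critical parsing -> Gemini; creative, cheaper work -> Llama via Groq.
MODEL_CONFIG = {
    "food_parsing":      {"provider": "gemini", "model": "gemini-1.5-flash"},
    "recipe_generation": {"provider": "groq",   "model": "llama-3.3-70b-versatile"},
}

def model_for(task: str) -> str:
    cfg = MODEL_CONFIG[task]
    return f'{cfg["provider"]}:{cfg["model"]}'

print(model_for("food_parsing"))  # gemini:gemini-1.5-flash
```

The payoff of keeping this in one table is that "change api to gemini for fetching food" becomes a one-line config edit instead of a hunt through call sites.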
When the AI fights back
The most frustrating bug I encountered was the AI modifying existing items in the user's tray. You'd have three items logged, ask to add a banana, and the AI would helpfully "correct" your previous entries too. Commits from January tell the story: "explicitly asking not to remove data from the existing food items in the prompt" and "added a safe way to NOT change the current tray items unless the user asks for it."
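The commit messages describe a two-layer fix: tell the model not to touch existing items, then enforce it anyway. Here is a sketch of that second, enforcement layer (hypothetical code, not WhispCal's actual implementation): after the model responds, restore any existing tray item it "corrected" and keep only genuinely new additions.

```python
def merge_tray(existing: list[dict], ai_result: list[dict]) -> list[dict]:
    """Trust the AI only for new items; never let it rewrite existing ones."""
    by_name = {item["name"]: item for item in ai_result}
    merged = []
    for item in existing:
        merged.append(item)          # keep the user's original entry verbatim
        by_name.pop(item["name"], None)  # discard the AI's rewritten copy
    merged.extend(by_name.values())  # only genuinely new items survive
    return merged

tray = [{"name": "oatmeal", "calories": 300}]
ai = [
    {"name": "oatmeal", "calories": 150},  # the AI "helpfully" corrected this
    {"name": "banana", "calories": 105},   # the user actually asked for this
]
print(merge_tray(tray, ai))
# [{'name': 'oatmeal', 'calories': 300}, {'name': 'banana', 'calories': 105}]
```

Belt and braces: the prompt instruction reduces how often the model misbehaves, and the merge guarantees that when it does, the user never sees it.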
Prompt engineering for a production app is a completely different discipline from playing with ChatGPT. You need the model to be helpful but constrained. Creative but predictable. Conversational but structured. Every edge case is a new line in your prompt, and every new line is a potential conflict with an existing instruction.
Voice, camera, and the input explosion
By mid-December I had text, voice, barcode scanning, and — in a moment of ambition I'd later question — a camera input for photographing meals. The commit message says it all: "add camera input field (might regret that later)."
Each input method created its own parsing challenges. Voice transcription adds its own errors before the AI even sees the text. Barcode data from different databases has inconsistent formats. Camera-based food recognition is an unsolved problem in computer vision.
But the core insight held: meet users where they are. Some people want to type "oatmeal with honey." Others want to scan a barcode. The AI layer normalizes all of it into the same structured format. That's the real product — not any single input method, but the intelligence that sits behind all of them.
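That normalization layer can be pictured as a single target type that every input path converges on. A sketch under assumed field names (the real app's schema is unknown):

```python
from dataclasses import dataclass

@dataclass
class FoodEntry:
    """The one structured format every input method must produce."""
    name: str
    calories: int
    source: str  # "text", "voice", "barcode", or "camera"

def from_parsed_text(item: dict) -> FoodEntry:
    # Covers both typed and voice input: transcription feeds the same parser.
    return FoodEntry(item["name"], item["calories"], "text")

def from_barcode(product: dict) -> FoodEntry:
    return FoodEntry(product["product_name"], product["kcal"], "barcode")

entries = [
    from_parsed_text({"name": "oatmeal with honey", "calories": 320}),
    from_barcode({"product_name": "Almonds", "kcal": 579}),
]
print([e.name for e in entries])  # ['oatmeal with honey', 'Almonds']
```

Everything downstream — the daily log, totals, charts — only ever sees `FoodEntry`, which is what lets new input methods bolt on without touching the rest of the app.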