Why Your LLM API Returns Plausible Code, Not Correct Code

An LLM was asked to build a SQLite-compatible database engine. The code compiled. It passed all its tests. It read and wrote the correct file format. It looked, architecturally, like a database engine. But performance fell apart on contact with real workloads, because LLMs optimize for plausibility, not correctness.

If you're calling any LLM API — ModelsLab, OpenAI, or anything else — and asking it to write code without telling it what "working" means, you're likely shipping this exact pattern. This post explains why it happens and shows a concrete fix developers can apply today.

The Plausibility Gap

Language models are trained to produce outputs that look correct. They learn from vast amounts of code where the naming, structure, and architectural patterns of production software are well-represented. So they produce code that has production-style names, production-style module organization, and production-style comments.

What they don't optimize for is runtime behavior. A model can't run code. It can't execute a benchmark. It approximates what correct code should look like based on patterns in its training data — and that approximation is very good at fooling static analysis and even most tests, while hiding subtle bugs in logic, indexing, or algorithmic complexity that only surface under real conditions.

The SQLite example isn't a fluke. GitClear's analysis of 211 million lines of code found defect rates in AI-assisted codebases rising even as total output volume increased. METR's randomized study found similar patterns in agentic coding scenarios. The problem generalizes.

What "Acceptance Criteria First" Means in Practice

The fix is straightforward: tell the model what correct means before it writes the first line of code. Not in vague terms — in measurable terms.

Instead of: "Write a Rust function that does a primary key lookup in SQLite."

Use: "Write a Rust function that does a primary key lookup in SQLite. Acceptance criteria: single-row lookup must complete in under 1 ms on a 10,000-row table. Test with a compiled C benchmark using the same compiler flags as system SQLite, WAL mode, same schema."

The second prompt defines what success looks like. The model now has a target it can reason about when choosing between implementation approaches. It's the difference between asking a contractor to "build a wall" versus "build a wall rated for 40 mph wind load."

Acceptance criteria can be performance bounds, correctness assertions, security constraints, or API contracts. What matters is that they're specific and testable.

Applying This When Calling LLM APIs

If you're using the ModelsLab LLM API (or any OpenAI-compatible endpoint), the right place to put acceptance criteria is the system prompt. System prompt instructions shape how the model reasons about the entire task — putting criteria there means the model is thinking about them during generation, not just receiving them as an afterthought.

Here's the pattern for a code generation call with acceptance criteria:

import requests

API_URL = "https://modelslab.com/api/v6/llm/chat/completions"
API_KEY = "your_api_key_here"

def generate_code_with_criteria(
    task_description: str,
    acceptance_criteria: list[str],
    language: str = "Python",
) -> str:
    """
    Generate code that meets explicit acceptance criteria.

    Acceptance criteria go in the system prompt, not the user message.
    """
    criteria_block = "\n".join(f"- {c}" for c in acceptance_criteria)

    system_prompt = f"""You are a senior {language} engineer.

Your job is to write code that satisfies the following acceptance criteria. Do not submit code unless you are confident it meets every criterion. If a criterion conflicts with a common pattern, choose correctness over convention.

Acceptance Criteria:
{criteria_block}

For each function you write, add an inline comment explaining which criterion it satisfies."""

    payload = {
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": task_description},
        ],
        "temperature": 0.2,  # Lower temp = less creative, more reliable
        "max_tokens": 2048,
    }

    response = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json=payload,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# Example: database lookup function
code = generate_code_with_criteria(
    task_description="Write a function to look up a user by ID from a SQLite database.",
    acceptance_criteria=[
        "Single-row lookup by integer primary key completes in under 5 ms on a 100,000-row table",
        "Uses parameterized queries — no string interpolation",
        "Returns None if row not found, does not raise an exception",
        "Closes the cursor in a finally block regardless of success or failure",
        "Passes mypy strict type checking",
    ],
    language="Python",
)
print(code)

Two things to notice. First, temperature is set to 0.2: acceptance-criteria prompting works better at lower temperatures, because you want the model converging on a specific correct implementation, not exploring creative alternatives. Second, the system prompt explicitly instructs the model to prioritize correctness over convention, which directly addresses the plausibility gap.

Structuring Acceptance Criteria by Failure Mode

Not all acceptance criteria are equal. The most useful ones target the failure modes that LLMs are most likely to miss:

Performance Bounds

LLMs frequently choose correct-looking but algorithmically inefficient implementations — O(n²) where O(n log n) was required, missing indexes, loading entire result sets into memory. State the expected time complexity or measured latency target explicitly.

acceptance_criteria=[
    "Lookup by indexed column completes in O(log n) — verify with EXPLAIN QUERY PLAN output",
    "No full-table scans in the hot path",
]
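The EXPLAIN QUERY PLAN criterion above is mechanically checkable before any latency measurement. A minimal sketch using Python's standard sqlite3 module (the users table and index names are hypothetical):

```python
import sqlite3

def uses_index(conn: sqlite3.Connection, query: str, params: tuple) -> bool:
    """True if SQLite's planner searches an index rather than scanning the table."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {query}", params).fetchall()
    plan = " ".join(row[-1] for row in rows)  # the human-readable detail is the last column
    return "SEARCH" in plan and "SCAN" not in plan

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# Indexed equality lookup: the planner reports SEARCH ... USING INDEX
assert uses_index(conn, "SELECT * FROM users WHERE email = ?", ("a@b.com",))
# A leading-wildcard LIKE cannot use the index: the planner reports SCAN
assert not uses_index(conn, "SELECT * FROM users WHERE email LIKE ?", ("%@b.com",))
```

The same check can run as part of your test suite against the model's generated queries, turning the "no full-table scans" criterion into a hard gate.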

Edge Case Handling

Models learn from examples where the happy path is well-represented. Edge case handling is underrepresented in training data. Name the edge cases explicitly:

acceptance_criteria=[
    "Returns empty list, not None, when query produces zero results",
    "Handles unicode characters in text fields without encoding errors",
    "Behaves correctly when called concurrently from 10 threads",
]
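Criteria like these translate directly into a harness that executes the model's output. A sketch, using a hand-written string as a stand-in for the API response (the find_emails function and the schema are hypothetical):

```python
import sqlite3

# Stand-in for generated code; in practice this string comes back from the API.
generated = '''
def find_emails(conn, domain):
    rows = conn.execute(
        "SELECT email FROM users WHERE email LIKE ?", ("%@" + domain,)
    ).fetchall()
    return [r[0] for r in rows]
'''

namespace: dict = {}
exec(generated, namespace)  # load the generated function so we can test it
find_emails = namespace["find_emails"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.execute("INSERT INTO users VALUES ('héllo@example.com')")

assert find_emails(conn, "nosuch.com") == []  # empty list, not None
assert find_emails(conn, "example.com") == ["héllo@example.com"]  # unicode survives
```

Only run generated code this way in a sandboxed environment; exec on untrusted output is itself a security decision.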

Security Constraints

SQL injection and similar vulnerabilities exist in LLM-generated code at measurable rates. Name the constraint:

acceptance_criteria=[
    "All user-supplied values go through parameterized queries — no f-strings or .format() in SQL",
    "No credentials, tokens, or keys appear in log output",
]
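The f-string and .format() bans can also be checked statically, before the code ever runs. A rough first-pass lint using Python's ast module (it only inspects calls named execute, so treat it as a heuristic, not a proof):

```python
import ast

def has_sql_interpolation(source: str) -> bool:
    """Flag any .execute() call whose SQL argument is an f-string or .format() call."""
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "execute"
                and node.args):
            sql_arg = node.args[0]
            if isinstance(sql_arg, ast.JoinedStr):  # f-string literal
                return True
            if (isinstance(sql_arg, ast.Call)
                    and isinstance(sql_arg.func, ast.Attribute)
                    and sql_arg.func.attr == "format"):
                return True
    return False

assert has_sql_interpolation('cur.execute(f"SELECT * FROM t WHERE id = {x}")')
assert not has_sql_interpolation('cur.execute("SELECT * FROM t WHERE id = ?", (x,))')
```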

Interface Contracts

LLMs frequently change return types, exception behavior, or argument shapes between generations of the same function. Lock the interface:

acceptance_criteria=[
    "Function signature: get_user(conn: sqlite3.Connection, user_id: int) -> Optional[User]",
    "Raises ValueError for user_id <= 0, not a database error",
]
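As a reference point for review, here is one hand-written implementation that satisfies that contract (the users schema and the User dataclass are assumptions for illustration):

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    id: int
    name: str

def get_user(conn: sqlite3.Connection, user_id: int) -> Optional[User]:
    # Criterion: ValueError for user_id <= 0, raised before any database work
    if user_id <= 0:
        raise ValueError("user_id must be positive")
    cur = conn.cursor()
    try:
        row = cur.execute(
            "SELECT id, name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
    finally:
        cur.close()  # cursor closed on every path
    return User(*row) if row else None  # criterion: None, not an exception

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")
assert get_user(conn, 1) == User(1, "Ada")
assert get_user(conn, 2) is None
```

Comparing the model's output against a known-good shape like this makes interface drift obvious in review.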

What This Doesn't Fix

Acceptance criteria reduce the plausibility gap — they don't eliminate it. A model can satisfy every written criterion and still have bugs in code paths the criteria didn't cover. The right mental model is that acceptance criteria shift the failure surface from "hidden until production" to "visible during review and testing."

This is the same discipline that makes test-driven development valuable: the tests don't guarantee correctness, but they make failures explicit. Acceptance criteria in your LLM API prompts work the same way — they move the failure point earlier and make it inspectable.

For code that matters — anything in a hot path, anything touching user data, anything security-sensitive — criteria prompting is a starting point, not a substitute for code review.

Putting It Together

The practical workflow when using an LLM API for code generation:

  1. Write the acceptance criteria before you write the prompt. If you can't articulate what correct means, the model definitely can't figure it out.
  2. Put criteria in the system prompt, not the user message. System prompt instructions have higher weight during generation.
  3. Use lower temperature (0.1–0.3) for code generation tasks where you want consistency over creativity.
  4. Test the actual output against the criteria. Don't assume the model satisfied them because it claimed to.
  5. Iterate on criteria, not just prompts. If the same type of bug appears twice, add a criterion that rules it out.
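Steps 4 and 5 can be wired together into a loop. A sketch in which generate and run_checks are placeholders for your API wrapper and your own test harness:

```python
from typing import Callable

def generate_until_pass(
    task: str,
    criteria: list[str],
    generate: Callable[[str, list[str]], str],  # e.g. an API wrapper like the one above
    run_checks: Callable[[str], list[str]],     # returns descriptions of failed checks
    max_attempts: int = 3,
) -> str:
    failures: list[str] = []
    for _ in range(max_attempts):
        code = generate(task, criteria)
        failures = run_checks(code)  # step 4: test the actual output
        if not failures:
            return code
        # Step 5: each observed failure becomes a criterion on the next attempt
        criteria = criteria + [f"Must pass check: {f}" for f in failures]
    raise RuntimeError(f"criteria still unmet after {max_attempts} attempts: {failures}")

# Demo with a fake generator that succeeds once the failure is fed back
attempts: list[list[str]] = []
def fake_generate(task: str, criteria: list[str]) -> str:
    attempts.append(list(criteria))
    return "fixed" if len(attempts) > 1 else "buggy"

result = generate_until_pass(
    "task", ["c1"], fake_generate,
    lambda code: [] if code == "fixed" else ["empty-result edge case"],
)
assert result == "fixed"
assert "Must pass check: empty-result edge case" in attempts[1]
```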

The ModelsLab API follows the standard OpenAI chat completions schema, so this pattern works with any model on the platform — including Llama 3.3 70B, Qwen 2.5 Coder, and Mistral variants. You can swap the model field and test which performs best against your specific acceptance criteria without changing anything else in your integration.
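Swapping models can be as small as parameterizing the model field. A sketch; only the Llama identifier appears earlier in this post, and the other two identifiers are hypothetical placeholders:

```python
# Compare candidate models against the same criteria by changing one field.
CANDIDATES = [
    "meta-llama/Llama-3.3-70B-Instruct",
    "Qwen/Qwen2.5-Coder-32B-Instruct",   # hypothetical identifier
    "mistralai/Mistral-Small-Instruct",  # hypothetical identifier
]

def build_payload(model: str, system_prompt: str, task: str) -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": task},
        ],
        "temperature": 0.2,
    }

payloads = [
    build_payload(m, "You are a senior Python engineer...", "Write get_user")
    for m in CANDIDATES
]
assert [p["model"] for p in payloads] == CANDIDATES
```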

LLMs are genuinely useful for code generation. The developers getting the most out of them are treating acceptance criteria the same way they treat unit tests: not optional, written before the code, and treated as a first-class artifact. The ones struggling are handing the model a vague task and hoping it understands what "working" means.

It doesn't. Define it yourself.

Ready to integrate ModelsLab's LLM API into your development workflow? Check out the model catalog and start building with any of the available models using PAYG credits — no subscription required.
