Rubric-Based LLM API Evals (2026): How to Build Your Own After LifeSciBench

On June 17, 2026, OpenAI released LifeSciBench, a benchmark built with 173 working scientists to test whether AI can actually help with real life-science research. The headline numbers are wild: 750 expert-authored tasks, 1,062 attached artifacts, and 19,020 rubric criteria. That last one is the part developers should care about, even if you never touch biology.

Here's the quote that matters: LifeSciBench grades each task against an average of about 25 criteria, because "many life science tasks cannot be graded by checking the final answer alone." A response can land the right conclusion and still be wrong if it skips a key caveat. It can also be partly right with strong reasoning even when it misses the full solution.

Swap "life science" for whatever you actually build, and the lesson holds. If you're still evaluating your LLM endpoints with exact-match or a single thumbs-up, you're flying blind. Let's fix that.

Why Final-Answer Scoring Lies to You

Most teams' "evals" look like this: send a prompt, compare the output to a golden string, mark pass or fail. That works for trivia. It falls apart the moment your task has more than one correct phrasing, requires multiple steps, or needs the model to flag a risk it wasn't explicitly asked about.

LifeSciBench's own stats make the point. 79% of its tasks need multiple reasoning steps, averaging four steps each. Over half require reading an attached figure, table, or PDF. You cannot grade that with a regex. You grade it the way a senior reviewer would: with a checklist of what a good answer must contain.

The core idea to steal: turn each test case into a list of small, checkable claims. Score the response by how many it satisfies. This gives you partial credit, surfaces specific failure modes, and makes regressions obvious when you swap models or tweak a prompt.
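
To make the partial-credit arithmetic concrete, here is a minimal sketch. The criterion names, point values, and verdicts are hypothetical, chosen only for illustration; in a real harness the verdicts would come from a grader rather than a hard-coded dict.

```python
# Hypothetical checklist: criterion id -> points it is worth.
criteria_points = {"states_window": 2, "offers_refund": 2, "asks_for_photo": 1}

# Hypothetical verdicts for one response (normally produced by a grader).
verdicts = {"states_window": True, "offers_refund": True, "asks_for_photo": False}


def exact_match(response: str, golden: str) -> bool:
    """Final-answer scoring: all or nothing."""
    return response.strip() == golden.strip()


def rubric_score(points: dict[str, int], met: dict[str, bool]) -> float:
    """Fraction of available points earned across all criteria."""
    total = sum(points.values())
    earned = sum(p for cid, p in points.items() if met.get(cid, False))
    return earned / total if total else 0.0


golden = "You are within the 30-day window, so we will refund you in full."
response = "Since you bought this 14 days ago, you qualify for a full refund."

print(exact_match(response, golden))            # False: acceptable answer, different wording
print(rubric_score(criteria_points, verdicts))  # 0.8: 4 of 5 points earned
```

The exact-match check throws away a usable answer, while the rubric score records both that the response was mostly right and which criterion it missed.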

Anatomy of a Rubric Task

A rubric task has three parts: the input, the expected criteria, and the points. Keep it boring and explicit. Here's a JSON shape that works for almost any domain:

{
  "id": "refund-policy-001",
  "input": "Customer bought 14 days ago, item damaged in transit, wants full refund. Draft the reply.",
  "criteria": [
    {"id": "c1", "desc": "Confirms the 30-day return window applies", "points": 2},
    {"id": "c2", "desc": "Offers full refund OR replacement, not store credit only", "points": 2},
    {"id": "c3", "desc": "Asks for a photo of the damage before processing", "points": 1},
    {"id": "c4", "desc": "Does NOT promise a refund timeline faster than 5-7 days", "points": 1},
    {"id": "c5", "desc": "Tone stays apologetic and professional", "points": 1}
  ]
}

Notice c4 is a negative criterion. Those are gold. Some of the worst model failures aren't about missing information; they're about confidently adding something wrong. Rubrics catch that; string matching never will.
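
Because every score depends on the rubric file being well formed, it can help to validate tasks before spending any judge calls. The sketch below assumes the JSON shape shown above; `validate_task` is a hypothetical helper, and the specific checks (unique ids, non-empty descriptions, positive integer points) are suggestions rather than a fixed schema.

```python
def validate_task(task: dict) -> int:
    """Check a rubric task's shape and return its maximum score in points.

    Raises ValueError on the first problem found.
    """
    for key in ("id", "input", "criteria"):
        if key not in task:
            raise ValueError(f"task is missing '{key}'")
    if not task["criteria"]:
        raise ValueError(f"task {task['id']} has no criteria")

    seen = set()
    for c in task["criteria"]:
        for key in ("id", "desc", "points"):
            if key not in c:
                raise ValueError(f"a criterion in {task['id']} is missing '{key}'")
        if c["id"] in seen:
            raise ValueError(f"duplicate criterion id: {c['id']}")
        seen.add(c["id"])
        if not c["desc"].strip():
            raise ValueError(f"criterion {c['id']} has an empty description")
        if not isinstance(c["points"], int) or c["points"] <= 0:
            raise ValueError(f"criterion {c['id']} needs a positive integer 'points'")

    return sum(c["points"] for c in task["criteria"])


# Two criteria taken from the refund example above.
example = {
    "id": "refund-policy-001",
    "input": "Customer bought 14 days ago, item damaged in transit, wants full refund. Draft the reply.",
    "criteria": [
        {"id": "c1", "desc": "Confirms the 30-day return window applies", "points": 2},
        {"id": "c4", "desc": "Does NOT promise a refund timeline faster than 5-7 days", "points": 1},
    ],
}
print(validate_task(example))  # 3
```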

Grading with LLM-as-Judge (Python)

You don't hand-grade hundreds of runs. You use a second model as the judge, one criterion at a time, with a strict yes/no contract. Pin the judge model and keep temperature low so grades are stable across runs.

import json
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["KISSAPI_KEY"],
    base_url="https://api.kissapi.ai/v1",
)

JUDGE_MODEL = "claude-sonnet-4-6"   # pin this, don't let it drift

def grade_criterion(response_text, criterion):
    judge_prompt = f"""You are a strict grader. Answer ONLY with JSON.
Criterion: {criterion['desc']}
Response to grade:
---
{response_text}
---
Return {{"met": true|false, "why": "one short sentence"}}"""

    out = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return json.loads(out.choices[0].message.content)

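def score_task(task, response_text):
    """Grade one response against every criterion; return a points-weighted score."""
    earned, total, detail = 0, 0, []
    for c in task["criteria"]:
        total += c["points"]
        verdict = grade_criterion(response_text, c)
        if verdict["met"]:
            earned += c["points"]
        # Keep the judge's reason so failures can be audited later.
        detail.append({"id": c["id"], "met": verdict["met"], "why": verdict["why"]})
    # Guard against an empty criteria list so we never divide by zero.
    score = earned / total if total else 0.0
    return {"score": score, "earned": earned, "total": total, "detail": detail}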
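
One fragile spot in a harness like this is parsing the judge's reply. A model asked for JSON will sometimes wrap it in a Markdown code fence or answer in plain prose, and a bare `json.loads` then raises. The helper below is a defensive sketch, not part of any library: `parse_verdict`, `strip_fence`, and `FENCE` are hypothetical names, and treating an unparseable reply as "not met" is an assumption you may prefer to replace with a retry.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def strip_fence(text: str) -> str:
    """Remove one surrounding Markdown code fence, if the reply has one."""
    text = text.strip()
    if len(text) >= 2 * len(FENCE) and text.startswith(FENCE) and text.endswith(FENCE):
        inner = text[len(FENCE):-len(FENCE)]
        # Drop an optional language tag such as "json" on the opening line.
        first_line, _, rest = inner.partition("\n")
        if first_line.strip() == "" or first_line.strip().isalpha():
            inner = rest
        return inner.strip()
    return text


def parse_verdict(raw: str) -> dict:
    """Parse a judge reply into {"met": bool, "why": str}.

    Any reply that is not valid JSON with a boolean "met" is scored as
    not met, so one malformed reply cannot crash a whole run.
    """
    try:
        data = json.loads(strip_fence(raw))
    except json.JSONDecodeError:
        return {"met": False, "why": "unparseable judge output"}
    if not isinstance(data, dict) or not isinstance(data.get("met"), bool):
        return {"met": False, "why": "judge output missing boolean 'met'"}
    return {"met": data["met"], "why": str(data.get("why", ""))}


fenced_reply = FENCE + 'json\n{"met": false, "why": "no photo request"}\n' + FENCE
print(parse_verdict(fenced_reply))        # {'met': False, 'why': 'no photo request'}
print(parse_verdict("Sure! It is met."))  # {'met': False, 'why': 'unparseable judge output'}
```

Inside `grade_criterion`, a helper like this would take the place of the direct `json.loads` call on the judge's message content. If you adopt the "not met" fallback, count how often it fires; a high rate means the judge prompt needs work, not the candidate.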

One judge call per criterion sounds expensive, and it adds up if you're careless. Two things keep the bill sane: a cheaper model for the judge role, and routing both the candidate and judge traffic through one endpoint so you can watch spend in a single place. I run my candidate models and the judge through KissAPI for exactly that reason. One key, every model, one usage dashboard.
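
A quick way to see how fast judge traffic grows is to count the calls before running anything. The function below is only arithmetic under the one-call-per-criterion design described above; `judge_calls` is a hypothetical helper and the 20-task, 5-criterion, 3-model figures are example inputs, not measurements.

```python
def judge_calls(num_tasks: int, criteria_per_task: float, num_candidates: int) -> int:
    """Judge requests needed to grade every candidate on every criterion once."""
    return round(num_tasks * criteria_per_task * num_candidates)


# A small in-house suite: 20 tasks, 5 criteria each, 3 candidate models.
print(judge_calls(20, 5, 3))  # 300

# LifeSciBench scale for a single model: 750 tasks, 19,020 criteria in total.
print(judge_calls(750, 19020 / 750, 1))  # 19020
```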

Running the Candidate (Node.js)

The other half is generating the responses you're grading. Keep this loop dead simple so the eval, not the harness, is what you're testing.

import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI({
  apiKey: process.env.KISSAPI_KEY,
  baseURL: "https://api.kissapi.ai/v1",
});

const tasks = JSON.parse(fs.readFileSync("tasks.json", "utf8"));
const CANDIDATE = "gpt-5";

async function run() {
  const results = [];
  for (const task of tasks) {
    const out = await client.chat.completions.create({
      model: CANDIDATE,
      temperature: 0.2,
      messages: [{ role: "user", content: task.input }],
    });
    results.push({ id: task.id, response: out.choices[0].message.content });
  }
  fs.writeFileSync("responses.json", JSON.stringify(results, null, 2));
}

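// Surface API or file errors and exit non-zero so CI notices a failed run.
run().catch((err) => {
  console.error(err);
  process.exit(1);
});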

Generate once, grade once, store both. When you change a prompt or test a new model, re-run and diff the scores. That diff is the entire point.
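
A diff between two graded runs can be sketched in a few lines. This assumes each run is a dict keyed by task id whose values have the same shape that `score_task` returns (`score` plus a `detail` list of per-criterion verdicts); `diff_runs` and the sample data are hypothetical.

```python
def diff_runs(before: dict, after: dict) -> list[dict]:
    """List tasks whose score or per-criterion verdicts changed between runs."""
    changes = []
    for task_id in sorted(before.keys() & after.keys()):
        old_met = {d["id"]: d["met"] for d in before[task_id]["detail"]}
        new_met = {d["id"]: d["met"] for d in after[task_id]["detail"]}
        # Criteria that used to pass and now fail, and the reverse.
        regressed = sorted(c for c in old_met if old_met[c] and new_met.get(c) is False)
        fixed = sorted(c for c in old_met if not old_met[c] and new_met.get(c) is True)
        delta = round(after[task_id]["score"] - before[task_id]["score"], 3)
        if delta or regressed or fixed:
            changes.append(
                {"task": task_id, "delta": delta, "regressed": regressed, "fixed": fixed}
            )
    return changes


# Hypothetical results for one task with two 1-point criteria.
before = {"refund-policy-001": {"score": 1.0, "detail": [
    {"id": "c3", "met": True}, {"id": "c4", "met": True}]}}
after = {"refund-policy-001": {"score": 0.5, "detail": [
    {"id": "c3", "met": True}, {"id": "c4", "met": False}]}}

print(diff_runs(before, after))
# [{'task': 'refund-policy-001', 'delta': -0.5, 'regressed': ['c4'], 'fixed': []}]
```

Reporting which criteria flipped, not just the score delta, is what turns a regression into something you can act on.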

Comparing Models the Right Way

Once you have per-criterion scores, model comparison stops being vibes. You get a table like this, and the gaps tell you exactly where each model breaks:

Model           | Overall | Negative criteria (no hallucinated facts) | Multi-step tasks
Model A         | 0.88    | 0.71                                      | 0.79
Model B         | 0.84    | 0.93                                      | 0.81
Model C (cheap) | 0.79    | 0.69                                      | 0.62

Model A has the best overall score but trails Model B badly on the negative criteria (0.71 vs. 0.93), which means it adds unsupported claims more often. If your product can't tolerate confident wrong additions, Model B is the safer pick despite the lower headline number. You'd never see that with a single aggregate score. This is the same reason LifeSciBench reports by workflow and domain instead of one number.
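
Columns like these fall out of the per-criterion verdicts once each criterion carries a category label. The sketch below assumes a hypothetical `tags` field on every graded criterion (the rubric JSON earlier has no such field, so you would need to add it); `score_by_tag` and the sample data are illustrative only.

```python
from collections import defaultdict


def score_by_tag(graded: list[dict]) -> dict[str, float]:
    """Points-weighted pass rate overall and per tag.

    Each item looks like {"points": int, "met": bool, "tags": [str, ...]}.
    """
    earned = defaultdict(int)
    total = defaultdict(int)
    for g in graded:
        # Every criterion counts toward "overall" plus each of its own tags.
        for tag in ["overall", *g.get("tags", [])]:
            total[tag] += g["points"]
            if g["met"]:
                earned[tag] += g["points"]
    return {tag: round(earned[tag] / total[tag], 2) for tag in total}


# Hypothetical graded criteria for one model.
graded = [
    {"points": 2, "met": True, "tags": ["multi_step"]},
    {"points": 2, "met": True, "tags": []},
    {"points": 1, "met": False, "tags": ["negative"]},
    {"points": 1, "met": True, "tags": ["negative"]},
]
print(score_by_tag(graded))
# {'overall': 0.83, 'multi_step': 1.0, 'negative': 0.5}
```

Run it once per model and you have one table row per model, with the weak category visible instead of averaged away.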

Five Rules That Keep Evals Honest

  1. Pull tasks from real logs. Synthetic test cases miss the weird inputs your users actually send. 30 real ones beat 300 invented ones.
  2. Write negative criteria. "Does NOT invent a policy" catches the failures that hurt most in production.
  3. Pin the judge model and set temperature to 0. A drifting judge means your scores aren't comparable week to week.
  4. Spot-check the judge. Hand-grade 10% of the verdicts and compare them with the LLM judge's. If the two disagree often, the problem is usually vague criteria, not the model.
  5. Version your rubric. Tag it v1, v2. When scores jump, you want to know if the model changed or the rubric did.
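
Rule 4 is easy to automate once the hand grades are written down. The sketch below assumes both sets of verdicts are stored as dicts keyed by `(task_id, criterion_id)`; `judge_agreement` and the sample labels are hypothetical, and raw agreement is only the simplest possible statistic.

```python
def judge_agreement(human: dict, judge: dict) -> tuple[float, list]:
    """Return the agreement rate and the verdicts where human and judge differ.

    Both inputs map (task_id, criterion_id) -> bool. Only keys present in
    both are compared, so a 10% hand-graded sample works as-is.
    """
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no overlapping (task_id, criterion_id) pairs to compare")
    disagreements = [key for key in shared if human[key] != judge[key]]
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements


# Hypothetical sample: four hand-graded verdicts against the judge's.
human = {("t1", "c1"): True, ("t1", "c4"): False, ("t2", "c1"): True, ("t2", "c2"): True}
judge = {("t1", "c1"): True, ("t1", "c4"): True, ("t2", "c1"): True, ("t2", "c2"): True}

rate, disagreements = judge_agreement(human, judge)
print(rate)           # 0.75
print(disagreements)  # [('t1', 'c4')]
```

The list of disagreements is the useful output: reread each of those criteria and tighten its wording before blaming either the judge or the candidate.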

The Takeaway

LifeSciBench is a research benchmark you'll probably never run. But the method underneath it (grade the reasoning and the caveats, not just the final answer) is the single biggest upgrade most teams can make to their eval setup in 2026. It's not expensive and it's not hard. It's a JSON file of criteria and a strict judge.

Start with 20 tasks from last week's logs. Write the rubrics. Run your two or three candidate models through one endpoint. You'll learn more about your models in an afternoon than in a month of staring at sample outputs.

FAQ

What is a rubric-based LLM eval?

A rubric-based eval scores a model response against a checklist of specific criteria instead of one pass/fail final answer. Each criterion targets one claim, step, calculation, or caveat the response should include. LifeSciBench uses an average of about 25 criteria per task, which captures partial credit and reasoning quality that final-answer matching misses.

Do I need a separate model to grade my evals?

Often yes. A common pattern is LLM-as-judge: a separate model checks each rubric criterion and returns a yes/no with a short justification. Keep the judge prompt strict and deterministic (low temperature), pin the judge model, and spot-check a sample of its grades against human review so you trust the score.

How many eval tasks do I need to start?

Start small. 20 to 50 real tasks pulled from your production logs beat 1,000 synthetic ones. Write rubrics for each, run them on every model and prompt change, and grow the set as you find failure modes. Quality and realism of tasks matter far more than raw count.