Retrieval Evaluation¶

GitHelp includes a small retrieval evaluation workflow for checking whether retrieval changes improve source selection before looking at generated answers.

Question Set¶

Benchmark questions are stored in:

tests/evaluation/githelp_eval_questions.txt

Expected-source examples are stored in:

tests/evaluation/githelp_eval_expected_sources.example.json

The expected-source file lets GitHelp report pass/fail checks for whether a known relevant source appears in the retrieved top-k results.

Run Evaluation¶

Simple backend:

python scripts/evaluate_retrieval.py \
  --questions-path tests/evaluation/githelp_eval_questions.txt \
  --expected-sources-path tests/evaluation/githelp_eval_expected_sources.example.json \
  --corpus-path data/projects/mmore/corpus.jsonl \
  --backend simple \
  --top-k 5

MMORE backend:

python scripts/evaluate_retrieval.py \
  --questions-path tests/evaluation/githelp_eval_questions.txt \
  --expected-sources-path tests/evaluation/githelp_eval_expected_sources.example.json \
  --corpus-path data/projects/mmore/corpus.jsonl \
  --backend mmore \
  --top-k 5

What To Inspect¶

For each question, inspect:

whether the expected source appears in top-k;
whether the top result is specific enough for answer generation;
whether source types are balanced between docs, code, config, and repository structure;
whether the selected raw backend returns the expected evidence.

The evaluation script calls the selected retrieval backend directly. It does not run the full answering pipeline, so it does not apply project-profile query expansion, filtering, reranking, filename boosts, or direct answers. The compact output also does not currently expose whether mmore used native_index or corpus_fallback; inspect Streamlit diagnostics or the retrieved record metadata when that distinction matters.

This evaluation is intentionally lightweight, but it makes retrieval tuning more repeatable than manually asking a few questions in Streamlit.

The repository currently contains ten MMORE questions and expected-source annotations for a small subset. This is a preliminary regression check, not a complete benchmark of recall, ranking quality, or answer faithfulness.