Retrieval Evaluation
GitHelp includes a small retrieval evaluation workflow for checking whether retrieval changes improve source selection before looking at generated answers.
Question Set
Benchmark questions are stored in:
tests/evaluation/githelp_eval_questions.txt
Expected-source examples are stored in:
tests/evaluation/githelp_eval_expected_sources.example.json
The expected-source file lets GitHelp report a pass/fail check for each annotated question: whether a known relevant source appears in the retrieved top-k results.
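The top-k check described above can be sketched roughly as follows. This is an illustration only, not the logic in scripts/evaluate_retrieval.py: the helper name `expected_source_in_top_k` is hypothetical, and it assumes that retrieved results and expected sources can both be reduced to plain source-path strings.

```python
from typing import Sequence


def expected_source_in_top_k(
    retrieved_sources: Sequence[str],
    expected_sources: Sequence[str],
    top_k: int = 5,
) -> bool:
    """Return True if any expected source is among the first top_k retrieved sources.

    Hypothetical helper: the real script may match sources differently
    (for example by chunk id or by path prefix).
    """
    top = set(retrieved_sources[:top_k])
    return any(source in top for source in expected_sources)


# Illustrative paths only; they are not taken from the repository.
retrieved = ["docs/install.md", "src/index.py", "README.md"]
print(expected_source_in_top_k(retrieved, ["README.md"], top_k=3))  # True
print(expected_source_in_top_k(retrieved, ["README.md"], top_k=2))  # False
```

A question passes when at least one expected source is found within the cutoff, so lowering `top_k` makes the check stricter.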
Run Evaluation
Simple backend:
python scripts/evaluate_retrieval.py \
--questions-path tests/evaluation/githelp_eval_questions.txt \
--expected-sources-path tests/evaluation/githelp_eval_expected_sources.example.json \
--corpus-path data/projects/mmore/corpus.jsonl \
--backend simple \
--top-k 5
MMORE backend:
python scripts/evaluate_retrieval.py \
--questions-path tests/evaluation/githelp_eval_questions.txt \
--expected-sources-path tests/evaluation/githelp_eval_expected_sources.example.json \
--corpus-path data/projects/mmore/corpus.jsonl \
--backend mmore \
--top-k 5
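To compare both backends in one pass, the two commands above can be wrapped in a small driver. The sketch below is an assumption about how one might do this, not part of GitHelp: the function names `build_eval_command` and `run_both_backends` are hypothetical, and only the flags and paths shown above are used.

```python
import subprocess
import sys
from typing import List


def build_eval_command(backend: str, top_k: int = 5) -> List[str]:
    """Build the argument list for one evaluation run (flags copied from the commands above)."""
    return [
        sys.executable,
        "scripts/evaluate_retrieval.py",
        "--questions-path", "tests/evaluation/githelp_eval_questions.txt",
        "--expected-sources-path",
        "tests/evaluation/githelp_eval_expected_sources.example.json",
        "--corpus-path", "data/projects/mmore/corpus.jsonl",
        "--backend", backend,
        "--top-k", str(top_k),
    ]


def run_both_backends(top_k: int = 5) -> None:
    """Run the evaluation once per backend; must be called from the repository root."""
    for backend in ("simple", "mmore"):
        print(f"=== backend: {backend} ===")
        # check=True raises CalledProcessError if the script exits with a non-zero status.
        subprocess.run(build_eval_command(backend, top_k), check=True)
```

Calling `run_both_backends()` from the repository root prints the output of each run under its own header, which makes side-by-side comparison easier.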
What To Inspect
For each question, inspect:
whether the expected source appears in the top-k results;
whether the top result is specific enough for answer generation;
whether source types are balanced between docs, code, config, and repository structure;
whether the selected raw backend returns the expected evidence.
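The source-type balance check in the list above can be made less subjective by counting types in the retrieved records. The sketch below assumes each retrieved record is a mapping with a `source_type` field; that field name and the helper `source_type_counts` are hypothetical and may not match the record metadata GitHelp actually produces.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def source_type_counts(records: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count retrieved records per source type.

    Assumes a hypothetical "source_type" key; records without it are
    counted under "unknown".
    """
    return dict(Counter(record.get("source_type", "unknown") for record in records))


# Illustrative records only; they are not real retrieval output.
top_results = [
    {"source": "docs/install.md", "source_type": "docs"},
    {"source": "src/index.py", "source_type": "code"},
    {"source": "docs/usage.md", "source_type": "docs"},
    {"source": "pyproject.toml"},
]
print(source_type_counts(top_results))
```

A result set dominated by a single type for a question that spans several (for example, a configuration question answered only from docs) is a hint that retrieval needs tuning.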
The evaluation script calls the selected retrieval backend directly. It does
not run the full answering pipeline, so it does not apply project-profile query
expansion, filtering, reranking, filename boosts, or direct answers. The compact
output also does not currently expose whether mmore used native_index or
corpus_fallback; inspect Streamlit diagnostics or the retrieved record metadata
when that distinction matters.
This evaluation is intentionally lightweight, but it makes retrieval tuning more repeatable than manually asking a few questions in Streamlit.
The repository currently contains ten MMORE questions and expected-source annotations for a small subset. This is a preliminary regression check, not a complete benchmark of recall, ranking quality, or answer faithfulness.