# Retrieval Evaluation GitHelp includes a small retrieval evaluation workflow for checking whether retrieval changes improve source selection before looking at generated answers. ## Question Set Benchmark questions are stored in: ```text tests/evaluation/githelp_eval_questions.txt ``` Expected-source examples are stored in: ```text tests/evaluation/githelp_eval_expected_sources.example.json ``` The expected-source file lets GitHelp report pass/fail checks for whether a known relevant source appears in the retrieved top-k results. ## Run Evaluation Simple backend: ```bash python scripts/evaluate_retrieval.py \ --questions-path tests/evaluation/githelp_eval_questions.txt \ --expected-sources-path tests/evaluation/githelp_eval_expected_sources.example.json \ --corpus-path data/projects/mmore/corpus.jsonl \ --backend simple \ --top-k 5 ``` MMORE backend: ```bash python scripts/evaluate_retrieval.py \ --questions-path tests/evaluation/githelp_eval_questions.txt \ --expected-sources-path tests/evaluation/githelp_eval_expected_sources.example.json \ --corpus-path data/projects/mmore/corpus.jsonl \ --backend mmore \ --top-k 5 ``` ## What To Inspect For each question, inspect: - whether the expected source appears in top-k; - whether the top result is specific enough for answer generation; - whether source types are balanced between docs, code, config, and repository structure; - whether the selected raw backend returns the expected evidence. The evaluation script calls the selected retrieval backend directly. It does not run the full answering pipeline, so it does not apply project-profile query expansion, filtering, reranking, filename boosts, or direct answers. The compact output also does not currently expose whether `mmore` used `native_index` or `corpus_fallback`; inspect Streamlit diagnostics or the retrieved record metadata when that distinction matters. This evaluation is intentionally lightweight, but it makes retrieval tuning more repeatable than manually asking a few questions in Streamlit. The repository currently contains ten MMORE questions and expected-source annotations for a small subset. This is a preliminary regression check, not a complete benchmark of recall, ranking quality, or answer faithfulness.