Retrieval

Retrieval is the step that finds relevant documents for a user question.

GitHelp currently supports two retrieval backends:

simple
mmore

Shared result format

File:

src/githelp/retrieval/base.py

The shared result type is:

RetrievalResult

It contains:

  • the retrieved DocumentRecord;

  • a retrieval score.

Simple retriever

File:

src/githelp/retrieval/simple_retriever.py

The simple retriever is a local debugging and dynamic-project backend.

It:

  • reads a selected corpus.jsonl;

  • tokenizes the query and documents;

  • ranks documents with overlap-based heuristics;

  • adds boosts for exact symbols, signatures, modules, titles, and user-facing documentation.

It does not use embeddings or MMORE.

Command:

python scripts/debug_retrieval.py \
  "How do I configure indexing?" \
  --corpus-path data/projects/mmore/corpus.jsonl

MMORE retriever

File:

src/githelp/retrieval/mmore_retriever.py
src/githelp/retrieval/mmore_native.py
src/githelp/retrieval/mmore_subprocess.py
src/githelp/retrieval/mmore_corpus.py
src/githelp/retrieval/mmore_result_mapping.py

The MMORE retriever:

  • runs native MMORE retrieval in an isolated subprocess;

  • loads a MMORE retriever from config inside that subprocess;

  • calls retriever.retrieve(...) when native retrieval is available;

  • parses GitHelp metadata from retrieved text;

  • converts raw MMORE results back into RetrievalResult objects.

  • falls back to lexical ranking over mmore_corpus.jsonl if the native process fails locally.

Retrieved sources are tagged with one of these metadata values:

native_index
corpus_fallback

Command example:

python scripts/prepare_answer.py \
  "How do I configure indexing?" \
  --backend mmore

Retriever factory

File:

src/githelp/retrieval/retriever_factory.py

This file gives the rest of GitHelp one entry point:

retrieve_documents(...)

It chooses the backend based on:

simple
mmore

Project profiles

Retrieval results can be refined by a project profile.

Project profiles live in:

src/githelp/project_profiles/

They can:

  • expand queries;

  • filter irrelevant sources;

  • rerank retrieved results;

  • answer some structured questions directly.

This keeps the core GitHelp retrieval pipeline generic while allowing MMORE-specific improvements to remain isolated.

Backend choice

For a project corpus built from the Streamlit interface, simple is useful for quick deterministic checks:

backend simple

For the main MMORE workflow, use:

backend mmore

The mmore backend attempts native MMORE index retrieval first. If that native process fails, GitHelp falls back to the exported mmore_corpus.jsonl next to the selected project corpus so Streamlit can continue answering. Results tagged corpus_fallback did not come from MMORE/Milvus vector search.

The high-level answering pipeline retrieves a wider candidate pool before profile filtering and reranking. For code-, symbol-, or filename-oriented questions, it may merge simple lexical candidates into the MMORE candidate pool. The final source list can therefore contain evidence rescued by the simple retriever even when backend mmore was selected.

The default app config selects the mmore project profile globally. Generated project configs do not change this value automatically; select a generic or custom app config when querying another project.