Indexing
GitHelp has its own corpus format, but MMORE expects a different JSONL format.
The indexing layer bridges the two.
Relevant files
src/githelp/indexing/mmore_format.py
src/githelp/indexing/mmore_indexer.py
scripts/export_mmore_corpus.py
scripts/build_index.py
Step 1: export to MMORE format
Default command:
python scripts/export_mmore_corpus.py
Default input:
data/processed/corpus.jsonl
Default output:
data/processed/mmore_corpus.jsonl
Project-specific command:
python scripts/export_mmore_corpus.py \
--corpus-path data/projects/mmore/corpus.jsonl \
--output-path data/projects/mmore/mmore_corpus.jsonl
MMORE-compatible records look like:
{
"text": "...",
"modalities": [],
"metadata": {}
}
GitHelp adds a short source header inside the text field before indexing. This makes it possible to reconstruct source information after MMORE retrieval.
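As a rough illustration of the record shape and the source header, the sketch below wraps a piece of text into an MMORE-compatible record and recovers the source afterwards. The header layout (`[source: ...]` on the first line) and both function names are assumptions made for this example; they are not taken from src/githelp/indexing/mmore_format.py.

```python
import json
from typing import Optional, Tuple

HEADER_PREFIX = "[source: "  # hypothetical header layout, not GitHelp's actual one


def to_mmore_record(text: str, source: str) -> dict:
    """Wrap text in the MMORE record shape, with the source in a header line."""
    header = f"{HEADER_PREFIX}{source}]"
    return {"text": f"{header}\n{text}", "modalities": [], "metadata": {}}


def split_source_header(text: str) -> Tuple[Optional[str], str]:
    """Return (source, body); source is None when no header is present."""
    first_line, _, body = text.partition("\n")
    if first_line.startswith(HEADER_PREFIX) and first_line.endswith("]"):
        return first_line[len(HEADER_PREFIX):-1], body
    return None, text


record = to_mmore_record("Run the build script first.", "docs/indexing.md")
line = json.dumps(record)  # one JSONL line
source, body = split_source_header(json.loads(line)["text"])
```

Keeping the header inside `text` means the source survives even if a retriever returns only the text and drops `metadata`.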
Step 2: build the MMORE index
Default command:
python scripts/build_index.py
Project-specific command:
python scripts/build_index.py \
--documents-path data/projects/mmore/mmore_corpus.jsonl \
--collection-name mmore_docs
This uses:
configs/mmore_index_config.yaml
and stores the index under:
data/indexes/mmore/
GitHelp can recover from missing Milvus model metadata by reading model names
from configs/mmore_index_config.yaml. If rebuilding fails, inspect the build
output shown by Streamlit or run the command directly with logs enabled.
In local environments where native MMORE/Milvus retrieval crashes, GitHelp runs
native retrieval in an isolated child process. If that process fails, the
mmore backend falls back to the exported mmore_corpus.jsonl so Streamlit can
still answer from the MMORE-formatted corpus. This fallback uses the simple
lexical ranking algorithm; it is not native MMORE/Milvus retrieval.
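The isolate-then-fall-back pattern described above can be sketched as follows. This is a minimal illustration, not GitHelp's implementation: `run_isolated`, `lexical_rank`, and the term-overlap scoring are hypothetical stand-ins, and the child command here only simulates a failing native process.

```python
import subprocess
import sys
from typing import List, Optional


def run_isolated(args: List[str], timeout: float = 30.0) -> Optional[str]:
    """Run a command in a child process; return stdout, or None on failure."""
    try:
        done = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    # A crash in native code shows up as a non-zero exit code in the parent.
    return done.stdout if done.returncode == 0 else None


def lexical_rank(query: str, texts: List[str], k: int = 3) -> List[str]:
    """Toy lexical ranking: order texts by the number of shared query terms."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(t.lower().split())), t) for t in texts]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [t for _, t in ranked[:k]]


# Simulate a child process that exits abnormally, as a native crash would.
failing_child = [sys.executable, "-c", "import os; os._exit(1)"]
native_result = run_isolated(failing_child)

corpus_texts = ["how to build the index", "export the corpus", "rebuild the index"]
if native_result is None:
    results = lexical_rank("build index", corpus_texts)  # fallback path
else:
    results = native_result.splitlines()
```

Running the risky call in a separate process means a hard crash takes down only the child, so the parent (here, the Streamlit app) can still choose the fallback.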
The default index config stores one Milvus Lite database at
data/indexes/mmore/githelp.db, and the app currently builds the shared
mmore_docs collection. Rebuilding a native index resets that local database,
so the most recently built native project index replaces the previous one.
Why keep indexing separate?
The corpus can be built and inspected before MMORE is involved.
This makes debugging easier:
build corpus.jsonl;
preview the records;
test simple retrieval;
only then export and index with MMORE.
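For the preview step, something like the following is usually enough. It is a sketch that assumes only that the corpus is JSONL (one JSON object per line); `preview_jsonl` is a hypothetical helper, and the sample records are invented for the demonstration.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path
from typing import List


def preview_jsonl(path: Path, limit: int = 3) -> List[dict]:
    """Return the first `limit` non-empty records from a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        lines = (line for line in handle if line.strip())
        return [json.loads(line) for line in islice(lines, limit)]


# Demonstration on a throwaway file; point it at data/processed/corpus.jsonl instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "corpus.jsonl"
    sample.write_text('{"text": "first"}\n\n{"text": "second"}\n', encoding="utf-8")
    records = preview_jsonl(sample, limit=2)

for record in records:
    print(sorted(record))  # show which keys each record carries
```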
Important distinction
Building a corpus does not automatically rebuild the MMORE index.
For a newly selected project in Streamlit:
Build simple index → corpus.jsonl → backend simple
For MMORE retrieval:
Build MMORE index → corpus.jsonl → mmore_corpus.jsonl → native index → backend mmore
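Because the stages are decoupled, a rebuilt corpus.jsonl can leave mmore_corpus.jsonl (and any index built from it) out of date. One way to notice this is to compare modification times, as in the sketch below. `export_is_stale` is a hypothetical helper; nothing here implies that GitHelp performs this check itself.

```python
import os
import tempfile
from pathlib import Path


def export_is_stale(corpus: Path, export: Path) -> bool:
    """True when the export is missing or older than the corpus it came from."""
    if not export.exists():
        return True
    return export.stat().st_mtime < corpus.stat().st_mtime


with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.jsonl"
    export = Path(tmp) / "mmore_corpus.jsonl"
    corpus.write_text("{}\n", encoding="utf-8")
    missing = export_is_stale(corpus, export)  # export not written yet

    export.write_text("{}\n", encoding="utf-8")
    os.utime(corpus, (1_000, 1_000))  # pin timestamps so the demo is deterministic
    os.utime(export, (2_000, 2_000))
    fresh = not export_is_stale(corpus, export)

    os.utime(corpus, (3_000, 3_000))  # corpus rebuilt after the export
    stale_again = export_is_stale(corpus, export)
```

Modification times are only a heuristic; copying or checking out files can change them without changing content.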