# Indexing

GitHelp has its own corpus format, but MMORE expects a different JSONL format.

The indexing layer bridges the two.

## Relevant files

```text
src/githelp/indexing/mmore_format.py
src/githelp/indexing/mmore_indexer.py
scripts/export_mmore_corpus.py
scripts/build_index.py
```

## Step 1: export to MMORE format

Default command:

```bash
python scripts/export_mmore_corpus.py
```

Default input:

```text
data/processed/corpus.jsonl
```

Default output:

```text
data/processed/mmore_corpus.jsonl
```

Project-specific command:

```bash
python scripts/export_mmore_corpus.py \
  --corpus-path data/projects/mmore/corpus.jsonl \
  --output-path data/projects/mmore/mmore_corpus.jsonl
```

MMORE-compatible records look like:

```json
{
  "text": "...",
  "modalities": [],
  "metadata": {}
}
```

GitHelp adds a short source header inside the text field before indexing. This makes it possible to reconstruct source information after MMORE retrieval.

## Step 2: build the MMORE index

Default command:

```bash
python scripts/build_index.py
```

Project-specific command:

```bash
python scripts/build_index.py \
  --documents-path data/projects/mmore/mmore_corpus.jsonl \
  --collection-name mmore_docs
```

This uses:

```text
configs/mmore_index_config.yaml
```

and stores the index under:

```text
data/indexes/mmore/
```

GitHelp can recover from missing Milvus model metadata by reading model names
from `configs/mmore_index_config.yaml`. If rebuilding fails, inspect the build
output shown by Streamlit or run the command directly with logs enabled.

In local environments where native MMORE/Milvus retrieval crashes, GitHelp runs
native retrieval in an isolated child process. If that process fails, the
`mmore` backend falls back to the exported `mmore_corpus.jsonl` so Streamlit can
still answer from the MMORE-formatted corpus. This fallback uses the simple
lexical ranking algorithm; it is not native MMORE/Milvus retrieval.

The default index config stores one Milvus Lite database at
`data/indexes/mmore/githelp.db`, and the app currently builds the shared
`mmore_docs` collection. Rebuilding a native index resets that local database,
so the most recently built native project index replaces the previous one.

## Why keep indexing separate?

The corpus can be built and inspected before MMORE is involved.

This makes debugging easier:

1. build `corpus.jsonl`;
2. preview the records;
3. test simple retrieval;
4. only then export and index with MMORE.

## Important distinction

Building a corpus does not automatically rebuild the MMORE index.

For a newly selected project in Streamlit:

```text
Build simple index → corpus.jsonl → backend simple
```

For MMORE retrieval:

```text
Build MMORE index → corpus.jsonl → mmore_corpus.jsonl → native index → backend mmore
```