Data flow

This page explains how information moves through GitHelp.

Step 1: project selection

GitHelp starts from a target project repository.

The Streamlit interface lets a user select a local project path. For example, the MMORE repository may be located at:

/path/to/mmore

GitHelp then creates a project-specific working folder:

data/projects/<project_name>/

For MMORE:

data/projects/mmore/

Step 2: project configuration

GitHelp generates or reads a project configuration file.

For a Streamlit-selected project, this is usually:

data/projects/<project_name>/project_config.yaml

For older command-line workflows, the default config remains:

configs/project_config.yaml

The project config points to:

docs_path          -> Markdown and RST documentation
code_path          -> Python source code
yaml_config_paths  -> YAML example or production configs
repo_path          -> repository structure

Step 3: normalized records

Each source is converted into a DocumentRecord.

A DocumentRecord contains:

  • a unique doc_id;

  • textual content;

  • a source_type;

  • optional source metadata such as path, title, module, symbol, and signature.

Example source types:

markdown_section
python_module
python_class
python_function
python_method
example_config
production_config
repo_structure

Step 4: corpus file

All records are saved to JSONL.

Default command-line corpus:

data/processed/corpus.jsonl

Project-specific corpus:

data/projects/<project_name>/corpus.jsonl

Each line is one serialized DocumentRecord.

Step 5: query preparation and retrieval

There are two retrieval paths.

Before retrieval, the selected project profile may expand the question with project-specific terms. The expanded text is used only for retrieval; the original question remains the question shown to the answer generator.

Simple retrieval

The simple retriever reads the selected corpus.jsonl directly.

It is useful for:

  • debugging corpus quality;

  • testing queries quickly;

  • using newly built project corpora;

  • avoiding MMORE indexing during development.

MMORE retrieval

For MMORE, the corpus is first converted to:

data/projects/<project_name>/mmore_corpus.jsonl

or, in the default workflow:

data/processed/mmore_corpus.jsonl

Then MMORE indexes it and retrieves from the generated index. If native retrieval fails, GitHelp can instead run lexical retrieval over the exported mmore_corpus.jsonl and reports the mode as corpus_fallback.

The native MMORE backend does not automatically read a newly built corpus.jsonl. The export and index must be rebuilt when the target corpus changes.

For code-, symbol-, and filename-oriented questions, the answering pipeline can also add candidates from the simple retriever before final filtering and reranking.

Step 6: project profile

After retrieval, GitHelp applies the result-processing parts of the selected project profile.

A project profile can:

  • filter low-value or irrelevant results;

  • rerank results for known intents;

  • answer some structured questions directly without calling the LLM.

The MMORE profile, for example, can answer some Milvus parameter questions deterministically to avoid unrelated configuration fields.

The project profile comes from the application config. The generated project config does not select it automatically, so use project_profile: generic when querying another project unless a custom profile is required.

Step 7: prompt and answer

If no direct answer is returned by the project profile, retrieved sources are formatted into a prompt.

The LLM is instructed to:

  • answer only from the provided sources;

  • cite every factual statement;

  • avoid inventing commands, paths, APIs, or configuration keys;

  • say clearly when the available sources are insufficient.

The final output is an answer with cited sources.