Ask HN: Any AI tools that help you make sense of a large repo?

5 points by zh2408 3 days ago

I'm trying to understand some large codebases. Current LLMs can understand code snippets, but not large repositories with their complex structures. For example, we need tools that can make sense of the Linux source code. If I ask about changing a file system or a thread locking mechanism, such tools should point to the most relevant files and help me understand how other code would be affected. Perhaps something like the repository map feature in Aider? But I would love a Cursor-like chat interface to help make sense of a code repository.

dgosling56 a day ago

+1, looking for an answer here as well. My guess is that the full source code of a large project will almost always exceed the LLM's token limit, so AI editors generally need to implement some RAG-like solution for these. Sourcegraph's Cody[0] claims to be able to answer questions across your entire codebase. I haven't tested it extensively, but I'm guessing it's competitive with other AI editors that offer this.
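To make the "RAG-like" idea concrete, here is a minimal sketch of the retrieval half: split files into line-based chunks, score each chunk against the question, and send only the top few chunks to the LLM. Everything in it is an assumption for illustration. The function names (`build_index`, `retrieve`), the 40-line chunk size, and the plain TF-IDF scoring are made up for this example and are not how Cody or any other tool actually works; real tools typically use embeddings or code-aware search instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase identifier-like tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def chunk_lines(text, size=40):
    """Yield (start_line, chunk_text) windows of `size` lines each."""
    lines = text.splitlines()
    for start in range(0, len(lines), size):
        yield start + 1, "\n".join(lines[start:start + size])


def build_index(files, size=40):
    """Build a list of chunk records from a dict of path -> source text."""
    index = []
    for path, text in files.items():
        for start, chunk in chunk_lines(text, size):
            index.append({
                "path": path,
                "line": start,
                "text": chunk,
                "tf": Counter(tokenize(chunk)),  # term frequencies per chunk
            })
    return index


def retrieve(index, query, k=3):
    """Return the k chunks with the highest simple TF-IDF score for the query."""
    n = len(index)
    q_tokens = set(tokenize(query))
    # Document frequency: how many chunks contain each query token.
    df = {t: sum(1 for c in index if t in c["tf"]) for t in q_tokens}
    scored = []
    for c in index:
        score = sum(c["tf"][t] * math.log(1 + n / df[t])
                    for t in q_tokens if df[t])
        if score > 0:
            scored.append((score, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```

The chunks returned by `retrieve` (with their `path` and `line`) would then be pasted into the prompt, which keeps the request under the token limit no matter how large the repository is.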

If I just have a GitHub file open in the browser for reference, I'll sometimes use Rocky AI[1] to explain the code to me. It helps avoid the constant copy-pasting into ChatGPT from your tab.

[0] https://sourcegraph.com/cody

[1] https://rockyai.me/

prash2488 2 days ago

I've worked on SourceSailor, a CLI tool that tries to tackle this exact problem, though I should note it's still in early stages. While it can't yet fully map complex codebases like the Linux kernel (that's a significant challenge), it does provide some useful capabilities for understanding smaller to medium-sized codebases. SourceSailor generates a structural understanding of your codebase and creates reports about dependencies and project architecture. It leverages LLMs (OpenAI, Anthropic, or Gemini) for analysis and allows you to ignore files you don't want to analyze (following how .gitignore is used and parsed) to focus on relevant parts of the codebase. However, I should be clear about its limitations:

- It's not as interactive as Cursor or Aider, and I'm not planning to make it so.

- Large codebases (like Linux) would be challenging due to the token limits of current LLMs. Gemini may help, but we all know its privacy policy shenanigans.

- The analysis is high-level rather than focused on implementation specifics. It helps you understand the codebase and tries to explain interesting parts, but YMMV.

If you're specifically looking to understand massive codebases like Linux, SourceSailor probably isn't a straightforward solution yet, and you would need workarounds. But if you're working with small to medium projects and need help understanding their structure and dependencies, it might be worth trying. The project is open source if you want to check it out or contribute: https://github.com/PrashamTrivedi/SourceSailor-CLI
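For anyone curious what .gitignore-style filtering before analysis can look like, here is a rough sketch. It is not SourceSailor's actual code: the function names are hypothetical, and the matching is a deliberately simplified subset of real .gitignore rules (no `!` negation, no `**`, no anchoring with a leading `/`).

```python
import fnmatch
from pathlib import PurePosixPath


def load_patterns(ignore_text):
    """Parse ignore-file text, skipping blank lines and '#' comments."""
    patterns = []
    for line in ignore_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(path, patterns):
    """Return True if a relative POSIX-style path matches any pattern.

    Simplified rules (an assumption, not full .gitignore semantics):
    - a pattern ending in '/' ignores any directory with that name;
    - any other pattern is matched against the full path and against
      each individual path component.
    """
    parts = PurePosixPath(path).parts
    for pat in patterns:
        if pat.endswith("/"):
            # Directory pattern: compare against directory components only.
            if any(fnmatch.fnmatchcase(p, pat[:-1]) for p in parts[:-1]):
                return True
        elif fnmatch.fnmatchcase(path, pat) or any(
            fnmatch.fnmatchcase(p, pat) for p in parts
        ):
            return True
    return False


def filter_files(paths, patterns):
    """Keep only the paths that no pattern ignores."""
    return [p for p in paths if not is_ignored(p, patterns)]
```

Dropping dependency folders and generated files this way is an easy win for token limits, since fewer files reach the LLM in the first place.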