RAG:

- You feed blocks of text into an embedding model (usually offered alongside 
the LLM) and receive embeddings (numeric vectors)

- You store each embedding, together with its block of text, in a local 
database that handles vectors

- You create a query, send it to the same embedding model, and get the 
embedding for it.

- You feed your query's embedding into your local vector database to find the 
nearest N matches (i.e., the related text blocks).

- You send the related text blocks (not the embeddings) to the LLM as context, 
along with your original text query, so the answer is generated with the most 
relevant context in hand. A bare-bones sketch of the retrieval side is below.
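
Roughly, the retrieval side looks like the sketch below. It is only an 
in-memory illustration, not the real thing: embed() is a placeholder for 
whatever embedding API you call, the plain list stands in for the vector 
database, and the class and method names are invented for the example.

import groovy.transform.Canonical

@Canonical
class Chunk {
    String   text
    double[] vector
}

class RagIndex {
    List<Chunk> store = []      // a real vector database replaces this list

    // Placeholder -- wire this to whatever embedding API you actually call.
    double[] embed(String text) {
        throw new UnsupportedOperationException('call your embedding API here')
    }

    // First two bullets: embed a block of text and keep the vector with the text.
    void index(String block) {
        store << new Chunk(block, embed(block))
    }

    // Next two bullets: embed the query and return the N closest stored blocks.
    List<String> nearest(String query, int n) {
        double[] q = embed(query)
        def ranked = store.sort(false) { -cosine(it.vector, q) }  // best match first
        ranked.take(n).collect { it.text }
    }

    private static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]
        }
        dot / (Math.sqrt(na) * Math.sqrt(nb))
    }
}

The last bullet is just prompt assembly: paste the returned blocks into the 
prompt ahead of the question and send it to the LLM.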

So, when the corpus is code, it works best if the "related text" is "related 
methods." To get those blocks of text out of Groovy source code, you have to 
break classes up into individual methods. That's why I need the parser: my 
blocks of text (or chunks) are Groovy methods.

With the help of ChatGPT and ANTLR, I think I now have a Groovy chunker; a 
rough sketch of the idea is below. I will be releasing it as part of my KISS 
open-source framework (kissweb.org).
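
To make the idea concrete, here is a rough sketch of method-level chunking. 
To keep the example short it uses the Groovy compiler's own AST 
(CompilationUnit and friends) rather than the ANTLR grammar the actual 
chunker is built on, and the file name is made up; the point is just 
"one method = one chunk."

import org.codehaus.groovy.control.CompilationUnit
import org.codehaus.groovy.control.Phases

// Parse a Groovy source file and return one text chunk per declared method.
// Compiles only to the CONVERSION phase (parse + AST build, no bytecode),
// then slices the original source by each method's line numbers.
List<String> methodChunks(File groovyFile) {
    List<String> lines = groovyFile.readLines()
    CompilationUnit cu = new CompilationUnit()
    cu.addSource(groovyFile.name, groovyFile.text)
    cu.compile(Phases.CONVERSION)

    List<String> chunks = []
    cu.getAST().getModules().each { module ->
        module.getClasses().each { classNode ->
            classNode.getMethods().each { m ->
                if (m.lineNumber > 0)      // skip compiler-generated methods
                    chunks << lines[(m.lineNumber - 1)..(m.lastLineNumber - 1)].join('\n')
            }
        }
    }
    chunks
}

// Hypothetical usage: each chunk is one method, ready to be embedded.
methodChunks(new File('SomeService.groovy')).each { println it + '\n---' }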

Thanks!

Blake

On Tuesday, April 8th, 2025 at 8:35 AM, Alessio Stalla 
<alessiosta...@gmail.com> wrote:

> Hi Blake,
>
> In my understanding of LLMs, they don't need a parser at all; embeddings work 
> on the text. There are experiments with embedding AST nodes (code2vec, 
> code2seq), but what people usually do is just treat code as any other piece 
> of text.
>
> On Mon, 7 Apr 2025 at 01:49, Blake McBride <bl...@mcbridemail.com> wrote:
>
>> Greetings,
>>
>> I am trying to write a RAG chunker for Groovy. This is used to (essentially) 
>> train an AI/LLM on my code base so that the AI/LLM can help me with my 
>> Groovy application.
>>
>> Essentially, what I need to do is read in a Groovy source file and do 
>> something (create embeddings) for each individual method. This was pretty 
>> trivial in Java because there are ready-made Java parsers. However, I have 
>> spent a long time trying to create a parser for Groovy but have so far been 
>> unsuccessful.
>>
>> I sure appreciate any suggestions.
>>
>> Thanks!
>>
>> Blake McBride
