I don't have an answer, but I have the very same questions and am eagerly awaiting a solid response :)
Russell

On Fri, Jan 3, 2025 at 5:07 AM Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
> Hi team,
>
> We are planning to use Spark for pre-processing the ML training data,
> given that the data is 500+ TBs.
>
> One of the steps in the data pre-processing requires us to use an LLM (our
> own deployment of the model). I wanted to understand the right way to
> architect this. These are the options I can think of:
>
> - Split this into multiple applications at the LLM step, and use a
>   workflow manager to feed the output of application 1 to the LLM and the
>   output of the LLM to application 2
> - Split this into multiple stages by writing orchestration code that feeds
>   the output of the pre-LLM processing stages to the externally hosted LLM
>   and vice versa
>
> I wanted to know if there is an easier way to do this within Spark, or
> whether there are plans to make such functionality a first-class citizen
> of Spark in the future. Also, please suggest any better alternatives.
>
> Thanks,
> Mayur
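For reference, one pattern that avoids splitting the pipeline into separate applications is to keep the LLM call inside the Spark job itself, e.g. via `rdd.mapPartitions`, batching rows within each partition before posting them to the externally hosted model. Below is a minimal, hypothetical sketch of the partition-level function only; the `call_llm` stub, batch size, and wiring comment are assumptions for illustration, not anything Spark provides out of the box.

```python
# Sketch of the "call the LLM from inside Spark" approach: a function
# suitable for rdd.mapPartitions, which batches rows within a partition
# and sends each batch to an externally hosted model. call_llm is a
# hypothetical stand-in for an HTTP call to the team's own deployment.

from typing import Callable, Iterable, Iterator, List


def annotate_partition(
    rows: Iterable[str],
    call_llm: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> Iterator[str]:
    """Yield one model response per input row, batching requests so each
    partition amortizes request overhead across batch_size rows."""
    batch: List[str] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from call_llm(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield from call_llm(batch)


# In a real job this would be wired up roughly as:
#   df.rdd.mapPartitions(lambda it: annotate_partition(it, call_llm))
# with call_llm posting the batch to the self-hosted model endpoint.
```

The trade-off versus the workflow-manager options is that LLM latency and failures then live inside Spark tasks, so rate limiting and retries around `call_llm` become the job's responsibility.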