Hi Mayur,

Please evaluate LangChain's Spark DataFrame Agent for your use case.

Documentation:
1) https://python.langchain.com/v0.1/docs/integrations/toolkits/spark/
2) https://python.langchain.com/docs/integrations/tools/spark_sql/
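A minimal sketch of wiring the agent up, going by the first link: it assumes
langchain-experimental and langchain-openai are installed, and that your own
model deployment exposes an OpenAI-compatible endpoint (the base_url, api_key,
and file path below are placeholders, not tested against your setup):

from pyspark.sql import SparkSession
from langchain_experimental.agents import create_spark_dataframe_agent
from langchain_openai import ChatOpenAI

spark = SparkSession.builder.getOrCreate()

# Any DataFrame works here; the CSV is just an example input.
df = spark.read.csv("titanic.csv", header=True, inferSchema=True)

# Point the client at your own deployment (placeholder URL/key).
llm = ChatOpenAI(
    base_url="http://llm.internal:8000/v1",
    api_key="not-needed",
    temperature=0,
)

agent = create_spark_dataframe_agent(llm=llm, df=df, verbose=True)
agent.run("How many rows are there?")

Note that the agent is aimed at interactive, natural-language exploration of a
DataFrame; for a 500+ TB batch pipeline you would still drive the LLM step
yourself (see the sketch below the quoted mail).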
Regards,
Guru

On Fri, Jan 3, 2025 at 6:38 PM Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
>
> Hi team,
>
> We are planning to use Spark for pre-processing our ML training data,
> given that the data is 500+ TB.
>
> One of the pre-processing steps requires us to use an LLM (our own
> deployment of the model). I wanted to understand the right way to
> architect this. These are the options I can think of:
>
> - Split the pipeline into multiple applications at the LLM step, and use
> a workflow manager to feed the output of application 1 to the LLM and the
> output of the LLM to application 2.
> - Split the pipeline into multiple stages by writing orchestration code
> that feeds the output of the pre-LLM stages to the externally hosted LLM
> and vice versa.
>
> I wanted to know whether there is an easier way to do this within Spark,
> or whether there are plans to make such functionality a first-class
> citizen of Spark in the future. Please also suggest any better
> alternatives.
>
> Thanks,
> Mayur
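On the single-application question: Spark has no first-class LLM step today,
but you can often avoid splitting the pipeline at all by wrapping the call to
your hosted model in a UDF, so the pre-LLM processing, the LLM call, and the
post-LLM processing stay in one job. A rough sketch with a pandas UDF,
assuming the deployment exposes an OpenAI-compatible HTTP endpoint (the URL,
model name, paths, and column names below are placeholders):

import pandas as pd
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

LLM_URL = "http://llm.internal:8000/v1/completions"  # placeholder endpoint

@pandas_udf(StringType())
def annotate(texts: pd.Series) -> pd.Series:
    # One HTTP call per row for clarity; at 500+ TB you would batch
    # requests, add retries, and cap concurrency per executor.
    out = []
    for text in texts:
        resp = requests.post(
            LLM_URL,
            json={"model": "my-model", "prompt": text, "max_tokens": 256},
            timeout=60,
        )
        resp.raise_for_status()
        out.append(resp.json()["choices"][0]["text"])
    return pd.Series(out)

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/pre_llm/")       # pre-LLM stage output
df = df.withColumn("llm_output", annotate("text"))    # LLM step
df.write.parquet("s3://bucket/post_llm_input/")       # post-LLM stage input

The trade-off versus your workflow-manager option is fault isolation: with a
UDF the LLM service becomes a dependency of the Spark job itself, so you lose
the ability to retry the LLM stage independently. Either can be reasonable at
your scale.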