Hi Mayur,

Please evaluate LangChain's Spark DataFrame Agent for your use case.

Documentation:
1) https://python.langchain.com/v0.1/docs/integrations/toolkits/spark/
2) https://python.langchain.com/docs/integrations/tools/spark_sql/
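A minimal sketch of wiring the agent up, going by the first link: it assumes
langchain-experimental and langchain-openai are installed, and that your own
model deployment exposes an OpenAI-compatible endpoint (the base_url, api_key,
and file path below are placeholders, not tested against your setup):

from pyspark.sql import SparkSession
from langchain_experimental.agents import create_spark_dataframe_agent
from langchain_openai import ChatOpenAI

spark = SparkSession.builder.getOrCreate()

# Any DataFrame works here; the CSV is just an example input.
df = spark.read.csv("titanic.csv", header=True, inferSchema=True)

# Point the client at your own deployment (placeholder URL/key).
llm = ChatOpenAI(
    base_url="http://llm.internal:8000/v1",
    api_key="not-needed",
    temperature=0,
)

agent = create_spark_dataframe_agent(llm=llm, df=df, verbose=True)
agent.run("How many rows are there?")

Note that the agent is aimed at interactive, natural-language exploration of a
DataFrame; for a 500+ TB batch pipeline you would still drive the LLM step
yourself (see the sketch below the quoted mail).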
Regards,
Guru

On Fri, Jan 3, 2025 at 6:38 PM Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
>
> Hi team,
>
> We are planning to use Spark for pre-processing our ML training data,
> given that the data is 500+ TB.
>
> One of the pre-processing steps requires us to use an LLM (our own
> deployment of the model). I wanted to understand the right way to
> architect this. These are the options I can think of:
>
> - Split the pipeline into multiple applications at the LLM step, and use
> a workflow manager to feed the output of application 1 to the LLM and the
> output of the LLM to application 2.
> - Split the pipeline into multiple stages by writing orchestration code
> that feeds the output of the pre-LLM stages to the externally hosted LLM
> and vice versa.
>
> I wanted to know whether there is an easier way to do this within Spark,
> or whether there are plans to make such functionality a first-class
> citizen of Spark in the future. Please also suggest any better
> alternatives.
>
> Thanks,
> Mayur
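On the single-application question: Spark has no first-class LLM step today,
but you can often avoid splitting the pipeline at all by wrapping the call to
your hosted model in a UDF, so the pre-LLM processing, the LLM call, and the
post-LLM processing stay in one job. A rough sketch with a pandas UDF,
assuming the deployment exposes an OpenAI-compatible HTTP endpoint (the URL,
model name, paths, and column names below are placeholders):

import pandas as pd
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

LLM_URL = "http://llm.internal:8000/v1/completions"  # placeholder endpoint

@pandas_udf(StringType())
def annotate(texts: pd.Series) -> pd.Series:
    # One HTTP call per row for clarity; at 500+ TB you would batch
    # requests, add retries, and cap concurrency per executor.
    out = []
    for text in texts:
        resp = requests.post(
            LLM_URL,
            json={"model": "my-model", "prompt": text, "max_tokens": 256},
            timeout=60,
        )
        resp.raise_for_status()
        out.append(resp.json()["choices"][0]["text"])
    return pd.Series(out)

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/pre_llm/")       # pre-LLM stage output
df = df.withColumn("llm_output", annotate("text"))    # LLM step
df.write.parquet("s3://bucket/post_llm_input/")       # post-LLM stage input

The trade-off versus your workflow-manager option is fault isolation: with a
UDF the LLM service becomes a dependency of the Spark job itself, so you lose
the ability to retry the LLM stage independently. Either can be reasonable at
your scale.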