So I've been working on similar LLM pre-processing of data, and I would say one of the questions worth answering is whether you want or need your models to be co-located with your Spark job. If you're running on-prem in a GPU-rich environment, there are a lot of benefits. But even with a custom model, if you're using third-party inference, or even just trying to keep your GPUs warm in general, co-location may not be as important.
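As a rough illustration of the non-co-located case, here is a minimal sketch that batches the rows of each partition and sends them to an externally hosted model. Everything in it is an assumption for illustration: `call_llm` is a hypothetical client function (prompts in, one output per prompt out), `fake_llm` is only a stand-in so the sketch runs without a model, and the batch size is arbitrary.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def batched(rows: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of up to `size` items from `rows`."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def make_partition_fn(
    call_llm: Callable[[List[str]], List[str]],  # hypothetical client
    batch_size: int = 32,                        # arbitrary, tune for your endpoint
) -> Callable[[Iterable[str]], Iterator[Tuple[str, str]]]:
    """Build a function suitable for RDD.mapPartitions."""

    def process_partition(rows: Iterable[str]) -> Iterator[Tuple[str, str]]:
        # One request per batch instead of one per row keeps the
        # remote endpoint (and its GPUs) busier.
        for chunk in batched(rows, batch_size):
            outputs = call_llm(chunk)
            if len(outputs) != len(chunk):
                raise ValueError("LLM returned a different number of outputs")
            yield from zip(chunk, outputs)

    return process_partition


def fake_llm(prompts: List[str]) -> List[str]:
    """Stand-in for a real inference call; it only upper-cases the input."""
    return [p.upper() for p in prompts]


if __name__ == "__main__":
    fn = make_partition_fn(fake_llm, batch_size=2)
    for prompt, output in fn(iter(["a", "b", "c"])):
        print(prompt, output)
```

In a Spark job this would plug in roughly as `df.rdd.map(lambda r: r["text"]).mapPartitions(make_partition_fn(call_remote_llm))`, where the `text` column and `call_remote_llm` are again hypothetical. If the model is co-located instead, the same shape works with `call_llm` loading and invoking a local model on the executor.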

On Fri, Jan 3, 2025 at 9:01 AM Russell Jurney <russell.jur...@gmail.com> wrote:

> Thanks! The first link is old, here is a more recent one:
>
> 1) https://python.langchain.com/docs/integrations/providers/spark/#spark-sql-individual-tools
>
> Russell
>
> On Fri, Jan 3, 2025 at 8:50 AM Gurunandan <gurunandan....@gmail.com> wrote:
>
>> HI Mayur,
>> Please evaluate Langchain's Spark Dataframe Agent for your use case.
>>
>> documentation:
>> 1) https://python.langchain.com/v0.1/docs/integrations/toolkits/spark/
>> 2) https://python.langchain.com/docs/integrations/tools/spark_sql/
>>
>> regards,
>> Guru
>>
>> On Fri, Jan 3, 2025 at 6:38 PM Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
>> >
>> > Hi team,
>> >
>> > We are planning to use Spark for pre-processing the ML training data given the data is 500+ TBs.
>> >
>> > One of the steps in the data-preprocessing requires us to use a LLM (own deployment of model). I wanted to understand what is the right way to architect this? These are the options that I can think of:
>> >
>> > - Split this into multiple applications at the LLM use case step. Use a workflow manager to feed the output of the application-1 to LLM and feed the output of LLM to application 2
>> > - Split this into multiple stages by writing the orchestration code of feeding output of the pre-LLM processing stages to externally hosted LLM and vice versa
>> >
>> > I wanted to know if within Spark there is an easier way to do this or any plans of having such functionality as a first class citizen of Spark in future? Also, please suggest any other better alternatives.
>> >
>> > Thanks,
>> > Mayur
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>

--
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her