So it's improved a lot with resource profiles, but in OSS Spark it's far from automatic: you'll have to do a fair amount of manual work setting up the resource profiles and tagging your stages with them. For hosted solutions like Databricks there might be some magic.
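For reference, a minimal sketch of what that manual work looks like in OSS Spark (assuming Spark 3.1+; the discovery script path and the run_inference function are placeholders for whatever your cluster and pipeline actually use):

    from pyspark.resource import (ResourceProfileBuilder,
                                  ExecutorResourceRequests,
                                  TaskResourceRequests)

    # Executors for this stage each get one GPU; the discovery script is
    # whatever your cluster uses to enumerate GPU addresses.
    ereqs = (ExecutorResourceRequests()
             .cores(4)
             .resource("gpu", 1,
                       discoveryScript="/opt/spark/getGpus.sh",
                       vendor="nvidia.com"))
    # One GPU per task so concurrent tasks don't oversubscribe the device.
    treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
    profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

    # Tag only the inference stage with the profile; earlier CPU-only
    # stages keep the default profile.
    embedded = df.rdd.withResources(profile).mapPartitions(run_inference)

Note that stage-level scheduling goes through the RDD API and has historically required dynamic allocation on YARN/K8s, which is part of the manual setup mentioned above.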
On Fri, Jan 3, 2025 at 9:19 AM Russell Jurney <russell.jur...@gmail.com> wrote:

> How well does Spark handle scheduling for GPUs these days? Three years
> ago my team used GPUs with Spark on Databricks (one of the first
> customers), and we couldn't saturate our GPUs beyond 35% when doing
> inference, encoding string fields with sentence transformers for fuzzy
> string matching. This was a major cost factor that I've read about from
> others... could it be that it takes high-end GPU-to-GPU networking to
> make things work? Does Spark-RAPIDS <https://nvidia.github.io/spark-rapids/>
> address this - is it relevant to his query?
>
> Thanks,
> Russell
>
> On Fri, Jan 3, 2025 at 9:03 AM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> So I've been working on similar LLM pre-processing of data, and I would
>> say one of the questions worth answering is: do you want/need your
>> models to be collocated? If you're running on-prem in a GPU-rich
>> environment there are a lot of benefits, but even with a custom model,
>> if you're using third-party inference or even just trying to keep your
>> GPUs warm in general, the co-location may not be as important.
>>
>> On Fri, Jan 3, 2025 at 9:01 AM Russell Jurney <russell.jur...@gmail.com> wrote:
>>
>>> Thanks! The first link is old; here is a more recent one:
>>>
>>> 1) https://python.langchain.com/docs/integrations/providers/spark/#spark-sql-individual-tools
>>>
>>> Russell
>>>
>>> On Fri, Jan 3, 2025 at 8:50 AM Gurunandan <gurunandan....@gmail.com> wrote:
>>>
>>>> Hi Mayur,
>>>> Please evaluate LangChain's Spark DataFrame Agent for your use case.
>>>>
>>>> Documentation:
>>>> 1) https://python.langchain.com/v0.1/docs/integrations/toolkits/spark/
>>>> 2) https://python.langchain.com/docs/integrations/tools/spark_sql/
>>>>
>>>> Regards,
>>>> Guru
>>>>
>>>> On Fri, Jan 3, 2025 at 6:38 PM Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
>>>> >
>>>> > Hi team,
>>>> >
>>>> > We are planning to use Spark for pre-processing the ML training
>>>> > data, given that the data is 500+ TB.
>>>> >
>>>> > One of the steps in the data pre-processing requires us to use an
>>>> > LLM (our own deployment of the model). I wanted to understand the
>>>> > right way to architect this. These are the options I can think of:
>>>> >
>>>> > - Split this into multiple applications at the LLM use-case step.
>>>> > Use a workflow manager to feed the output of application-1 to the
>>>> > LLM and the output of the LLM to application-2
>>>> > - Split this into multiple stages by writing orchestration code
>>>> > that feeds the output of the pre-LLM processing stages to the
>>>> > externally hosted LLM and vice versa
>>>> >
>>>> > I wanted to know if there is an easier way to do this within Spark,
>>>> > or if there are plans to make such functionality a first-class
>>>> > citizen of Spark in the future? Also, please suggest any other
>>>> > better alternatives.
>>>> >
>>>> > Thanks,
>>>> > Mayur
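Re: Russell's 35% saturation point: as I understand it, Spark-RAPIDS mainly accelerates SQL/DataFrame operations on the GPU rather than model inference, so it's mostly orthogonal here. What tends to help is loading the model once per partition and encoding in large batches, so the GPU isn't starved by per-row calls. A rough sketch with sentence-transformers (the model name, batch size, and id/text columns are illustrative):

    from sentence_transformers import SentenceTransformer

    def encode_partition(rows, batch_size=512):
        # Load the model once per partition, not once per record;
        # per-record model calls are what keep GPU utilization low.
        model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
        buf = []
        for row in rows:
            buf.append(row)
            if len(buf) == batch_size:
                # Pass batch_size through so encode() doesn't fall back
                # to its small internal default.
                vecs = model.encode([r.text for r in buf],
                                    batch_size=batch_size)
                yield from zip((r.id for r in buf), vecs)
                buf = []
        if buf:
            vecs = model.encode([r.text for r in buf], batch_size=len(buf))
            yield from zip((r.id for r in buf), vecs)

    embeddings = df.select("id", "text").rdd.mapPartitions(encode_partition)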
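And on Mayur's original question: if the LLM sits behind an HTTP endpoint (collocated or not), you can often keep the whole pipeline in a single Spark application by calling the endpoint from a pandas UDF, instead of splitting into multiple applications around the LLM step. A hedged sketch; the endpoint URL and response shape are assumptions (OpenAI-style here), and production use would want request batching, retries, and rate limiting:

    import pandas as pd
    import requests
    from pyspark.sql.functions import pandas_udf

    LLM_URL = "http://llm-service:8000/v1/completions"  # placeholder

    @pandas_udf("string")
    def annotate(texts: pd.Series) -> pd.Series:
        session = requests.Session()  # reuse connections within a batch
        out = []
        for text in texts:
            resp = session.post(
                LLM_URL, json={"prompt": text, "max_tokens": 256},
                timeout=60)
            resp.raise_for_status()
            out.append(resp.json()["choices"][0]["text"])
        return pd.Series(out)

    # pre-LLM stages -> LLM call -> post-LLM stages, one application
    result = preprocessed_df.withColumn("llm_output", annotate("text"))

The trade-off versus a workflow manager is mostly failure isolation: at 500+ TB you may still want a checkpoint write before the LLM step so a retry doesn't redo all the pre-processing.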
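For completeness on the LangChain pointer, usage of the Spark DataFrame Agent looks roughly like this (it now lives in langchain-experimental and the API has moved between versions, so treat this as a sketch, not gospel; the data path is a placeholder):

    from pyspark.sql import SparkSession
    from langchain_openai import ChatOpenAI
    from langchain_experimental.agents import create_spark_dataframe_agent

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/data/sample.parquet")  # placeholder path

    # The agent turns natural-language questions into DataFrame ops;
    # allow_dangerous_code acknowledges it executes generated Python.
    agent = create_spark_dataframe_agent(
        llm=ChatOpenAI(temperature=0),
        df=df,
        verbose=True,
        allow_dangerous_code=True)
    agent.run("How many rows are there, and what are the column types?")

Worth noting it's aimed at interactive querying more than bulk pre-processing, so it may not fit the 500 TB pipeline itself.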
--
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her