How well does Spark handle scheduling for GPUs these days? Three years ago my team used GPUs with Spark on Databricks (we were one of their first customers), and we couldn't push GPU utilization above about 35% when doing inference, encoding string fields with sentence transformers for fuzzy string matching. That was a major cost factor, and I've read that others have hit the same problem... could it be that it takes high-end GPU-to-GPU networking to make this work? Does Spark-RAPIDS <https://nvidia.github.io/spark-rapids/> address this, and is it relevant to Mayur's query?
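
(For anyone unfamiliar with the setup, the kind of job I mean looks roughly like the sketch below, written as an iterator-style pandas UDF so the model loads once per task rather than per batch; the model name, batch size, paths, and column names are just illustrative, not our actual code.)

    from typing import Iterator
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf
    from pyspark.sql.types import ArrayType, FloatType

    spark = SparkSession.builder.getOrCreate()

    @pandas_udf(ArrayType(FloatType()))
    def encode_strings(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        # Load the model once per task and stream batches through it, so the
        # GPU sees large, steady batches instead of per-row calls.
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
        for texts in batches:
            embeddings = model.encode(texts.tolist(), batch_size=256)
            yield pd.Series([e.tolist() for e in embeddings])

    df = spark.read.parquet("/data/records")  # placeholder input
    df = df.withColumn("name_embedding", encode_strings(col("name")))

Even with the model loaded once per task like this, utilization depends on partition sizes and on how many concurrent tasks share each GPU (spark.task.resource.gpu.amount), which is part of what I mean by the scheduling question.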
Thanks,
Russell

On Fri, Jan 3, 2025 at 9:03 AM Holden Karau <holden.ka...@gmail.com> wrote:

> So I've been working on similar LLM pre-processing of data, and I would say
> one of the questions worth answering is: do you want/need your models to be
> collocated? If you're running on-prem in a GPU-rich environment there are a
> lot of benefits, but even with a custom model, if you're using 3rd-party
> inference or just trying to keep your GPUs warm in general, the co-location
> may not be as important.
>
> On Fri, Jan 3, 2025 at 9:01 AM Russell Jurney <russell.jur...@gmail.com> wrote:
>
>> Thanks! The first link is old; here is a more recent one:
>>
>> 1) https://python.langchain.com/docs/integrations/providers/spark/#spark-sql-individual-tools
>>
>> Russell
>>
>> On Fri, Jan 3, 2025 at 8:50 AM Gurunandan <gurunandan....@gmail.com> wrote:
>>
>>> Hi Mayur,
>>> Please evaluate LangChain's Spark DataFrame Agent for your use case.
>>>
>>> Documentation:
>>> 1) https://python.langchain.com/v0.1/docs/integrations/toolkits/spark/
>>> 2) https://python.langchain.com/docs/integrations/tools/spark_sql/
>>>
>>> Regards,
>>> Guru
>>>
>>> On Fri, Jan 3, 2025 at 6:38 PM Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
>>> >
>>> > Hi team,
>>> >
>>> > We are planning to use Spark for pre-processing the ML training data,
>>> > given the data is 500+ TB.
>>> >
>>> > One of the steps in the pre-processing requires us to use an LLM (our
>>> > own deployment of the model). I wanted to understand the right way to
>>> > architect this. These are the options I can think of:
>>> >
>>> > - Split this into multiple applications at the LLM step. Use a workflow
>>> > manager to feed the output of application 1 to the LLM and the output
>>> > of the LLM to application 2.
>>> > - Split this into multiple stages by writing orchestration code that
>>> > feeds the output of the pre-LLM processing stages to an externally
>>> > hosted LLM and vice versa.
>>> >
>>> > I wanted to know whether there is an easier way to do this within Spark,
>>> > or whether there are plans to make such functionality a first-class
>>> > citizen of Spark in the future. Also, please suggest any other better
>>> > alternatives.
>>> >
>>> > Thanks,
>>> > Mayur
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
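
A minimal sketch of the second option Mayur lists, calling an externally hosted LLM from inside the Spark job with mapInPandas, might look like the following; it assumes an OpenAI-compatible chat-completions endpoint, and the endpoint URL, model name, prompt, paths, and column names are all illustrative placeholders:

    from typing import Iterator
    import pandas as pd
    import requests
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    LLM_URL = "http://llm-service.internal:8000/v1/chat/completions"  # placeholder

    def annotate_with_llm(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        session = requests.Session()  # one HTTP session per partition
        for pdf in batches:
            labels = []
            for text in pdf["raw_text"]:  # placeholder column
                resp = session.post(LLM_URL, json={
                    "model": "my-hosted-model",  # placeholder
                    "messages": [{"role": "user",
                                  "content": f"Label this record: {text}"}],
                })
                resp.raise_for_status()
                labels.append(resp.json()["choices"][0]["message"]["content"])
            pdf["llm_label"] = labels
            yield pdf

    df = spark.read.parquet("/data/preprocessed")   # output of the pre-LLM stages
    result = df.mapInPandas(annotate_with_llm,
                            schema=df.schema.add("llm_label", "string"))
    result.write.parquet("/data/llm_annotated")     # input to the post-LLM stages

In practice the requests would be batched or issued with an async client, and this is where the co-location question above comes in: with an external endpoint the Spark executors only need CPUs, whereas a co-located model would instead be loaded inside the function on GPU executors.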