So it's improved a lot with resource profiles, but in OSS Spark it's far from automatic: you'll have to do a fair amount of manual work setting up the resource profiles and tagging your stages with them. For hosted solutions like Databricks there might be some magic.
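For reference, a minimal sketch of what that manual work looks like in OSS Spark (assuming Spark 3.1+; the discovery script path and the run_inference function are placeholders for whatever your cluster and pipeline actually use):

    from pyspark.resource import (ResourceProfileBuilder,
                                  ExecutorResourceRequests,
                                  TaskResourceRequests)

    # Executors for this stage each get one GPU; the discovery script is
    # whatever your cluster uses to enumerate GPU addresses.
    ereqs = (ExecutorResourceRequests()
             .cores(4)
             .resource("gpu", 1,
                       discoveryScript="/opt/spark/getGpus.sh",
                       vendor="nvidia.com"))
    # One GPU per task so concurrent tasks don't oversubscribe the device.
    treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
    profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

    # Tag only the inference stage with the profile; earlier CPU-only
    # stages keep the default profile.
    embedded = df.rdd.withResources(profile).mapPartitions(run_inference)

Note that stage-level scheduling goes through the RDD API and has historically required dynamic allocation on YARN/K8s, which is part of the manual setup mentioned above.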
On Fri, Jan 3, 2025 at 9:19 AM Russell Jurney <russell.jur...@gmail.com> wrote:

> How well does Spark handle scheduling for GPUs these days? Three years
> ago my team used GPUs with Spark on Databricks (one of the first
> customers), and we couldn't saturate our GPUs beyond 35% when doing
> inference, encoding string fields with sentence transformers for fuzzy
> string matching. This was a major cost factor that I've read about from
> others... could it be that it takes high-end GPU-to-GPU networking to
> make things work? Does Spark-RAPIDS <https://nvidia.github.io/spark-rapids/>
> address this - is it relevant to his query?
>
> Thanks,
> Russell
>
> On Fri, Jan 3, 2025 at 9:03 AM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> So I've been working on similar LLM pre-processing of data, and I would
>> say one of the questions worth answering is: do you want/need your
>> models to be collocated? If you're running on-prem in a GPU-rich
>> environment there are a lot of benefits, but even with a custom model,
>> if you're using third-party inference or even just trying to keep your
>> GPUs warm in general, the co-location may not be as important.
>>
>> On Fri, Jan 3, 2025 at 9:01 AM Russell Jurney <russell.jur...@gmail.com> wrote:
>>
>>> Thanks! The first link is old; here is a more recent one:
>>>
>>> 1) https://python.langchain.com/docs/integrations/providers/spark/#spark-sql-individual-tools
>>>
>>> Russell
>>>
>>> On Fri, Jan 3, 2025 at 8:50 AM Gurunandan <gurunandan....@gmail.com> wrote:
>>>
>>>> Hi Mayur,
>>>> Please evaluate LangChain's Spark DataFrame Agent for your use case.
>>>>
>>>> Documentation:
>>>> 1) https://python.langchain.com/v0.1/docs/integrations/toolkits/spark/
>>>> 2) https://python.langchain.com/docs/integrations/tools/spark_sql/
>>>>
>>>> Regards,
>>>> Guru
>>>>
>>>> On Fri, Jan 3, 2025 at 6:38 PM Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
>>>> >
>>>> > Hi team,
>>>> >
>>>> > We are planning to use Spark for pre-processing the ML training
>>>> > data, given that the data is 500+ TB.
>>>> >
>>>> > One of the steps in the data pre-processing requires us to use an
>>>> > LLM (our own deployment of the model). I wanted to understand the
>>>> > right way to architect this. These are the options I can think of:
>>>> >
>>>> > - Split this into multiple applications at the LLM use-case step.
>>>> > Use a workflow manager to feed the output of application-1 to the
>>>> > LLM and the output of the LLM to application-2
>>>> > - Split this into multiple stages by writing orchestration code
>>>> > that feeds the output of the pre-LLM processing stages to the
>>>> > externally hosted LLM and vice versa
>>>> >
>>>> > I wanted to know if there is an easier way to do this within Spark,
>>>> > or if there are plans to make such functionality a first-class
>>>> > citizen of Spark in the future? Also, please suggest any other
>>>> > better alternatives.
>>>> >
>>>> > Thanks,
>>>> > Mayur
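Re: Russell's 35% saturation point: as I understand it, Spark-RAPIDS mainly accelerates SQL/DataFrame operations on the GPU rather than model inference, so it's mostly orthogonal here. What tends to help is loading the model once per partition and encoding in large batches, so the GPU isn't starved by per-row calls. A rough sketch with sentence-transformers (the model name, batch size, and id/text columns are illustrative):

    from sentence_transformers import SentenceTransformer

    def encode_partition(rows, batch_size=512):
        # Load the model once per partition, not once per record;
        # per-record model calls are what keep GPU utilization low.
        model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
        buf = []
        for row in rows:
            buf.append(row)
            if len(buf) == batch_size:
                # Pass batch_size through so encode() doesn't fall back
                # to its small internal default.
                vecs = model.encode([r.text for r in buf],
                                    batch_size=batch_size)
                yield from zip((r.id for r in buf), vecs)
                buf = []
        if buf:
            vecs = model.encode([r.text for r in buf], batch_size=len(buf))
            yield from zip((r.id for r in buf), vecs)

    embeddings = df.select("id", "text").rdd.mapPartitions(encode_partition)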
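And on Mayur's original question: if the LLM sits behind an HTTP endpoint (collocated or not), you can often keep the whole pipeline in a single Spark application by calling the endpoint from a pandas UDF, instead of splitting into multiple applications around the LLM step. A hedged sketch; the endpoint URL and response shape are assumptions (OpenAI-style here), and production use would want request batching, retries, and rate limiting:

    import pandas as pd
    import requests
    from pyspark.sql.functions import pandas_udf

    LLM_URL = "http://llm-service:8000/v1/completions"  # placeholder

    @pandas_udf("string")
    def annotate(texts: pd.Series) -> pd.Series:
        session = requests.Session()  # reuse connections within a batch
        out = []
        for text in texts:
            resp = session.post(
                LLM_URL, json={"prompt": text, "max_tokens": 256},
                timeout=60)
            resp.raise_for_status()
            out.append(resp.json()["choices"][0]["text"])
        return pd.Series(out)

    # pre-LLM stages -> LLM call -> post-LLM stages, one application
    result = preprocessed_df.withColumn("llm_output", annotate("text"))

The trade-off versus a workflow manager is mostly failure isolation: at 500+ TB you may still want a checkpoint write before the LLM step so a retry doesn't redo all the pre-processing.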
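For completeness on the LangChain pointer, usage of the Spark DataFrame Agent looks roughly like this (it now lives in langchain-experimental and the API has moved between versions, so treat this as a sketch, not gospel; the data path is a placeholder):

    from pyspark.sql import SparkSession
    from langchain_openai import ChatOpenAI
    from langchain_experimental.agents import create_spark_dataframe_agent

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/data/sample.parquet")  # placeholder path

    # The agent turns natural-language questions into DataFrame ops;
    # allow_dangerous_code acknowledges it executes generated Python.
    agent = create_spark_dataframe_agent(
        llm=ChatOpenAI(temperature=0),
        df=df,
        verbose=True,
        allow_dangerous_code=True)
    agent.run("How many rows are there, and what are the column types?")

Worth noting it's aimed at interactive querying more than bulk pre-processing, so it may not fit the 500 TB pipeline itself.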
--
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her