Hi Russell,

Spark's GPU scheduling capabilities have improved significantly with the advent of tools like the NVIDIA RAPIDS Accelerator for Spark.
<https://www.nvidia.com/en-gb/deep-learning-ai/solutions/data-science/apache-spark-3/>
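To make the configuration side concrete, here is a minimal, untested sketch of how the GPU-aware scheduling properties and the RAPIDS plugin described below are typically enabled together. The jar path, discovery-script location, and resource amounts are placeholders, not values from your environment:

from pyspark.sql import SparkSession

# Minimal sketch: enable GPU-aware scheduling plus the RAPIDS SQL plugin.
# Jar version, discovery-script path and resource amounts are placeholders.
spark = (
    SparkSession.builder
    .appName("gpu-etl-sketch")
    # GPU-aware scheduling: one GPU per executor, shared by four concurrent tasks
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/spark/scripts/getGpusResources.sh")
    # RAPIDS Accelerator: offload supported SQL/DataFrame work to the GPU
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.jars", "/opt/rapids/rapids-4-spark_2.12-<version>.jar")
    .getOrCreate()
)

Setting spark.task.resource.gpu.amount to a fraction lets several tasks share one GPU, which is often what keeps it busy during ETL-style work.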
The NVIDIA RAPIDS Accelerator for Spark is directly relevant to AI workloads and addresses historical challenges like underutilized GPUs. Proper batch optimization, infrastructure selection, and leveraging RAPIDS will help achieve better GPU saturation and cost efficiency. Specifically:

- Improved resource scheduling: Spark now has native support for GPU-aware scheduling:
  - Executors can request specific GPU resources (spark.executor.resource.gpu.amount).
  - Tasks are aware of GPU resources (spark.task.resource.gpu.amount).
- RAPIDS integration:
  - Offloads many Spark SQL and DataFrame operations to GPUs using NVIDIA RAPIDS.
    <https://developer.nvidia.com/blog/accelerating-apache-spark-3-0-with-gpus-and-rapids/>
  - Operations like joins, aggregations, and string manipulations are GPU-accelerated.
  - Significantly reduces CPU-GPU communication overhead by keeping data on the GPU end to end.
- Enhanced support for ML/inference:
  - Libraries like cuML <https://github.com/rapidsai/cuml> and cuDF <https://docs.rapids.ai/api/cudf/stable/> integrate smoothly, enabling efficient ML tasks such as sentence embedding generation.
  - The RAPIDS Accelerator can be combined with PyTorch/TensorFlow-based inference pipelines.

On the GPU saturation challenges: your experience of 35% GPU utilization during inference is most likely caused by one or more of the following:

- Bottlenecks in CPU-GPU communication: many small host-to-device transfers leave the GPU waiting on data.
- Data partitioning issues: insufficient partition sizes in Spark lead to underutilization; if the GPU cannot process large enough batches of data, its cores remain idle (see the batching sketch below).
- Limited networking bandwidth: without high-bandwidth interconnects like InfiniBand, multi-GPU setups suffer.
- I/O-bound workloads: if the job is dominated by I/O rather than GPU-computable work (e.g., simple string operations), the GPUs will never be saturated.
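To illustrate the partitioning/batching point, here is a rough sketch of batched sentence-embedding inference with mapInPandas. The model name, batch size, column name, partition count and input path are assumptions for illustration, not details from your pipeline:

from typing import Iterator
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.appName("embedding-sketch").getOrCreate()
# Larger Arrow batches mean larger chunks handed to the UDF, hence larger GPU batches.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

def embed_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Load the model once per partition, not once per row.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")  # placeholder model
    for pdf in batches:
        # A large batch_size keeps the GPU busy instead of feeding it tiny batches.
        embeddings = model.encode(pdf["text"].tolist(),
                                  batch_size=1024,
                                  show_progress_bar=False)
        pdf["embedding"] = embeddings.tolist()
        yield pdf

df = spark.read.parquet("/path/to/input")   # placeholder: a table with a "text" column
df = df.repartition(64)                     # fewer, larger partitions so each GPU call is well fed
out_schema = df.schema.add("embedding", ArrayType(FloatType()))
result = df.mapInPandas(embed_partition, schema=out_schema)

The two levers that usually matter are loading the model once per partition and making sure each call to the GPU carries a large batch; the same pattern applies whether the model is a sentence transformer or your own hosted LLM.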
Maybe it is about time to try again :) HTH

Mich Talebzadeh
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom

On Fri, 3 Jan 2025 at 17:22, Russell Jurney <russell.jur...@gmail.com> wrote:

> How well does Spark handle scheduling for GPUs these days? Three years ago my team used GPUs with Spark on Databricks (one of the first customers), and we couldn't saturate our GPUs more than 35% when doing inference, encoding string fields in sentence transformers for fuzzy string matching. This was a major cost factor that I've read about from others... could be that it takes high end GPU-to-GPU networking to make things work? Does Spark-RAPIDS <https://nvidia.github.io/spark-rapids/> address this - is it relevant to his query?
>
> Thanks,
> Russell
>
> On Fri, Jan 3, 2025 at 9:03 AM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> So I've been working on similar LLM pre-processing of data and I would say one of the questions worth answering is do you want/need your models to be collocated? If you're running on prem in a GPU rich env there's a lot of benefits, but even with a custom model, if your using 3rd party inference or even just trying to keep your GPUs warm in general the co-location may not be as important.
>>
>> On Fri, Jan 3, 2025 at 9:01 AM Russell Jurney <russell.jur...@gmail.com> wrote:
>>
>>> Thanks! The first link is old, here is a more recent one:
>>>
>>> 1) https://python.langchain.com/docs/integrations/providers/spark/#spark-sql-individual-tools
>>>
>>> Russell
>>>
>>> On Fri, Jan 3, 2025 at 8:50 AM Gurunandan <gurunandan....@gmail.com> wrote:
>>>
>>>> HI Mayur,
>>>> Please evaluate Langchain's Spark Dataframe Agent for your use case.
>>>>
>>>> documentation:
>>>> 1) https://python.langchain.com/v0.1/docs/integrations/toolkits/spark/
>>>> 2) https://python.langchain.com/docs/integrations/tools/spark_sql/
>>>>
>>>> regards,
>>>> Guru
>>>>
>>>> On Fri, Jan 3, 2025 at 6:38 PM Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
>>>> >
>>>> > Hi team,
>>>> >
>>>> > We are planning to use Spark for pre-processing the ML training data given the data is 500+ TBs.
>>>> >
>>>> > One of the steps in the data-preprocessing requires us to use a LLM (own deployment of model). I wanted to understand what is the right way to architect this? These are the options that I can think of:
>>>> >
>>>> > - Split this into multiple applications at the LLM use case step. Use a workflow manager to feed the output of the application-1 to LLM and feed the output of LLM to application 2
>>>> > - Split this into multiple stages by writing the orchestration code of feeding output of the pre-LLM processing stages to externally hosted LLM and vice versa
>>>> >
>>>> > I wanted to know if within Spark there is an easier way to do this or any plans of having such functionality as a first class citizen of Spark in future? Also, please suggest any other better alternatives.
>>>> >
>>>> > Thanks,
>>>> > Mayur
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her