Hi team,

We are planning to use Spark to pre-process our ML training data, given
that the dataset is 500+ TB.

One of the pre-processing steps requires us to use an LLM (our own
deployment of the model). I wanted to understand the right way to
architect this. These are the options I can think of:

- Split the pipeline into multiple applications at the LLM step, and use
a workflow manager to feed the output of application 1 to the LLM and
the output of the LLM to application 2
- Split the pipeline into multiple stages by writing orchestration code
that feeds the output of the pre-LLM processing stages to the externally
hosted LLM and vice versa
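
For context, the kind of step I have in mind looks roughly like the
sketch below: a per-partition function that batches rows and sends them
to the hosted LLM, which in Spark would be applied via
df.rdd.mapPartitions (or a pandas UDF for better throughput). The
call_llm wrapper, the "text" column name, and the batch size are all
placeholders, not our actual setup:

```python
from typing import Callable, Iterable, Iterator, List


def annotate_partition(
    rows: Iterable[dict],
    call_llm: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> Iterator[dict]:
    """Batch rows, send each batch to the LLM endpoint, and yield the
    rows with the LLM output attached. `call_llm` would wrap the HTTP
    call to our hosted model (placeholder here)."""
    batch: List[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from _flush(batch, call_llm)
            batch = []
    if batch:  # flush the final partial batch
        yield from _flush(batch, call_llm)


def _flush(batch: List[dict], call_llm: Callable[[List[str]], List[str]]):
    # One LLM request per batch; attach each output to its source row.
    outputs = call_llm([r["text"] for r in batch])
    for row, out in zip(batch, outputs):
        yield {**row, "llm_output": out}
```

Keeping the batching inside mapPartitions like this avoids one HTTP
round trip per row, but it still leaves the orchestration question open,
which is what I am asking about below.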

I wanted to know whether there is an easier way to do this within Spark,
or whether there are plans to support such functionality as a
first-class citizen of Spark in the future. Please also suggest any
better alternatives.

Thanks,
Mayur
