Well, I can give you some advice using Google Cloud tools.

Level 0: High-Level Overview


   1. Input: Raw data in Google Cloud Storage (GCS).
   2. Processing:
      - Pre-processing with Dataproc (managed Spark).
      - Inference with the LLM hosted on Cloud Run/Vertex AI.
      - Post-processing with Dataproc (Spark).
   3. Output: Final processed dataset stored in GCS or in the Google BigQuery
      data warehouse.

Level 1: Detailed Data Flow

   1. *Step 1: Pre-Processing*
      - Input: Raw data from GCS.
      - Process: Transform raw data using Spark on *Dataproc*.
      - Output: Pre-processed data stored back in *GCS*.

   2. *Step 2: LLM Inference*
      - Input: Pre-processed data from GCS.
      - Process: Pre-processed data is sent in batches to the LLM inference
        service hosted on *Cloud Run/Vertex AI*; the LLM generates inferences
        for each batch (see the sketch after this list).
      - Output: LLM-inferred results stored in *GCS*.

   3. *Step 3: Post-Processing*
      - Input: LLM-inferred results from *GCS*.
      - Process: Additional transformations, aggregations, or merging with
        other datasets using Spark on *Dataproc*.
      - Output: Final dataset stored in *GCS* or loaded into *Google BigQuery*
        for downstream ML training.
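Below is a minimal PySpark sketch of Step 2 to show the shape of the code.
The Cloud Run endpoint URL, the GCS paths and the request/response schema are
illustrative assumptions only; adapt them to your own deployment.

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("llm-inference").getOrCreate()

# Hypothetical Cloud Run endpoint for the self-hosted LLM.
LLM_ENDPOINT = "https://my-llm-service-xyz.a.run.app/v1/generate"

def infer_batches(iterator):
    """Send each pandas batch of pre-processed rows to the LLM service."""
    for pdf in iterator:
        resp = requests.post(
            LLM_ENDPOINT,
            json={"prompts": pdf["text"].tolist()},  # assumed request schema
            timeout=300,
        )
        resp.raise_for_status()
        pdf["llm_output"] = resp.json()["outputs"]   # assumed response schema
        yield pdf

# Read pre-processed data from GCS, run batched inference, write back to GCS.
df = spark.read.parquet("gs://my-bucket/preprocessed/")
result = df.select("text").mapInPandas(infer_batches,
                                       schema="text string, llm_output string")
result.write.mode("overwrite").parquet("gs://my-bucket/llm_results/")

Running the HTTP calls inside mapInPandas keeps the batching on the Spark
side, so request size and parallelism can be controlled through partitioning.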

*Orchestration*

Use *Cloud Composer* (which sits on top of *Apache Airflow*) or just Airflow
itself.
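
As a rough illustration, a Composer/Airflow DAG for the three stages could
look like the skeleton below, using the Dataproc operator from the Google
provider package. The project, region, cluster name and GCS job paths are
placeholders, not a prescribed setup.

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

PROJECT = "my-project"      # placeholder
REGION = "europe-west2"     # placeholder
CLUSTER = "etl-cluster"     # placeholder

def pyspark_job(main_uri: str) -> dict:
    """Build a Dataproc PySpark job spec for a driver script stored in GCS."""
    return {
        "reference": {"project_id": PROJECT},
        "placement": {"cluster_name": CLUSTER},
        "pyspark_job": {"main_python_file_uri": main_uri},
    }

with DAG(
    dag_id="llm_preprocessing_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule_interval=None,   # trigger manually or per data drop
    catchup=False,
) as dag:
    pre_process = DataprocSubmitJobOperator(
        task_id="pre_process", project_id=PROJECT, region=REGION,
        job=pyspark_job("gs://my-bucket/jobs/pre_process.py"),
    )
    llm_inference = DataprocSubmitJobOperator(
        task_id="llm_inference", project_id=PROJECT, region=REGION,
        job=pyspark_job("gs://my-bucket/jobs/llm_inference.py"),
    )
    post_process = DataprocSubmitJobOperator(
        task_id="post_process", project_id=PROJECT, region=REGION,
        job=pyspark_job("gs://my-bucket/jobs/post_process.py"),
    )

    pre_process >> llm_inference >> post_process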

*Monitoring*

   - Job performance -> Dataproc metrics
   - LLM API throughput and latency -> Cloud Run/Vertex AI metrics
   - Storage and data transfer metrics -> GCS
   - Centralised logs -> Cloud Logging
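
On the logging side, a small example of pulling recent Dataproc errors with
the Cloud Logging client library is sketched below; the project id and the
filter are illustrative.

from google.cloud import logging

client = logging.Client(project="my-project")   # placeholder project
log_filter = 'resource.type="cloud_dataproc_cluster" AND severity>=ERROR'

# Print Dataproc error entries matching the filter.
for entry in client.list_entries(filter_=log_filter, max_results=20):
    print(entry.timestamp, entry.payload)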

*Notes*
The LLM-inferred results are the predictions, insights, or transformations
produced by the LLM on the input data. These results are the outputs of the
model’s reasoning, natural language understanding, or processing capabilities
applied to the input.

HTH

Mich Talebzadeh,

Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one thousand expert
opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Fri, 3 Jan 2025 at 13:08, Mayur Dattatray Bhosale <ma...@sarvam.ai>
wrote:

> Hi team,
>
> We are planning to use Spark for pre-processing the ML training data given
> the data is 500+ TBs.
>
> One of the steps in the data-preprocessing requires us to use a LLM (own
> deployment of model). I wanted to understand what is the right way to
> architect this? These are the options that I can think of:
>
> - Split this into multiple applications at the LLM use case step. Use a
> workflow manager to feed the output of the application-1 to LLM and feed
> the output of LLM to application 2
> - Split this into multiple stages by writing the orchestration code of
> feeding output of the pre-LLM processing stages to externally hosted LLM
> and vice versa
>
> I wanted to know if within Spark there is an easier way to do this or any
> plans of having such functionality as a first class citizen of Spark in
> future? Also, please suggest any other better alternatives.
>
> Thanks,
> Mayur
>
