Hi team,

We are planning to use Spark for pre-processing our ML training data, since the dataset is 500+ TB.
One of the pre-processing steps requires us to run the data through an LLM (our own deployment of the model). I wanted to understand the right way to architect this. These are the options I can think of:

- Split the pipeline into multiple applications at the LLM step, and use a workflow manager to feed the output of application 1 to the LLM and the output of the LLM to application 2.
- Split the pipeline into multiple stages, writing orchestration code that feeds the output of the pre-LLM stages to the externally hosted LLM and vice versa.

Is there an easier way to do this within Spark, or are there plans to make such functionality a first-class citizen of Spark in the future? Please also suggest any better alternatives.

Thanks,
Mayur
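P.S. To make the "within Spark" variant concrete, here is a rough sketch of the kind of thing I mean: calling the hosted LLM directly from the tasks via mapPartitions, batching records so each task amortizes connection setup. All names here are hypothetical, and call_llm is a placeholder for the real HTTP request to our deployment, not a definitive implementation:

```python
from typing import Iterable, Iterator, List

BATCH_SIZE = 32  # tune to the LLM server's throughput


def call_llm(batch: List[str]) -> List[str]:
    # Placeholder for the real call to the self-hosted LLM, e.g. something like
    #   requests.post(LLM_URL, json={"inputs": batch}).json()["outputs"]
    # Here a stand-in transformation so the sketch is runnable.
    return [text.upper() for text in batch]


def annotate_partition(rows: Iterable[str]) -> Iterator[str]:
    """Batch the rows of one partition and send each batch to the LLM,
    yielding the annotated rows back to Spark."""
    batch: List[str] = []
    for row in rows:
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            yield from call_llm(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield from call_llm(batch)


# Inside a Spark application this would be wired up roughly as:
#   annotated = records.mapPartitions(annotate_partition)
# keeping pre-LLM processing, the LLM call, and post-LLM processing
# in a single application instead of splitting it in two.
```

The appeal of this shape is that there is no inter-application handoff to orchestrate; the open question is whether driving an external LLM service from hundreds of concurrent tasks is sane at the 500+ TB scale, which is exactly what I would like guidance on.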