I don't have an answer, but I have the very same questions and am eagerly awaiting a solid response :)
Russell

On Fri, Jan 3, 2025 at 5:07 AM Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
> Hi team,
>
> We are planning to use Spark for pre-processing the ML training data,
> given that the data is 500+ TBs.
>
> One of the steps in the data pre-processing requires us to use an LLM (our
> own deployment of the model). I wanted to understand the right way to
> architect this. These are the options I can think of:
>
> - Split this into multiple applications at the LLM step, and use a
>   workflow manager to feed the output of application 1 to the LLM and the
>   output of the LLM to application 2
> - Split this into multiple stages by writing orchestration code that feeds
>   the output of the pre-LLM processing stages to the externally hosted LLM
>   and vice versa
>
> I wanted to know if there is an easier way to do this within Spark, or
> whether there are plans to make such functionality a first-class citizen
> of Spark in the future. Also, please suggest any better alternatives.
>
> Thanks,
> Mayur
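For reference, one pattern that avoids splitting the pipeline into separate applications is to keep the LLM call inside the Spark job itself, e.g. via `rdd.mapPartitions`, batching rows within each partition before posting them to the externally hosted model. Below is a minimal, hypothetical sketch of the partition-level function only; the `call_llm` stub, batch size, and wiring comment are assumptions for illustration, not anything Spark provides out of the box.

```python
# Sketch of the "call the LLM from inside Spark" approach: a function
# suitable for rdd.mapPartitions, which batches rows within a partition
# and sends each batch to an externally hosted model. call_llm is a
# hypothetical stand-in for an HTTP call to the team's own deployment.

from typing import Callable, Iterable, Iterator, List


def annotate_partition(
    rows: Iterable[str],
    call_llm: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> Iterator[str]:
    """Yield one model response per input row, batching requests so each
    partition amortizes request overhead across batch_size rows."""
    batch: List[str] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from call_llm(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield from call_llm(batch)


# In a real job this would be wired up roughly as:
#   df.rdd.mapPartitions(lambda it: annotate_partition(it, call_llm))
# with call_llm posting the batch to the self-hosted model endpoint.
```

The trade-off versus the workflow-manager options is that LLM latency and failures then live inside Spark tasks, so rate limiting and retries around `call_llm` become the job's responsibility.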