RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Yes, it's actual data. Best regards, Stanislav Porotikov From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 9:43 PM Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Is this generated data actual data, or are you testing the application? Sounds like a form

Fwd: the life cycle shuffle Dependency

2023-12-27 Thread yang chen
Hi, I'm learning Spark and wonder when shuffle data gets deleted. I found the ContextCleaner class, which cleans the shuffle data when the shuffle dependency is GC-ed. Based on the source code, the shuffle dependency is GC-ed only when the active job finishes, but I'm not sure. Could you explain the life cycle of
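
To make the question concrete, here is a minimal PySpark sketch (not from the thread) of the behaviour being asked about: shuffle files are written when a ShuffleDependency is created, survive the job that produced them, and only become eligible for removal by ContextCleaner once the dependency becomes unreachable and is garbage-collected on the driver (assuming spark.cleaner.referenceTracking is left at its default of true). The forced GC call uses a private attribute and is for demonstration only; the actual cleanup is asynchronous.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-cleanup-demo")
    # Reference tracking is what lets ContextCleaner remove shuffle data
    # once the corresponding ShuffleDependency is garbage-collected.
    .config("spark.cleaner.referenceTracking", "true")
    .getOrCreate()
)
sc = spark.sparkContext

rdd = sc.parallelize(range(100_000)).map(lambda x: (x % 10, x))
grouped = rdd.groupByKey()   # creates a ShuffleDependency; shuffle files are written
print(grouped.count())       # the job finishes, but shuffle files are kept for reuse

# Dropping all references makes the ShuffleDependency unreachable; once the
# JVM GC collects it, ContextCleaner asynchronously tells the block managers
# to delete the corresponding shuffle files.
del grouped, rdd
sc._jvm.System.gc()          # nudge the JVM GC; demonstration only

spark.stop()                 # stopping the application also removes remaining shuffle data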

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Actually, it's JSON with a specific structure from an API server. The task is to constantly check whether new data appears on the API server and load it into Kafka. The full pipeline can be presented like this: REST API -> Kafka -> some processing -> Kafka/Mongo -> … Best regards, Stanislav Porotikov From: Mich
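
For the REST API -> Kafka step described above, one possible approach (a sketch, not something settled in the thread) is to use the built-in rate source purely as a clock and poll the REST API inside foreachBatch on the driver, then write whatever came back to Kafka as a small batch. The endpoint URL, broker address, topic name and checkpoint path below are hypothetical, the API is assumed to return a JSON array, and the spark-sql-kafka connector is assumed to be on the classpath.

import json
import requests                                   # assumption: installed on the driver
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rest-to-kafka-poller").getOrCreate()

API_URL = "https://example.com/api/new-data"      # hypothetical endpoint
KAFKA_BOOTSTRAP = "localhost:9092"                # hypothetical broker
TOPIC = "raw_api_events"                          # hypothetical topic

def poll_and_publish(batch_df, batch_id):
    """Runs once per micro-batch on the driver; the rate rows only act as a tick."""
    resp = requests.get(API_URL, timeout=10)
    resp.raise_for_status()
    records = resp.json()                         # assume a JSON array of objects
    if not records:
        return
    out = spark.createDataFrame([(json.dumps(r),) for r in records], ["value"])
    (out.write
        .format("kafka")
        .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP)
        .option("topic", TOPIC)
        .save())

ticks = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

query = (ticks.writeStream
         .trigger(processingTime="30 seconds")    # poll the API every 30 seconds
         .option("checkpointLocation", "/tmp/rest_poller_chk")  # hypothetical path
         .foreachBatch(poll_and_publish)
         .start())
query.awaitTermination()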

Re: Pyspark UDF as a data source for streaming

2023-12-27 Thread Mich Talebzadeh
OK, so you want to generate some random data and load it into Kafka at a regular interval, and then the rest? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Hello! Is it possible to write a PySpark UDF that generates data for a streaming dataframe? I want to get some data from REST API requests in real time and save this data to a dataframe, and then put it to Kafka. I can't figure out how to create a streaming dataframe from generated data. I am new in
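
One way this is commonly approached (a sketch, assuming Spark 3.x with the spark-sql-kafka connector available, and that the requests library is installed on the executors) is to read the built-in rate source as the streaming dataframe and let a UDF do the fetching for each tick, then write the result straight to Kafka. The endpoint, broker, topic and checkpoint path are made-up placeholders. Compared with the foreachBatch sketch earlier in this digest, this keeps everything inside one streaming query, but the HTTP calls run on the executors rather than the driver.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-generated-stream").getOrCreate()

@udf(returnType=StringType())
def fetch_payload(tick):
    """Runs on the executors; returns the raw JSON for one poll of the API."""
    import requests                                          # assumption: available on executors
    resp = requests.get("https://example.com/api/new-data",  # hypothetical endpoint
                        timeout=10)
    return resp.text

# The rate source provides (timestamp, value) rows; each row triggers one API call.
stream = (spark.readStream.format("rate")
          .option("rowsPerSecond", 1)
          .load()
          .withColumn("value", fetch_payload(col("value"))))

query = (stream.select("value")                              # Kafka sink expects a "value" column
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")   # hypothetical broker
         .option("topic", "raw_api_events")                     # hypothetical topic
         .option("checkpointLocation", "/tmp/udf_stream_chk")   # hypothetical path
         .start())
query.awaitTermination()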