Hi all,

I have IoT time series data in Kafka and I am reading it into a static DataFrame as:

from pyspark.sql.functions import col, from_json

# schema is the JSON schema of the payload, defined elsewhere
df = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test_topic") \
    .option("failOnDataLoss", "false") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("stream")) \
    .select("stream.*") \
    .withColumn("time", col("time").cast("timestamp")) \
    .orderBy("dev_id")
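The downstream step looks roughly like this (a sketch: result_schema and process_device are placeholders for my real output schema and per-device logic):

from pyspark.sql.functions import pandas_udf, PandasUDFType

# placeholder for the output schema of the per-device step
result_schema = "dev_id string, time timestamp"

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def process_device(pdf):
    # pdf is a pandas DataFrame holding every row of a single dev_id
    return pdf  # stand-in for the real per-device logic

# force all rows of a device onto the same partition, then process per group
out = df.repartition("dev_id") \
        .groupby("dev_id") \
        .apply(process_device)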
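To check the current layout, I believe something like this should work (a sketch using the built-in spark_partition_id; the pid column name is mine):

from pyspark.sql.functions import spark_partition_id

# count rows per (partition, device); ideally each dev_id maps to one partition
df.withColumn("pid", spark_partition_id()) \
  .groupBy("pid", "dev_id") \
  .count() \
  .show()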
What I want to understand is how this data actually gets distributed over the executors. I want it partitioned by dev_id, so that each partition (and hence each executor task) gets all the data for one dev_id, since later I group by dev_id and run a @pandas_udf on each group, as in the sketch above. Please note that this is a static DataFrame, not a streaming one, as @pandas_udf does not support streaming DataFrames.

--
Regards,
Arbab Khalil
Software Design Engineer