[jira] [Updated] (SPARK-47842) Spark job relying over Hudi are blocked after one or zero commit
[ https://issues.apache.org/jira/browse/SPARK-47842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47842:
Labels: pull-request-available  (was: )

> Key: SPARK-47842
> URL: https://issues.apache.org/jira/browse/SPARK-47842
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Structured Streaming
> Affects Versions: 3.3.0
> Environment: Hudi version: 0.12.1-amzn-0
>              Spark version: 3.3.0
>              Hive version: 3.1.3
>              Hadoop version: 3.3.3 (Amazon)
>              Storage (HDFS/S3/GCS..): S3
>              Running on Docker? (yes/no): no (EMR 6.9.0)
> Reporter: alessandro pontis
> Priority: Blocker
> Labels: pull-request-available
> Attachments: console_spark.png
>
> Hello, we are seeing that some PySpark jobs relying on Hudi appear to be
> blocked. The Spark console shows the situation in the attachment: 71
> completed jobs, even though these are CDC processes that should read from a
> Kafka topic continuously. We have verified that messages are still queued on
> the Kafka topic. If the application is killed and restarted, in some cases
> the job behaves normally; in other cases it remains stuck.
>
> Our deployment is as follows: we read INSERT, UPDATE, and DELETE operations
> from a Kafka topic and replicate them into a target Hudi table registered in
> Hive, via a PySpark job running 24/7.
>
> PYSPARK WRITE
> df_source.writeStream.foreachBatch(foreach_batch_write_function)
>
> FOR EACH BATCH FUNCTION:
> # management of delete messages
> batchDF_deletes.write.format('hudi') \
>     .option('hoodie.datasource.write.operation', 'delete') \
>     .options(**hudiOptions_table) \
>     .mode('append') \
>     .save(S3_OUTPUT_PATH)
> # management of update and insert messages
> batchDF_upserts.write.format('org.apache.hudi') \
>     .option('hoodie.datasource.write.operation', 'upsert') \
>     .options(**hudiOptions_table) \
>     .mode('append') \
>     .save(S3_OUTPUT_PATH)
>
> SPARK SUBMIT
> spark-submit --master yarn --deploy-mode cluster \
>     --num-executors 1 --executor-memory 1G --executor-cores 2 \
>     --conf spark.dynamicAllocation.enabled=false \
>     --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
>     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>     --conf spark.sql.hive.convertMetastoreParquet=false \
>     --jars /usr/lib/hudi/hudi-spark-bundle.jar

--
This message was sent by Atlassian Jira (v8.20.10#820010)
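The reported foreachBatch function routes each micro-batch into two Hudi writes: deleted records with the 'delete' operation, inserted/updated records with 'upsert'. A minimal, Spark-free sketch of that routing logic follows; the `op` field name, the record shape, and the writer callables are illustrative assumptions, not part of the reported job.

```python
# Spark-free sketch of the per-batch routing described in the report:
# split one CDC micro-batch by operation type, then hand each part to
# its writer (standing in for the two Hudi writes). The 'op' field and
# record shape are assumptions for illustration.

def split_batch(records):
    """Split one micro-batch into (deletes, upserts) by CDC operation."""
    deletes = [r for r in records if r["op"] == "DELETE"]
    upserts = [r for r in records if r["op"] in ("INSERT", "UPDATE")]
    return deletes, upserts

def foreach_batch_write(records, write_deletes, write_upserts):
    """Mirror of the reported foreach_batch_write_function: route, then write.

    write_deletes / write_upserts stand in for the two Hudi writes
    (operation 'delete' and operation 'upsert') in the original job.
    """
    deletes, upserts = split_batch(records)
    if deletes:
        write_deletes(deletes)   # hoodie.datasource.write.operation = 'delete'
    if upserts:
        write_upserts(upserts)   # hoodie.datasource.write.operation = 'upsert'
```

Note that in the real job the two writes run sequentially inside one foreachBatch invocation, so a hang in either Hudi write blocks the whole streaming query, which matches the reported symptom.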
[ https://issues.apache.org/jira/browse/SPARK-47842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

alessandro pontis updated SPARK-47842:
Description: updated (full text as quoted above)
[ https://issues.apache.org/jira/browse/SPARK-47842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

alessandro pontis updated SPARK-47842:
Attachment: console_spark.png
[ https://issues.apache.org/jira/browse/SPARK-47842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

alessandro pontis updated SPARK-47842:
Attachment: (was: Screenshot_20 1.png)
[ https://issues.apache.org/jira/browse/SPARK-47842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

alessandro pontis updated SPARK-47842:
Description: updated (full text as quoted above; the earlier revision embedded the screenshot inline as !image-2024-04-13-15-57-56-605.png!)
[ https://issues.apache.org/jira/browse/SPARK-47842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

alessandro pontis updated SPARK-47842:
Attachment: Screenshot_20 1.png