Spark 2.4 lifetime

2020-11-10 Thread Netanel Malka
Hi folks, do you know how long Spark will continue to maintain version 2.4? Thanks. -- Best regards, Netanel Malka.

Pyspark application hangs (no error messages) on Python RDD .map

2020-11-10 Thread Daniel Stojanov
Hi, This code hangs indefinitely at the last line (the .map()). Interestingly, if I run the same code at the beginning of my application (removing the .write step), it executes as expected. Otherwise, the code appears further along in my application, which is where it hangs. The debugging
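A minimal sketch of the pattern being described, with placeholder data and paths; map() itself is lazy, so the stall would normally surface at the first action that forces it:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("map-after-write").getOrCreate()

    df = spark.range(1000).toDF("id")  # stand-in for the real DataFrame

    # The .write step that precedes the reported hang.
    df.write.mode("overwrite").parquet("/tmp/map_after_write")

    # The Python-level RDD map that reportedly hangs.
    doubled = df.rdd.map(lambda row: row.id * 2)
    print(doubled.take(5))  # action that triggers evaluation

Comparing the Spark UI at the moment of the hang (a stuck stage versus no job submitted at all) helps distinguish an executor-side problem from a driver-side one.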

Blacklisting in Spark Stateful Structured Streaming

2020-11-10 Thread Eric Beabes
Currently we’ve a “Stateful” Spark Structured Streaming job that computes aggregates for each ID. I need to implement a new requirement which says that if the number of incoming messages for a particular ID exceeds a certain value, then add this ID to a blacklist and remove the state for it. Going
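A minimal sketch of the state-removal piece using applyInPandasWithState (PySpark 3.4+; on older versions the equivalent is mapGroupsWithState in Scala/Java). The threshold, column names, and the events stream are assumptions:

    import pandas as pd
    from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

    THRESHOLD = 100_000  # assumed blacklist cutoff

    def check_blacklist(key, pdfs, state: GroupState):
        # Running message count for this ID, kept in the state store.
        count = state.get[0] if state.exists else 0
        for pdf in pdfs:
            count += len(pdf)
        if count > THRESHOLD:
            state.remove()  # drop the state for the blacklisted ID
            yield pd.DataFrame({"id": [key[0]], "blacklisted": [True]})
        else:
            state.update((count,))
            yield pd.DataFrame({"id": [key[0]], "blacklisted": [False]})

    blacklisted = events.groupBy("id").applyInPandasWithState(
        check_blacklist,
        outputStructType="id string, blacklisted boolean",
        stateStructType="count long",
        outputMode="update",
        timeoutConf=GroupStateTimeout.NoTimeout)

Note that once state.remove() runs, a later message for the same ID would recreate the state, so the blacklist itself has to be persisted somewhere outside the state store.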

Re: Cannot perform operation after producer has been closed

2020-11-10 Thread Eric Beabes
BTW, we are seeing this message as well: "org.apache.kafka.common.KafkaException: Producer closed while send in progress". I am assuming this happens because of the previous issue, "producer has been closed", right? Or are they unrelated? Please advise. Thanks. On Tue, Nov 10, 2020 at 11:17

Re: Cannot perform operation after producer has been closed

2020-11-10 Thread Eric Beabes
Thanks for the reply. We are on Spark 2.4. Is there no way to get this fixed in Spark 2.4? On Mon, Nov 2, 2020 at 8:32 PM Jungtaek Lim wrote: > Which Spark version do you use? There's a known issue with the Kafka producer > pool in Spark 2.x which was fixed in Spark 3.0, so you may want to check >

DStreams stop consuming from Kafka

2020-11-10 Thread Razvan-Daniel Mihai
Hello, I have a use case where I have to stream events from Kafka to a JDBC sink. Kafka producers write events in bursts of hourly batches. I started with a Structured Streaming approach, but it turns out that Structured Streaming has no JDBC sink. I found an implementation in Apache Bahir, but
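A common workaround, available since Spark 2.4, is foreachBatch, which hands each micro-batch to a function as a plain DataFrame so the built-in batch JDBC writer can be used. Connection details, topic, and table names below are hypothetical:

    def write_batch(batch_df, batch_id):
        (batch_df.write
            .format("jdbc")
            .option("url", "jdbc:postgresql://dbhost:5432/events")  # assumed URL
            .option("dbtable", "event_log")                         # assumed table
            .option("user", "writer")
            .option("password", "secret")
            .mode("append")
            .save())

    query = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # assumed brokers
        .option("subscribe", "events")                     # assumed topic
        .load()
        .selectExpr("CAST(value AS STRING) AS value")
        .writeStream
        .foreachBatch(write_batch)
        .start())

Since the producers write in hourly bursts, a processing-time trigger with a long interval (e.g. trigger(processingTime="1 hour")) can batch the writes instead of running continuously.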

Spark Parquet file size

2020-11-10 Thread Tzahi File
Hi, We have many Spark jobs that create multiple small files. We would like to improve analysts' read performance, so I'm testing the optimal Parquet file size. I've found that the optimal file size should be around 1 GB, and not less than 128 MB, depending on the size of the data. I took
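A sketch of one way to compact output toward a target file size; the input size and the 1 GB target are assumptions, and real on-disk sizes also depend on compression and encoding:

    TARGET_FILE_BYTES = 1024 ** 3  # ~1 GB per file, per the thread

    df = spark.read.parquet("hdfs:///warehouse/small_files/")  # assumed input

    input_bytes = 50 * 1024 ** 3  # replace with the measured dataset size
    num_files = max(1, input_bytes // TARGET_FILE_BYTES)

    # repartition shuffles into roughly equal-sized partitions, so each
    # partition becomes one output file of about the target size.
    df.repartition(num_files).write.mode("overwrite").parquet(
        "hdfs:///warehouse/compacted/")

When only reducing the file count, coalesce(num_files) avoids the shuffle, at the cost of less even file sizes.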

Creating hive table through df.write.mode("overwrite").saveAsTable("DB.TABLE")

2020-11-10 Thread Mich Talebzadeh
Hi, In Spark I explicitly specify the format of the table to be created: sqltext = """ CREATE TABLE test.randomDataPy(ID INT, CLUSTERED INT, SCATTERED INT, RANDOMISED INT, RANDOM_STRING VARCHAR(50), SMALL_VC VARCHAR(50), PADDING VARCHAR(4000)
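A sketch of the distinction that usually matters here, using the DDL from the message (the storage clause is an assumption): saveAsTable in overwrite mode re-creates the table from the DataFrame's own schema, while insertInto writes into the existing definition and keeps the declared VARCHAR types:

    spark.sql("""
    CREATE TABLE IF NOT EXISTS test.randomDataPy (
      ID INT,
      CLUSTERED INT,
      SCATTERED INT,
      RANDOMISED INT,
      RANDOM_STRING VARCHAR(50),
      SMALL_VC VARCHAR(50),
      PADDING VARCHAR(4000)
    ) STORED AS PARQUET
    """)

    # insertInto targets the table created above instead of replacing it.
    df.write.mode("overwrite").insertInto("test.randomDataPy")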

Distribution of spark 3.0.1 with Hive1.2

2020-11-10 Thread Dmitry
Hi all, I am trying to build a Spark 3.0.1 distribution with Hive 1.2 using ./dev/make-distribution.sh --name spark3-hive12 --pip --tgz -Phive-1.2 -Phadoop-2.7 -Pyarn. The problem is that Maven can't find the right profile for Hive, and the build ends without the Hive jars. ++ /Users/reireirei/spark/spark/build/mvn

spark cassandra question

2020-11-10 Thread adfel70
I am very new to both Spark and Spark Structured Streaming. I have to write an application that receives very large CSV files in an HDFS folder. The app must take each file, and for each row it must read some rows from a Cassandra database (not many rows will be returned for each row in the CSV).
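A minimal sketch of one approach with the spark-cassandra-connector, reading the CSV as a DataFrame and joining it against the Cassandra table; the keyspace, table, paths, and join key are hypothetical:

    csv_df = (spark.read
        .option("header", "true")
        .csv("hdfs:///incoming/*.csv"))  # assumed HDFS path

    lookup_df = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="ks", table="lookup")  # assumed keyspace/table
        .load())

    # Join each CSV row against the Cassandra rows it needs; with recent
    # connector versions an equi-join on the partition key can be executed
    # as direct per-key reads rather than a full table scan.
    enriched = csv_df.join(lookup_df, on="key_column", how="left")
    enriched.write.mode("overwrite").parquet("hdfs:///enriched/")  # assumed output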