Re: How to read json data from kafka and store to hdfs with spark structured streaming?

2018-07-26 Thread Tathagata Das
Are you writing multiple streaming query output to the same location? If so, I can see this error occurring. Multiple streaming queries writing to the same directory is not supported. On Tue, Jul 24, 2018 at 3:38 PM, dddaaa wrote: > I'm trying to read json messages from kafka and store them in
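
One way to avoid the clash described above is to give every streaming query its own output directory and its own checkpoint location. The following is only a minimal sketch: the helper names `sink_options` and `start_parquet_query`, the directory layout, and the choice of the Parquet sink are all assumptions for illustration and do not come from the thread.

```python
import posixpath


def sink_options(base_dir: str, query_name: str) -> dict:
    """Return a private output path and checkpoint location for one query.

    The "<base>/<query>/data" and "<base>/<query>/checkpoint" layout is an
    assumed convention, not something Spark requires.
    """
    return {
        "path": posixpath.join(base_dir, query_name, "data"),
        "checkpointLocation": posixpath.join(base_dir, query_name, "checkpoint"),
    }


def start_parquet_query(df, base_dir: str, query_name: str):
    """Start one streaming query that writes to its own directory.

    `df` is assumed to be a streaming DataFrame, for example one loaded
    from Kafka with spark.readStream.format("kafka").
    """
    opts = sink_options(base_dir, query_name)
    return (
        df.writeStream
        .format("parquet")  # assumed sink format
        .queryName(query_name)
        .option("path", opts["path"])
        .option("checkpointLocation", opts["checkpointLocation"])
        .start()
    )
```

Calling `start_parquet_query` once per query with a different `query_name` keeps the queries from sharing a directory.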

Re: Backpressure initial rate not working

2018-07-26 Thread Biplob Biswas
Hi Todd, Thanks a lot, that works. Although I am curious whether you know why the initialRate setting is not kicking in? But for now the pipeline is usable again. Thanks a lot. Thanks & Regards Biplob Biswas On Thu, Jul 26, 2018 at 3:03 PM Todd Nist wrote: > Have you tried reducing the

Re: Exceptions with simplest Structured Streaming example

2018-07-26 Thread Tathagata Das
Unfortunately, your output is not visible in the email that we see. Was it an image that somehow got removed? Maybe best to copy the output text (i.e. the error message) into the email. On Thu, Jul 26, 2018 at 5:41 AM, Jonathan Apple wrote: > Hello, > > There is a streaming Word Count example at

Re: Exceptions with simplest Structured Streaming example

2018-07-26 Thread Jonathan Apple
(My apologies; I used Nabble to post and it stripped out the HTML) The original message is below, but note that we just had the issue solved on Stack Overflow: https://stackoverflow.com/questions/51541134/pyspark-exceptions-with-simplest-structured-streaming-example Turns out it's a known issue

Re: Backpressure initial rate not working

2018-07-26 Thread Biplob Biswas
Did anyone face a similar issue? Is there any viable way to solve this? Thanks & Regards Biplob Biswas On Wed, Jul 25, 2018 at 4:23 PM Biplob Biswas wrote: > I have enabled the spark.streaming.backpressure.enabled setting and also > set spark.streaming.backpressure.initialRate to 15000, but my

Re: Backpressure initial rate not working

2018-07-26 Thread Todd Nist
Hi Biplob, How many partitions are on the topic you are reading from and have you set the maxRatePerPartition? iirc, spark back pressure is calculated as follows: *Spark back pressure:* Back pressure is calculated off of the following: • maxRatePerPartition=200 • batchInterval 30s • 3

Re: Backpressure initial rate not working

2018-07-26 Thread Biplob Biswas
Hi Todd, Thanks for the reply. I have the maxRatePerPartition set as well. Below is the spark submit config we used and still got the issue. Also the *batch interval is set at 10s* and *number of partitions on the topic is set to 4* : spark2-submit --name "${YARN_NAME}" \ --master yarn \

Exceptions with simplest Structured Streaming example

2018-07-26 Thread Jonathan Apple
Hello, There is a streaming Word Count example at the beginning of the Structured Streaming Programming Guide. First, we execute *nc -lk * in a separate terminal. Next, following the Python code, we have

Re: Backpressure initial rate not working

2018-07-26 Thread Todd Nist
Have you tried reducing the maxRatePerPartition to a lower value? Based on your settings, I believe you are going to be able to pull *600K* worth of messages from Kafka, basically: • maxRatePerPartition=15000 • batchInterval 10s • 4 partitions on Ingest topic This results in a maximum
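
The 600K figure quoted above follows from multiplying the three settings together, since maxRatePerPartition is a per-second, per-partition limit. A small sketch of that arithmetic, where the function name `max_messages_per_batch` is purely illustrative:

```python
def max_messages_per_batch(max_rate_per_partition: int,
                           batch_interval_s: int,
                           num_partitions: int) -> int:
    """Upper bound on messages pulled from Kafka in one batch.

    max_rate_per_partition is in messages per second per partition,
    so the bound is rate * seconds per batch * partitions.
    """
    return max_rate_per_partition * batch_interval_s * num_partitions


# Settings from the message above: 15000 msg/s per partition, 10 s batches, 4 partitions.
print(max_messages_per_batch(15000, 10, 4))  # 600000
```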

Optimizing a join with bucketing

2018-07-26 Thread Vitaliy Pisarev
I am joining two entities. One of the entities weighs ~0.5 TB. The other weighs ~16 GB. Both are stored in parquet. Another trait of the problem is that the "smaller" entity does not change, so I figured I'd pre-bucket it to improve performance. * What are the guidelines for deciding the best