Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Abdeali Kothari
In a bash terminal, can you do: *export PYSPARK_DRIVER_PYTHON=/path/to/venv/bin/python* and then run the *spark-submit* script? This should mimic the behaviour of jupyter in spark-submit and should be fast (1-2 mins, similar to the jupyter notebook). This would confirm the guess that the python2.7 venv

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Charles vinodh
The only form of rate limiting I have set is *maxOffsetsPerTrigger* and *fetch.message.max.bytes*. *"may be that you are trying to process records that have passed the retention period within Kafka."* If the above is true then I should have my offsets reset only once, ideally, when my

Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Charles vinodh
Hi, I am trying to run a Spark application ingesting data from Kafka using Spark Structured Streaming and the library org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.1. I am facing a very weird issue where, during execution of all my micro-batches, the Kafka consumer is not able to fetch
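The message is truncated, but the setup it describes boils down to a handful of Kafka source options. The option names below are the real ones for the spark-sql-kafka-0-10 source; the broker address, topic name, and values are placeholders of mine, not taken from the thread:

```python
# Options for Spark's Kafka source (spark-sql-kafka-0-10). Broker and
# topic values are illustrative placeholders, not from the thread.
kafka_options = {
    "kafka.bootstrap.servers": "broker1:9092",  # placeholder address
    "subscribe": "my_topic",                    # placeholder topic
    "startingOffsets": "latest",
    "maxOffsetsPerTrigger": "10000",            # rate limit per micro-batch
    "failOnDataLoss": "false",                  # default: silent reset on lost offsets
}

# In the actual application these would be applied as:
#   df = (spark.readStream.format("kafka")
#         .options(**kafka_options)
#         .load())
print(sorted(kafka_options))
```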

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Sandish Kumar HN
You can see this kind of error if the consumer lag exceeds the Kafka retention period. You will not see any failures if the option below is not set. Set failOnDataLoss=true to see failures. On Wed, Sep 11, 2019 at 3:24 PM Charles vinodh wrote: > The only form of rate limiting I have set
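The behaviour Sandish describes can be sketched as a toy function (my illustration of the mechanism, not Spark's actual code): when the requested offset has aged out of retention, the source either raises or silently skips to the earliest retained offset, depending on failOnDataLoss:

```python
def resolve_offset(requested, earliest_retained, fail_on_data_loss):
    """Return the offset to actually read from.

    If `requested` has fallen behind `earliest_retained` (the data was
    deleted by Kafka retention), either raise -- mimicking
    failOnDataLoss=true -- or skip forward silently, mimicking the
    failOnDataLoss=false behaviour that hides the data loss.
    """
    if requested >= earliest_retained:
        return requested
    if fail_on_data_loss:
        raise RuntimeError(
            f"Offsets out of range: requested {requested}, "
            f"earliest retained {earliest_retained}"
        )
    # Data loss is hidden: consumption restarts at the earliest offset.
    return earliest_retained

# Lag within retention: nothing special happens.
print(resolve_offset(100, 50, fail_on_data_loss=False))  # 100
# Lag exceeded retention and failOnDataLoss=false: silent reset.
print(resolve_offset(10, 50, fail_on_data_loss=False))   # 50
```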

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Burak Yavuz
Do you have rate limiting set on your stream? It may be that you are trying to process records that have passed the retention period within Kafka. On Wed, Sep 11, 2019 at 2:39 PM Charles vinodh wrote: > > Hi, > > I am trying to run a spark application ingesting data from Kafka using the > Spark

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Charles vinodh
Hi Sandish, as I have said, if the offset reset happened only once that would make sense. But I am not sure how to explain why the offset reset is happening for every micro-batch... ideally once the offset reset happens the app should move to a valid offset and start consuming data. but in my case

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Charles vinodh
Thanks Dhaval, that fixed the issue. The constant resetting of Kafka offsets misled me about the issue. Please feel free to answer the SO question here if you would like to..

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Burak Yavuz
Hey Charles, If you are using maxOffsetsPerTrigger, you will likely reset the offsets every micro-batch, because: 1. Spark will figure out a range of offsets to process (let's call them x and y) 2. If these offsets have fallen out of the retention period, Spark will try to set the offset to x
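Burak's two steps can be sketched as a toy offset-range calculation (my illustration of the mechanism, not Spark's actual implementation): a stream that stays permanently behind the retention window re-triggers the reset on every planned micro-batch, which matches the symptom in this thread.

```python
def plan_micro_batch(last_committed, latest, earliest_retained,
                     max_offsets_per_trigger):
    """Pick the offset range [x, y) for the next micro-batch.

    Step 1: take up to max_offsets_per_trigger records after the last
    committed offset. Step 2: if the start of that range has aged out
    of the Kafka retention window, reset it to the earliest retained
    offset. A stream permanently behind retention resets every batch.
    """
    x = last_committed
    y = min(latest, x + max_offsets_per_trigger)
    was_reset = x < earliest_retained
    if was_reset:
        x = earliest_retained
        y = min(latest, x + max_offsets_per_trigger)
    return x, y, was_reset

# Healthy stream: planned range starts at the committed offset.
print(plan_micro_batch(100, 10_000, 0, 500))      # (100, 600, False)
# Stream behind retention: every batch starts with a reset.
print(plan_micro_batch(100, 10_000, 5_000, 500))  # (5000, 5500, True)
```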

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Dhaval Patel
Hi Charles, Can you check whether any of the cases related to the output directory and checkpoint location mentioned in the link below apply in your case? https://kb.databricks.com/streaming/file-sink-streaming.html Regards Dhaval On Wed, Sep 11, 2019 at 9:29 PM Burak Yavuz wrote: > Hey Charles, >

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
Hi, I just ran the same script in a shell in the jupyter notebook and found the performance to be similar. So I can confirm this is happening because the libraries used by the jupyter notebook python are different from those used by the spark-submit python. But now I have a following question. Are the dependent

Re: Access all of the custom streaming query listeners that were registered to spark session

2019-09-11 Thread Gabor Somogyi
Hi Natalie, In an ideal situation it would be good to keep track of the registered listeners. If this is not possible, in 3.0 a new feature was added here: https://github.com/apache/spark/pull/25518 BR, G On Tue, Sep 10, 2019 at 10:18 PM Natalie Ruiz wrote: > Hello, > > > > Is there a way to access
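Gabor's first suggestion, keeping track of the registered listeners yourself, can be sketched with a small application-side wrapper (a hypothetical helper of mine, not a Spark API; it only assumes the manager exposes addListener/removeListener, which Spark's StreamingQueryManager does):

```python
class TrackingListenerRegistry:
    """Wraps a streaming-query manager's addListener/removeListener so
    the application can later enumerate everything it registered.
    Application-side bookkeeping, not part of the Spark API."""

    def __init__(self, manager):
        self._manager = manager
        self._listeners = []

    def add(self, listener):
        self._manager.addListener(listener)
        self._listeners.append(listener)

    def remove(self, listener):
        self._manager.removeListener(listener)
        self._listeners.remove(listener)

    @property
    def listeners(self):
        # Return a copy so callers cannot mutate the internal registry.
        return list(self._listeners)


# A stand-in with the same add/remove surface, purely for illustration;
# in a real app this would be spark.streams.
class FakeManager:
    def addListener(self, listener): pass
    def removeListener(self, listener): pass

registry = TrackingListenerRegistry(FakeManager())
registry.add("metrics-listener")
registry.add("audit-listener")
print(registry.listeners)  # ['metrics-listener', 'audit-listener']
```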

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Patrick McCarthy
Are you running in cluster mode? A large virtualenv zip for the driver sent into the cluster on a slow pipe could account for much of that eight minutes. On Wed, Sep 11, 2019 at 3:17 AM Dhrubajyoti Hati wrote: > Hi, > > I just ran the same script in a shell in jupyter notebook and find the >

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Abdeali Kothari
The driver python may not always be the same as the executor python. You can set these using PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON. The dependent libraries are not transferred by spark in any way unless you do a --py-files or .addPyFile(). Could you try this: *import sys; print(sys.prefix)* on
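Abdeali's suggested check can be extended into a small diagnostic to run in both the notebook and the spark-submitted script (pure Python; the environment-variable names are the ones PySpark actually reads, the rest is illustrative):

```python
import os
import sys

# sys.prefix identifies the virtualenv (or system install) the current
# interpreter came from; comparing it between the jupyter run and the
# spark-submit run shows whether the two use the same environment.
print("prefix    :", sys.prefix)
print("executable:", sys.executable)
print("version   :", sys.version.split()[0])

# PySpark consults these when launching the driver and executor pythons;
# if unset, it falls back to the default python on PATH.
for var in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")
```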

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
But would it be the case for multiple tasks running on the same worker? Also, both the tasks are running in client mode, so whatever is true for one is true for both, or for neither. As mentioned earlier, all the confs are the same. I have checked and compared each conf. As Abdeali mentioned, it must be because

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
If you say that libraries are not transferred by default, and in my case I haven't used any --py-files, then just because the driver python is different I am facing a 6x speed difference? I am using client mode to submit the program, but the udfs and all are executed in the executors, then why is

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
Also, the performance remains identical when running the same script from the jupyter terminal instead of the normal terminal. In the script the spark context is created by the spark = SparkSession \ .builder \ .. .. getOrCreate() command On Wed, Sep 11, 2019 at 10:28 PM Dhrubajyoti Hati wrote: > If