Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Jungtaek Lim
I think Spark is trying to ensure that it reads the input "continuously" without missing any data. Technically it may be valid to say the situation is a kind of "data loss", as the query couldn't process the offsets that were thrown out, and the owner of the query needs to be careful as it affects

Re: Going it alone.

2020-04-14 Thread yeikel valdes
There are many use cases for Spark. A Google search for "use cases for Apache Spark" will give you all the information that you need. On Tue, 14 Apr 2020 18:44:59 -0400 janethor...@aol.com.INVALID wrote I did write a long email in response to you. But then I deleted it

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
I see, I wasn’t sure if that would work as expected. The docs seem to suggest being careful before turning off that option, and I’m not sure why failOnDataLoss is true by default. On Tue, Apr 14, 2020 at 5:16 PM Burak Yavuz wrote: > Just set `failOnDataLoss=false` as an option in readStream? >

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Burak Yavuz
Just set `failOnDataLoss=false` as an option in readStream? On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote: > Hi all, > > I have a Spark structured streaming app that is consuming from a Kafka > topic with retention set up. Sometimes I face an issue where my query has > not finished
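
A minimal sketch of that suggestion (the broker address and topic name are hypothetical); with failOnDataLoss=false the query logs a warning instead of failing when retention has already removed the requested offsets:

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "my-topic")                  // hypothetical topic
      .option("failOnDataLoss", "false")
      .load()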

Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
Hi all, I have a Spark structured streaming app that is consuming from a Kafka topic with retention set up. Sometimes I face an issue where my query has not finished processing a message, but retention kicks in and deletes the offset; since I use the default setting of

Re: Going it alone.

2020-04-14 Thread jane thorpe
I did write a long email in response to you. But then I deleted it because I felt it would be too revealing. On Tuesday, 14 April 2020 David Hesson wrote: I want to know if Spark is headed in my direction. You are implying Spark could be. What direction are you headed in, exactly?

Re: Going it alone.

2020-04-14 Thread David Hesson
> > I want to know if Spark is headed in my direction. > You are implying Spark could be. What direction are you headed in, exactly? I don't feel as if anything were implied when you were asked for use cases or what problem you are solving. You were asked to identify some use cases, of which

Re: Going it alone.

2020-04-14 Thread jane thorpe
That's what I want to know: use cases. I am looking for direction, as I described, and I want to know if Spark is headed in my direction. You are implying Spark could be. So tell me about the USE CASES and I'll do the rest. On Tuesday, 14 April 2020 yeikel valdes wrote: It depends on

Cross Region Apache Spark Setup

2020-04-14 Thread Stone Zhong
Hi, I am trying to set up a cross-region Apache Spark cluster. All my data is stored in Amazon S3 and well partitioned by region. For example, I have parquet files at S3://mybucket/sales_fact.parquet/us-west S3://mybucket/sales_fact.parquet/us-east S3://mybucket/sales_fact.parquet/uk
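
A minimal sketch of reading the per-region partitions as one dataset (the bucket and layout come from the question; the region list and the choice of s3:// vs s3a:// scheme depend on your filesystem connector):

    val regions = Seq("us-west", "us-east", "uk")
    val sales = spark.read.parquet(
      regions.map(r => s"s3://mybucket/sales_fact.parquet/$r"): _*)
    sales.createOrReplaceTempView("sales_fact")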

Re: Going it alone.

2020-04-14 Thread yeikel valdes
It depends on your use case. What are you trying to solve? On Tue, 14 Apr 2020 15:36:50 -0400 janethor...@aol.com.INVALID wrote Hi, I consider myself to be quite good in Software Development especially using frameworks. I like to get my hands dirty. I have spent the last few

Going it alone.

2020-04-14 Thread jane thorpe
Hi, I consider myself to be quite good in Software Development, especially using frameworks. I like to get my hands dirty. I have spent the last few months understanding modern frameworks and architectures. I am looking to invest my energy in a product where I don't have to rely on

Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-14 Thread Yeikel
Looking at the results of explain, I can see a CollectLimit step. Does that work the same way as a regular .collect()? (where all records are sent to the driver?) spark.sql("select * from db.table limit 100").explain(false) == Physical Plan == CollectLimit 100 +- FileScan parquet ...
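
As far as I can tell, CollectLimit only gathers rows on the driver when the query is executed via collect()/show(); for other actions it shuffles the surviving rows into a single partition on the cluster. A quick check (table name taken from the snippet above):

    val limited = spark.sql("select * from db.table limit 100")
    limited.explain(false)
    println(limited.rdd.getNumPartitions) // typically 1 for a bare limit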

Re: How does spark sql evaluate case statements?

2020-04-14 Thread Yeikel
I do not know the answer to this question, so I am also looking for it. But @kant, maybe the generated code can help with this.
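
One way to pull up that generated code (a sketch; the table and CASE expression are made up) is debugCodegen, which prints the Java source that whole-stage codegen produces for the plan, including how the branches are evaluated:

    import org.apache.spark.sql.execution.debug._

    val df = spark.sql(
      "SELECT CASE WHEN amount > 100 THEN 'big' ELSE 'small' END AS bucket FROM db.table")
    df.debugCodegen()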

Is there any way to set the location of the history for the spark-shell per session?

2020-04-14 Thread Yeikel
In my team, we get elevated access to our Spark cluster using a common username, which means that we all share the same history. I am not sure if this is common, but unfortunately there is nothing I can do about it. Is there any option to set the location of the history? I am looking for

Question on writing batch synchronized incremental graph algorithms

2020-04-14 Thread Kaan Sancak
Hi all, I have been trying to write batch-synchronized incremental graph algorithms. More specifically, I want to run an incremental algorithm on a given dataset and, when a new batch arrives, start the algorithm from the last snapshot and run it on the vertices that are
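
A minimal sketch of one way to structure this with GraphX (toy data, a stand-in "propagate the minimum label" step, and a hypothetical snapshot path; assumes a spark-shell session so sc exists): seed the graph's vertex attributes from the previous snapshot, run Pregel, and persist the new state for the next batch:

    import org.apache.spark.graphx._

    // Edges from the newly arrived batch (toy data).
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

    // Vertex state from the last snapshot; in a real job this would be
    // loaded from durable storage written at the end of the previous batch.
    val snapshot = sc.parallelize(Seq((1L, 1L), (2L, 2L), (3L, 3L)))

    // Seed the graph with the snapshot instead of a fresh initial state.
    val graph = Graph(snapshot, edges)

    // One Pregel pass as a stand-in for the algorithm being resumed.
    val result = graph.pregel(Long.MaxValue)(
      (_, attr, msg) => math.min(attr, msg),
      t => if (t.srcAttr < t.dstAttr) Iterator((t.dstId, t.srcAttr)) else Iterator.empty,
      math.min)

    // Persist the new state so the next batch can start from it.
    result.vertices.saveAsObjectFile("/tmp/vertex_snapshot_v2") // hypothetical path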

Re: Spark interrupts S3 request backoff

2020-04-14 Thread Gabor Somogyi
+1 on the previous guess, and additionally I suggest reproducing it with vanilla Spark. Amazon's Spark contains modifications which are not available in vanilla Spark, which makes problem hunting hard or impossible. In such cases Amazon can help... On Tue, Apr 14, 2020 at 11:20 AM ZHANG Wei wrote: > I

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-14 Thread Gabor Somogyi
The simplest way is to take a thread dump, which doesn't require any fancy tool (it's available on the Spark UI). Without a thread dump it's hard to say anything... On Tue, Apr 14, 2020 at 11:32 AM jane thorpe wrote: > Here is another tool I use, Logic Analyser 7:55 > https://youtu.be/LnzuMJLZRdU > >

Re: Spark Streaming not working

2020-04-14 Thread Gerard Maas
Hi, Could you share the code that you're using to configure the connection to the Kafka broker? This is a bread-and-butter feature. My first thought is that there's something in your particular setup that prevents this from working. kind regards, Gerard. On Fri, Apr 10, 2020 at 7:34 PM

Re: Spark Streaming not working

2020-04-14 Thread Gabor Somogyi
Sorry, hit send accidentally... The symptom is simple: the broker is not responding within 120 seconds. That's the reason why Debabrata asked for the broker config. What I can suggest is to check the previous printout, which logs the Kafka consumer settings. With the mentioned settings you can start

Re: Spark Streaming not working

2020-04-14 Thread Gabor Somogyi
The symptom is simple: the broker is not responding within 120 seconds. That's the reason why Debabrata asked for the broker config. What I can suggest is to check the previous printout, which logs the Kafka consumer settings. With On Tue, Apr 14, 2020 at 11:44 AM ZHANG Wei wrote: > Here is the

[Spark Core]: Does an executor only cache the partitions it requires for its computations or always the full RDD?

2020-04-14 Thread zwithouta
Provided caching is activated for an RDD, does each executor of a cluster cache only the partitions it requires for its computations, or always the full RDD?
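
A small local experiment (a sketch, assuming a spark-shell session so sc exists; getRDDStorageInfo is a developer API) that makes the per-partition behavior visible: only the partitions an action actually computes end up cached.

    val rdd = sc.parallelize(1 to 1000, numSlices = 10).map(_ * 2)
    rdd.cache()

    rdd.take(1) // computes (and caches) only the partitions it needs
    sc.getRDDStorageInfo.foreach(i =>
      println(s"${i.numCachedPartitions}/${i.numPartitions} partitions cached"))

    rdd.count() // a full action materializes, and caches, every partition
    sc.getRDDStorageInfo.foreach(i =>
      println(s"${i.numCachedPartitions}/${i.numPartitions} partitions cached"))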

Re: Spark Streaming not working

2020-04-14 Thread ZHANG Wei
Here is the assertion error message format: s"Failed to get records for $groupId $topic $partition $offset after polling for $timeout") You might have to check the Kafka service with the error log: > 20/04/10 17:28:04 ERROR Executor: Exception in task 0.5 in stage 0.0 (TID 24) >

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-14 Thread jane thorpe
Here is another tool I use, Logic Analyser (7:55): https://youtu.be/LnzuMJLZRdU You could also take some suggestions for improving query performance: https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1 Jane thorpe janethor...@aol.com -Original Message- From: jane

Re: Spark interrupts S3 request backoff

2020-04-14 Thread ZHANG Wei
I will make a guess: it's not interrupted, it's killed by the driver or the resource manager since the executor has been sleeping for a long time. You may have to find the root cause in the driver and failed executor log contexts. -- Cheers, -z From:

What is the best way to take the top N entries from a hive table/data source?

2020-04-14 Thread yeikel valdes
When I use .limit(), the number of partitions for the resulting dataframe is 1, which normally fails most jobs. val df = spark.sql("select * from table limit n") df.write.parquet() Thanks!
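
A minimal sketch of one common workaround (table name, partition count, and output path are hypothetical): redistribute after the limit so the write runs with normal parallelism:

    val df = spark.sql("select * from db.table limit 100")
    df.repartition(8).write.parquet("/tmp/top_n_output")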