Re: is RocksDB backend available in 3.0 preview?

2020-04-21 Thread Jungtaek Lim
Unfortunately, the short answer is no. Please refer to the last part of the
discussion on the PR https://github.com/apache/spark/pull/24922

Unless we get a native implementation of this, I guess this project is the
most widely known implementation of a RocksDB-backed state store -
https://github.com/chermenin/spark-states
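
For anyone who wants to try it, wiring a third-party provider into a query is
a configuration change. A minimal sketch, assuming the provider class name
below matches what that project publishes (check its README for the exact
name and version):

import org.apache.spark.sql.SparkSession

// Point Structured Streaming's state store at an external RocksDB-backed provider.
// "spark.sql.streaming.stateStore.providerClass" is the standard Spark config key;
// the provider class name is an assumption based on the chermenin/spark-states project.
val spark = SparkSession.builder()
  .appName("rocksdb-state-store-sketch")
  .config("spark.sql.streaming.stateStore.providerClass",
    "ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
  .getOrCreate()

The project's jar also has to be on both the driver and executor classpaths.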

On Wed, Apr 22, 2020 at 11:32 AM kant kodali  wrote:

> Hi All,
>
> 1. Is the RocksDB backend available in the 3.0 preview?
> 2. If RocksDB can store intermediate results of a stream-stream join, can I
> run streaming join queries forever? By forever I mean until I run out of
> disk - or, put another way, can I run the stream-stream join queries for years
> if necessary (imagine I have a lot of disk capacity but not a whole lot of
> RAM)?
> 3. Does it do incremental checkpointing to HDFS?
>
> Thanks!
>
>


Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-21 Thread Yeikel
Hi Zhang. Thank you for your response.

While your answer clarifies my confusion with `CollectLimit`, it still does
not clarify the recommended way to extract large amounts of data
(but not all of the records) from a source while maintaining a high level of
parallelism.

For example, in some instances, when trying to extract 1 million records from a
table with over 100M records, I see my cluster using 1-2 cores out of the
hundreds that I have available.
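
To illustrate what is happening, a small sketch (the table name is a
placeholder; using sample() instead of a global limit is only a workaround
idea, not an official recommendation):

// A terminal LIMIT is planned as CollectLimit over a single partition,
// so only one task ends up doing the work no matter how big the cluster is.
val limited = spark.sql("select * from db.table limit 1000000")
println(limited.rdd.partitions.size)   // 1

// Sampling a fraction keeps the scan parallel because each input split is
// sampled independently; the trade-off is that the row count is approximate.
val sampled = spark.table("db.table").sample(withReplacement = false, fraction = 0.01)
println(sampled.rdd.partitions.size)   // roughly one partition per input split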






Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Jungtaek Lim
No, that's not a thing to apologize for. It's just your call - less context
would bring less reaction and interest.

On Wed, Apr 22, 2020 at 11:50 AM Ruijing Li  wrote:

> I apologize, but I cannot share it, even if it is just typical spark
> libraries. I definitely understand that limits debugging help, but wanted
> to understand if anyone has encountered a similar issue.
>
> On Tue, Apr 21, 2020 at 7:12 PM Jungtaek Lim 
> wrote:
>
>> If there's no third party libraries in the dump then why not share the
>> thread dump? (I mean, the output of jstack)
>>
>> stack trace would be more helpful to find which thing acquired lock and
>> which other things are waiting for acquiring lock, if we suspect deadlock.
>>
>> On Wed, Apr 22, 2020 at 2:38 AM Ruijing Li  wrote:
>>
>>> After refreshing a couple of times, I notice the lock is being swapped
>>> between these 3. The other 2 will be blocked by whoever gets this lock, in
>>> a cycle of 160 has lock -> 161 -> 159 -> 160
>>>
>>> On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li 
>>> wrote:
>>>
 In thread dump, I do see this
 - SparkUI-160- acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE |
 Monitor
 - SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED |
 Blocked by Thread(Some(160)) Lock
 -  SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED |
 Blocked by Thread(Some(160)) Lock

 Could the fact that 160 has the monitor but is not running be causing a
 deadlock preventing the job from finishing?

 I do see my Finalizer and main method are waiting. I don’t see any
 other threads from 3rd party libraries or my code in the dump. I do see
 spark context cleaner has timed waiting.

 Thanks


 On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li 
 wrote:

> Strangely enough I found an old issue that is the exact same issue as
> mine
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343
>
> However I’m using spark 2.4.4 so the issue should have been solved by
> now.
>
> Like the user in the jira issue I am using mesos, but I am reading
> from oracle instead of writing to Cassandra and S3.
>
>
> On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei  wrote:
>
>> The Thread dump result table of the Spark UI can provide some clues to
>> find out thread lock issues, such as:
>>
>>   Thread ID | Thread Name                  | Thread State | Thread Locks
>>   13        | NonBlockingInputStreamThread | WAITING      | Blocked by Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951})
>>   48        | Thread-16                    | RUNNABLE     | Monitor(jline.internal.NonBlockingInputStream@103008951})
>>
>> And each thread row can show the call stacks after being clicked;
>> then you can check the root cause of holding locks, like this (Thread 48
>> above):
>>
>>   org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native
>> Method)
>>
>> org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811)
>>
>> org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842)
>>
>> org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97)
>>   jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222)
>>   
>>
>> Hope it can help you.
>>
>> --
>> Cheers,
>> -z
>>
>> On Thu, 16 Apr 2020 16:36:42 +0900
>> Jungtaek Lim  wrote:
>>
>> > Do thread dump continuously, per specific period (like 1s) and see
>> the
>> > change of stack / lock for each thread. (This is not easy to be
>> done in UI
>> > so maybe doing manually would be the only option. Not sure Spark UI
>> will
>> > provide the same, haven't used at all.)
>> >
>> > It will tell which thread is being blocked (even it's shown as
>> running) and
>> > which point to look at.
>> >
>> > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li 
>> wrote:
>> >
>> > > Once I do. thread dump, what should I be looking for to tell
>> where it is
>> > > hanging? Seeing a lot of timed_waiting and waiting on driver.
>> Driver is
>> > > also being blocked by spark UI. If there are no tasks, is there a
>> point to
>> > > do thread dump of executors?
>> > >
>> > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi <
>> gabor.g.somo...@gmail.com>
>> > > wrote:
>> > >
>> > >> The simplest way is to do thread dump which doesn't require any
>> fancy
>> > >> tool (it's available on Spark UI).
>> > >> Without thread dump it's hard to say anything...
>> > >>
>> > >>
>> > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe
>> 
>> > >> wrote:
>> > >>
>> > >>> Here is another tool I use, Logic Analyser  7:55
>> > >>> 

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
I apologize, but I cannot share it, even if it is just typical spark
libraries. I definitely understand that limits debugging help, but wanted
to understand if anyone has encountered a similar issue.

On Tue, Apr 21, 2020 at 7:12 PM Jungtaek Lim 
wrote:

> If there's no third party libraries in the dump then why not share the
> thread dump? (I mean, the output of jstack)
>
> stack trace would be more helpful to find which thing acquired lock and
> which other things are waiting for acquiring lock, if we suspect deadlock.
>
> On Wed, Apr 22, 2020 at 2:38 AM Ruijing Li  wrote:
>
>> After refreshing a couple of times, I notice the lock is being swapped
>> between these 3. The other 2 will be blocked by whoever gets this lock, in
>> a cycle of 160 has lock -> 161 -> 159 -> 160
>>
>> On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li 
>> wrote:
>>
>>> In thread dump, I do see this
>>> - SparkUI-160- acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE |
>>> Monitor
>>> - SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED |
>>> Blocked by Thread(Some(160)) Lock
>>> -  SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED |
>>> Blocked by Thread(Some(160)) Lock
>>>
>>> Could the fact that 160 has the monitor but is not running be causing a
>>> deadlock preventing the job from finishing?
>>>
>>> I do see my Finalizer and main method are waiting. I don’t see any other
>>> threads from 3rd party libraries or my code in the dump. I do see spark
>>> context cleaner has timed waiting.
>>>
>>> Thanks
>>>
>>>
>>> On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li 
>>> wrote:
>>>
 Strangely enough I found an old issue that is the exact same issue as
 mine
 https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343

 However I’m using spark 2.4.4 so the issue should have been solved by
 now.

 Like the user in the jira issue I am using mesos, but I am reading from
 oracle instead of writing to Cassandra and S3.


 On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei  wrote:

> The Thread dump result table of the Spark UI can provide some clues to
> find out thread lock issues, such as:
>
>   Thread ID | Thread Name                  | Thread State | Thread Locks
>   13        | NonBlockingInputStreamThread | WAITING      | Blocked by Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951})
>   48        | Thread-16                    | RUNNABLE     | Monitor(jline.internal.NonBlockingInputStream@103008951})
>
> And each thread row can show the call stacks after being clicked; then
> you can check the root cause of holding locks, like this (Thread 48
> above):
>
>   org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native
> Method)
>
> org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811)
>
> org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842)
>
> org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97)
>   jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222)
>   
>
> Hope it can help you.
>
> --
> Cheers,
> -z
>
> On Thu, 16 Apr 2020 16:36:42 +0900
> Jungtaek Lim  wrote:
>
> > Do thread dump continuously, per specific period (like 1s) and see
> the
> > change of stack / lock for each thread. (This is not easy to be done
> in UI
> > so maybe doing manually would be the only option. Not sure Spark UI
> will
> > provide the same, haven't used at all.)
> >
> > It will tell which thread is being blocked (even it's shown as
> running) and
> > which point to look at.
> >
> > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li 
> wrote:
> >
> > > Once I do. thread dump, what should I be looking for to tell where
> it is
> > > hanging? Seeing a lot of timed_waiting and waiting on driver.
> Driver is
> > > also being blocked by spark UI. If there are no tasks, is there a
> point to
> > > do thread dump of executors?
> > >
> > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi <
> gabor.g.somo...@gmail.com>
> > > wrote:
> > >
> > >> The simplest way is to do thread dump which doesn't require any
> fancy
> > >> tool (it's available on Spark UI).
> > >> Without thread dump it's hard to say anything...
> > >>
> > >>
> > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe
> 
> > >> wrote:
> > >>
> > >>> Here is another tool I use, Logic Analyser  7:55
> > >>> https://youtu.be/LnzuMJLZRdU
> > >>>
> > >>> you could take some suggestions for improving performance
> queries.
> > >>>
> https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1
> > >>>
> > >>>
> > >>> Jane thorpe
> > >>> janethor...@aol.com
> > >>>
> > >>>
> > >>> 

is RocksDB backend available in 3.0 preview?

2020-04-21 Thread kant kodali
Hi All,

1. Is the RocksDB backend available in the 3.0 preview?
2. If RocksDB can store intermediate results of a stream-stream join, can I
run streaming join queries forever? By forever I mean until I run out of
disk - or, put another way, can I run the stream-stream join queries for years
if necessary (imagine I have a lot of disk capacity but not a whole lot of
RAM)?
3. Does it do incremental checkpointing to HDFS?

Thanks!
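
On question 2, the thread does not answer it directly, but note that what
bounds state in a stream-stream join is the watermark plus a time-range join
condition, independent of which state store backend is used. A minimal sketch
using the built-in rate source as a stand-in for real topics (all names are
illustrative):

import org.apache.spark.sql.functions.expr

val impressions = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .withColumnRenamed("timestamp", "impressionTime")
  .withColumnRenamed("value", "impressionAdId")

val clicks = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .withColumnRenamed("timestamp", "clickTime")
  .withColumnRenamed("value", "clickAdId")

// Watermarks plus the BETWEEN condition let Spark drop state older than the
// allowed lateness, so state does not grow without bound.
val joined = impressions.withWatermark("impressionTime", "10 minutes")
  .join(
    clicks.withWatermark("clickTime", "20 minutes"),
    expr("clickAdId = impressionAdId AND " +
         "clickTime BETWEEN impressionTime AND impressionTime + interval 10 minutes"))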


Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Jungtaek Lim
If there are no third-party libraries in the dump, then why not share the
thread dump? (I mean, the output of jstack.)

A stack trace would be more helpful for finding which thread acquired a lock
and which other threads are waiting to acquire it, if we suspect a deadlock.
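
As a complement to jstack or the Executors > Thread Dump page in the Spark
UI, the JVM can also report deadlocks programmatically. A small sketch (not
something from this thread, just standard java.lang.management usage) that
could be run inside the driver:

import java.lang.management.ManagementFactory

val threadBean = ManagementFactory.getThreadMXBean
// findDeadlockedThreads() returns null when no deadlock is detected.
Option(threadBean.findDeadlockedThreads()).foreach { ids =>
  threadBean.getThreadInfo(ids, Int.MaxValue).foreach { info =>
    println(s"${info.getThreadName} is blocked on ${info.getLockName} " +
      s"held by ${info.getLockOwnerName}")
  }
}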

On Wed, Apr 22, 2020 at 2:38 AM Ruijing Li  wrote:

> After refreshing a couple of times, I notice the lock is being swapped
> between these 3. The other 2 will be blocked by whoever gets this lock, in
> a cycle of 160 has lock -> 161 -> 159 -> 160
>
> On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li  wrote:
>
>> In thread dump, I do see this
>> - SparkUI-160- acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE |
>> Monitor
>> - SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED |
>> Blocked by Thread(Some(160)) Lock
>> -  SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED |
>> Blocked by Thread(Some(160)) Lock
>>
>> Could the fact that 160 has the monitor but is not running be causing a
>> deadlock preventing the job from finishing?
>>
>> I do see my Finalizer and main method are waiting. I don’t see any other
>> threads from 3rd party libraries or my code in the dump. I do see spark
>> context cleaner has timed waiting.
>>
>> Thanks
>>
>>
>> On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li  wrote:
>>
>>> Strangely enough I found an old issue that is the exact same issue as
>>> mine
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343
>>>
>>> However I’m using spark 2.4.4 so the issue should have been solved by
>>> now.
>>>
>>> Like the user in the jira issue I am using mesos, but I am reading from
>>> oracle instead of writing to Cassandra and S3.
>>>
>>>
>>> On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei  wrote:
>>>
 The Thread dump result table of the Spark UI can provide some clues to find
 out thread lock issues, such as:

   Thread ID | Thread Name                  | Thread State | Thread Locks
   13        | NonBlockingInputStreamThread | WAITING      | Blocked by Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951})
   48        | Thread-16                    | RUNNABLE     | Monitor(jline.internal.NonBlockingInputStream@103008951})

 And each thread row can show the call stacks after being clicked; then
 you can check the root cause of holding locks, like this (Thread 48 above):

   org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native
 Method)

 org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811)

 org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842)

 org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97)
   jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222)
   

 Hope it can help you.

 --
 Cheers,
 -z

 On Thu, 16 Apr 2020 16:36:42 +0900
 Jungtaek Lim  wrote:

 > Do thread dump continuously, per specific period (like 1s) and see the
 > change of stack / lock for each thread. (This is not easy to be done
 in UI
 > so maybe doing manually would be the only option. Not sure Spark UI
 will
 > provide the same, haven't used at all.)
 >
 > It will tell which thread is being blocked (even it's shown as
 running) and
 > which point to look at.
 >
 > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li 
 wrote:
 >
 > > Once I do. thread dump, what should I be looking for to tell where
 it is
 > > hanging? Seeing a lot of timed_waiting and waiting on driver.
 Driver is
 > > also being blocked by spark UI. If there are no tasks, is there a
 point to
 > > do thread dump of executors?
 > >
 > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi <
 gabor.g.somo...@gmail.com>
 > > wrote:
 > >
 > >> The simplest way is to do thread dump which doesn't require any
 fancy
 > >> tool (it's available on Spark UI).
 > >> Without thread dump it's hard to say anything...
 > >>
 > >>
 > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe
 
 > >> wrote:
 > >>
 > >>> Here is another tool I use, Logic Analyser  7:55
 > >>> https://youtu.be/LnzuMJLZRdU
 > >>>
 > >>> you could take some suggestions for improving performance
 queries.
 > >>>
 https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1
 > >>>
 > >>>
 > >>> Jane thorpe
 > >>> janethor...@aol.com
 > >>>
 > >>>
 > >>> -Original Message-
 > >>> From: jane thorpe 
 > >>> To: janethorpe1 ; mich.talebzadeh <
 > >>> mich.talebza...@gmail.com>; liruijing09 ;
 user <
 > >>> user@spark.apache.org>
 > >>> Sent: Mon, 13 Apr 2020 8:32
 > >>> Subject: Re: Spark hangs while reading from jdbc - does nothing
 Removing
 > >>> Guess work from trouble shooting
 > >>>
 > >>>
 > >>>
 > 

Spark Mongodb connector hangs indefinitely, not working on Amazon EMR

2020-04-21 Thread Daniel Stojanov
When running a Pyspark application on my local machine I am able to save
and retrieve from the Mongodb server using the Mongodb Spark connector. All
works properly. When submitting the exact same application on my Amazon EMR
cluster I can see that the package for the Spark driver is being properly
collected from Maven when the job is submitted. However, it is not working.

From my instance of Amazon EMR I can communicate with the database using
Pymongo without problems. I can load/save dataframes when using pyspark
interactively from the driver, but when submitting jobs via spark-submit
over the yarn cluster it hangs.

The problem gives no error messages; it just shows 0 activity on the driver
and executors. The pyspark application just stops until manually terminated.

Has anyone else used the Mongodb Spark connector from Amazon EMR?
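
For context, the read in question (whether in PySpark or Scala) usually looks
something like the sketch below; the "mongo" format and "uri" option are the
ones documented for the MongoDB Spark connector, while the host, database and
collection names are placeholders. The sketch is in Scala for illustration,
but the options are the same from PySpark.

// Requires the MongoDB Spark connector package on the classpath
// (e.g. passed via --packages at submit time).
val people = spark.read
  .format("mongo")
  .option("uri", "mongodb://some-host:27017/mydb.mycollection")
  .load()
people.show()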


--


Re: Spark Structure Streaming | FileStreamSourceLog not deleting list of input files | Spark -2.4.0

2020-04-21 Thread Jungtaek Lim
You're hitting an existing issue,
https://issues.apache.org/jira/browse/SPARK-17604. While there's no active
PR to address it, I've been planning to take a look sooner rather than later.

Btw, you may also want to take a look at my previous mail - the topic of
that mail thread was the file stream sink metadata growing bigger, but
in fact it's basically the same issue, so you may get some information
from there. (tl;dr. I have a bunch of PRs addressing multiple issues on the
file stream source and sink; they are just lacking some love.)

https://lists.apache.org/thread.html/rb4ebf1d20d13db0a78694e8d301e51c326f803cb86fc1a1f66f2ae7e%40%3Cuser.spark.apache.org%3E

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Apr 21, 2020 at 8:23 PM Pappu Yadav  wrote:

> Hi Team,
>
> While running Spark, below are some findings.
>
>    1. FileStreamSourceLog is responsible for maintaining the input source
>    file list.
>    2. Spark Streaming deletes expired log files on the basis of
>    *spark.sql.streaming.fileSource.log.deletion* and
>    *spark.sql.streaming.minBatchesToRetain*.
>    3. But while compacting logs, Spark Streaming writes the complete list
>    of files the stream has seen so far into one single .compact file in HDFS.
>    4. Over the course of time this compact file is consuming around
>    2GB-5GB in HDFS, which delays creation of the compact file after every 10th
>    batch and also increases the job restart time.
>    5. Why is Spark Streaming logging files that are already deleted from the
>    system? While creating the compact file there should be some configurable
>    timeout so that Spark can skip writing the expired list of input files.
>
> *Also kindly let me know if I missed something and whether there is some
> configuration already present to handle this.*
>
> Regards
> Pappu Yadav
>


Re: Using startingOffsets latest - no data from structured streaming kafka query

2020-04-21 Thread Ruijing Li
Yes, we did. But for some reason latest does not show them. The count is
always 0.
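
For reference, the option under discussion sits on the Kafka source reader; a
minimal sketch (broker and topic names are placeholders):

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "my-topic")
  // "latest" only picks up records produced after the query starts;
  // records already sitting in the topic are skipped, unlike with "earliest".
  .option("startingOffsets", "latest")
  .load()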

On Sun, Apr 19, 2020 at 3:42 PM Jungtaek Lim 
wrote:

> Did you provide more records to topic "after" you started the query?
> That's the only one I can imagine based on such information.
>
> On Fri, Apr 17, 2020 at 9:13 AM Ruijing Li  wrote:
>
>> Hi all,
>>
>> Apologies if this has been asked before, but I could not find the answer
>> to this question. We have a structured streaming job, but for some reason,
>> if we use startingOffsets = latest with foreachbatch mode, it doesn’t
>> produce any data.
>>
>> Rather, in logs I see it repeats the message “ Fetcher [Consumer]
>> Resetting offset for partition to offset” over and over again..
>>
>> However with startingOffsets=earliest, we don’t get this issue. I’m
>> wondering then how we can use startingOffsets=latest as I wish to start
>> from the latest offset available.
>> --
>> Cheers,
>> Ruijing Li
>>
> --
Cheers,
Ruijing Li


Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
In thread dump, I do see this
- SparkUI-160-acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE | Monitor
- SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | Blocked by Thread(Some(160)) Lock
- SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | Blocked by Thread(Some(160)) Lock

Could the fact that 160 has the monitor but is not running be causing a
deadlock preventing the job from finishing?

I do see my Finalizer and main method are waiting. I don’t see any other
threads from 3rd-party libraries or my code in the dump. I do see the Spark
context cleaner is in timed waiting.

Thanks


On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li  wrote:

> Strangely enough I found an old issue that is the exact same issue as mine
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343
>
> However I’m using spark 2.4.4 so the issue should have been solved by now.
>
> Like the user in the jira issue I am using mesos, but I am reading from
> oracle instead of writing to Cassandra and S3.
>
>
> On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei  wrote:
>
>> The Thread dump result table of the Spark UI can provide some clues to find
>> out thread lock issues, such as:
>>
>>   Thread ID | Thread Name                  | Thread State | Thread Locks
>>   13        | NonBlockingInputStreamThread | WAITING      | Blocked by Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951})
>>   48        | Thread-16                    | RUNNABLE     | Monitor(jline.internal.NonBlockingInputStream@103008951})
>>
>> And each thread row can show the call stacks after being clicked, then
>> you can check the root cause of holding locks, like this (Thread 48 above):
>>
>>   org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native Method)
>>
>> org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811)
>>
>> org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842)
>>
>> org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97)
>>   jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222)
>>   
>>
>> Hope it can help you.
>>
>> --
>> Cheers,
>> -z
>>
>> On Thu, 16 Apr 2020 16:36:42 +0900
>> Jungtaek Lim  wrote:
>>
>> > Do thread dump continuously, per specific period (like 1s) and see the
>> > change of stack / lock for each thread. (This is not easy to be done in
>> UI
>> > so maybe doing manually would be the only option. Not sure Spark UI will
>> > provide the same, haven't used at all.)
>> >
>> > It will tell which thread is being blocked (even it's shown as running)
>> and
>> > which point to look at.
>> >
>> > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li 
>> wrote:
>> >
>> > > Once I do. thread dump, what should I be looking for to tell where it
>> is
>> > > hanging? Seeing a lot of timed_waiting and waiting on driver. Driver
>> is
>> > > also being blocked by spark UI. If there are no tasks, is there a
>> point to
>> > > do thread dump of executors?
>> > >
>> > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi <
>> gabor.g.somo...@gmail.com>
>> > > wrote:
>> > >
>> > >> The simplest way is to do thread dump which doesn't require any fancy
>> > >> tool (it's available on Spark UI).
>> > >> Without thread dump it's hard to say anything...
>> > >>
>> > >>
>> > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe
>> 
>> > >> wrote:
>> > >>
>> > >>> Here is another tool I use, Logic Analyser  7:55
>> > >>> https://youtu.be/LnzuMJLZRdU
>> > >>>
>> > >>> you could take some suggestions for improving performance  queries.
>> > >>>
>> https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1
>> > >>>
>> > >>>
>> > >>> Jane thorpe
>> > >>> janethor...@aol.com
>> > >>>
>> > >>>
>> > >>> -Original Message-
>> > >>> From: jane thorpe 
>> > >>> To: janethorpe1 ; mich.talebzadeh <
>> > >>> mich.talebza...@gmail.com>; liruijing09 ;
>> user <
>> > >>> user@spark.apache.org>
>> > >>> Sent: Mon, 13 Apr 2020 8:32
>> > >>> Subject: Re: Spark hangs while reading from jdbc - does nothing
>> Removing
>> > >>> Guess work from trouble shooting
>> > >>>
>> > >>>
>> > >>>
>> > >>> This tool may be useful for you to trouble shoot your problems away.
>> > >>>
>> > >>>
>> > >>>
>> https://www.javacodegeeks.com/2020/04/simplifying-apm-remove-the-guesswork-from-troubleshooting.html
>> > >>>
>> > >>>
>> > >>> "APM tools typically use a waterfall-type view to show the blocking
>> > >>> time of different components cascading through the control flow
>> within an
>> > >>> application.
>> > >>> These types of visualizations are useful, and AppOptics has them,
>> but
>> > >>> they can be difficult to understand for those of us without a PhD."
>> > >>>
>> > >>> Especially  helpful if you want to understand through visualisation
>> and
>> > >>> you do not have a phD.
>> > >>>
>> > >>>
>> > >>> Jane thorpe
>> > >>> janethor...@aol.com
>> > >>>
>> > >>>
>> > >>> -Original Message-
>> > >>> From: jane thorpe 
>> > >>> To: 

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
After refreshing a couple of times, I notice the lock is being swapped
between these 3. The other 2 will be blocked by whoever gets this lock, in
a cycle: 160 has the lock -> 161 -> 159 -> 160.

On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li  wrote:

> In thread dump, I do see this
> - SparkUI-160- acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE |
> Monitor
> - SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED |
> Blocked by Thread(Some(160)) Lock
> -  SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED |
> Blocked by Thread(Some(160)) Lock
>
> Could the fact that 160 has the monitor but is not running be causing a
> deadlock preventing the job from finishing?
>
> I do see my Finalizer and main method are waiting. I don’t see any other
> threads from 3rd party libraries or my code in the dump. I do see spark
> context cleaner has timed waiting.
>
> Thanks
>
>
> On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li  wrote:
>
>> Strangely enough I found an old issue that is the exact same issue as
>> mine
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343
>>
>> However I’m using spark 2.4.4 so the issue should have been solved by now.
>>
>> Like the user in the jira issue I am using mesos, but I am reading from
>> oracle instead of writing to Cassandra and S3.
>>
>>
>> On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei  wrote:
>>
>>> The Thread dump result table of the Spark UI can provide some clues to find
>>> out thread lock issues, such as:
>>>
>>>   Thread ID | Thread Name                  | Thread State | Thread Locks
>>>   13        | NonBlockingInputStreamThread | WAITING      | Blocked by Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951})
>>>   48        | Thread-16                    | RUNNABLE     | Monitor(jline.internal.NonBlockingInputStream@103008951})
>>>
>>> And each thread row can show the call stacks after being clicked, then
>>> you can check the root cause of holding locks, like this (Thread 48 above):
>>>
>>>   org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native Method)
>>>
>>> org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811)
>>>
>>> org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842)
>>>
>>> org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97)
>>>   jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222)
>>>   
>>>
>>> Hope it can help you.
>>>
>>> --
>>> Cheers,
>>> -z
>>>
>>> On Thu, 16 Apr 2020 16:36:42 +0900
>>> Jungtaek Lim  wrote:
>>>
>>> > Do thread dump continuously, per specific period (like 1s) and see the
>>> > change of stack / lock for each thread. (This is not easy to be done
>>> in UI
>>> > so maybe doing manually would be the only option. Not sure Spark UI
>>> will
>>> > provide the same, haven't used at all.)
>>> >
>>> > It will tell which thread is being blocked (even it's shown as
>>> running) and
>>> > which point to look at.
>>> >
>>> > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li 
>>> wrote:
>>> >
>>> > > Once I do. thread dump, what should I be looking for to tell where
>>> it is
>>> > > hanging? Seeing a lot of timed_waiting and waiting on driver. Driver
>>> is
>>> > > also being blocked by spark UI. If there are no tasks, is there a
>>> point to
>>> > > do thread dump of executors?
>>> > >
>>> > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi <
>>> gabor.g.somo...@gmail.com>
>>> > > wrote:
>>> > >
>>> > >> The simplest way is to do thread dump which doesn't require any
>>> fancy
>>> > >> tool (it's available on Spark UI).
>>> > >> Without thread dump it's hard to say anything...
>>> > >>
>>> > >>
>>> > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe
>>> 
>>> > >> wrote:
>>> > >>
>>> > >>> Here is another tool I use, Logic Analyser  7:55
>>> > >>> https://youtu.be/LnzuMJLZRdU
>>> > >>>
>>> > >>> you could take some suggestions for improving performance  queries.
>>> > >>>
>>> https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1
>>> > >>>
>>> > >>>
>>> > >>> Jane thorpe
>>> > >>> janethor...@aol.com
>>> > >>>
>>> > >>>
>>> > >>> -Original Message-
>>> > >>> From: jane thorpe 
>>> > >>> To: janethorpe1 ; mich.talebzadeh <
>>> > >>> mich.talebza...@gmail.com>; liruijing09 ;
>>> user <
>>> > >>> user@spark.apache.org>
>>> > >>> Sent: Mon, 13 Apr 2020 8:32
>>> > >>> Subject: Re: Spark hangs while reading from jdbc - does nothing
>>> Removing
>>> > >>> Guess work from trouble shooting
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> This tool may be useful for you to trouble shoot your problems
>>> away.
>>> > >>>
>>> > >>>
>>> > >>>
>>> https://www.javacodegeeks.com/2020/04/simplifying-apm-remove-the-guesswork-from-troubleshooting.html
>>> > >>>
>>> > >>>
>>> > >>> "APM tools typically use a waterfall-type view to show the blocking
>>> > >>> time of different components cascading through the control flow
>>> within an
>>> > >>> application.
>>> > >>> These types of visualizations are 

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
Strangely enough I found an old issue that is the exact same issue as mine
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343

However I’m using spark 2.4.4 so the issue should have been solved by now.

Like the user in the jira issue I am using mesos, but I am reading from
oracle instead of writing to Cassandra and S3.


On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei  wrote:

> The Thread dump result table of the Spark UI can provide some clues to find
> out thread lock issues, such as:
>
>   Thread ID | Thread Name                  | Thread State | Thread Locks
>   13        | NonBlockingInputStreamThread | WAITING      | Blocked by Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951})
>   48        | Thread-16                    | RUNNABLE     | Monitor(jline.internal.NonBlockingInputStream@103008951})
>
> And each thread row can show the call stacks after being clicked, then you
> can check the root cause of holding locks, like this (Thread 48 above):
>
>   org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native Method)
>
> org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811)
>
> org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842)
>
> org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97)
>   jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222)
>   
>
> Hope it can help you.
>
> --
> Cheers,
> -z
>
> On Thu, 16 Apr 2020 16:36:42 +0900
> Jungtaek Lim  wrote:
>
> > Do thread dump continuously, per specific period (like 1s) and see the
> > change of stack / lock for each thread. (This is not easy to be done in
> UI
> > so maybe doing manually would be the only option. Not sure Spark UI will
> > provide the same, haven't used at all.)
> >
> > It will tell which thread is being blocked (even it's shown as running)
> and
> > which point to look at.
> >
> > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li 
> wrote:
> >
> > > Once I do. thread dump, what should I be looking for to tell where it
> is
> > > hanging? Seeing a lot of timed_waiting and waiting on driver. Driver is
> > > also being blocked by spark UI. If there are no tasks, is there a
> point to
> > > do thread dump of executors?
> > >
> > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi <
> gabor.g.somo...@gmail.com>
> > > wrote:
> > >
> > >> The simplest way is to do thread dump which doesn't require any fancy
> > >> tool (it's available on Spark UI).
> > >> Without thread dump it's hard to say anything...
> > >>
> > >>
> > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe
> 
> > >> wrote:
> > >>
> > >>> Here is another tool I use, Logic Analyser  7:55
> > >>> https://youtu.be/LnzuMJLZRdU
> > >>>
> > >>> you could take some suggestions for improving performance  queries.
> > >>>
> https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1
> > >>>
> > >>>
> > >>> Jane thorpe
> > >>> janethor...@aol.com
> > >>>
> > >>>
> > >>> -Original Message-
> > >>> From: jane thorpe 
> > >>> To: janethorpe1 ; mich.talebzadeh <
> > >>> mich.talebza...@gmail.com>; liruijing09 ;
> user <
> > >>> user@spark.apache.org>
> > >>> Sent: Mon, 13 Apr 2020 8:32
> > >>> Subject: Re: Spark hangs while reading from jdbc - does nothing
> Removing
> > >>> Guess work from trouble shooting
> > >>>
> > >>>
> > >>>
> > >>> This tool may be useful for you to trouble shoot your problems away.
> > >>>
> > >>>
> > >>>
> https://www.javacodegeeks.com/2020/04/simplifying-apm-remove-the-guesswork-from-troubleshooting.html
> > >>>
> > >>>
> > >>> "APM tools typically use a waterfall-type view to show the blocking
> > >>> time of different components cascading through the control flow
> within an
> > >>> application.
> > >>> These types of visualizations are useful, and AppOptics has them, but
> > >>> they can be difficult to understand for those of us without a PhD."
> > >>>
> > >>> Especially  helpful if you want to understand through visualisation
> and
> > >>> you do not have a phD.
> > >>>
> > >>>
> > >>> Jane thorpe
> > >>> janethor...@aol.com
> > >>>
> > >>>
> > >>> -Original Message-
> > >>> From: jane thorpe 
> > >>> To: mich.talebzadeh ; liruijing09 <
> > >>> liruijin...@gmail.com>; user 
> > >>> CC: user 
> > >>> Sent: Sun, 12 Apr 2020 4:35
> > >>> Subject: Re: Spark hangs while reading from jdbc - does nothing
> > >>>
> > >>> You seem to be implying the error is intermittent.
> > >>> You seem to be implying data is being ingested  via JDBC. So the
> > >>> connection has proven itself to be working unless no data is
> arriving from
> > >>> the  JDBC channel at all.  If no data is arriving then one could say
> it
> > >>> could be  the JDBC.
> > >>> If the error is intermittent  then it is likely a resource involved
> in
> > >>> processing is filling to capacity.
> > >>> Try reducing the data ingestion volume and see if that completes,
> then
> > >>> increase the data ingested  incrementally.
> > >>> I assume you have  

Re: Using P4J Plugins with Spark

2020-04-21 Thread Todd Nist
You may want to make sure you include the P4J jar and your plugins as
part of the following so that both the driver and the executors have access.
If HDFS is out, then you could
make a common mount point on each of the executor nodes so they have access
to the classes.


   - spark-submit --jars /common/path/to/jars
   - spark.driver.extraClassPath, or its alias --driver-class-path, to set
   extra classpaths on the node running the driver.
   - spark.executor.extraClassPath to set the extra classpath on the worker
   nodes.


On Tue, Apr 21, 2020 at 1:13 AM Shashanka Balakuntala <
shbalakunt...@gmail.com> wrote:

> Hi users,
> I'm a bit of a newbie to Spark infrastructure, and I have a small doubt.
> I have a Maven project with plugins generated separately in a folder, and
> the normal java command to run it is as follows:
> `java -Dp4j.pluginsDir=./plugins -jar /path/to/jar`
>
> Now, when I run this program locally with spark-submit against a standalone
> cluster (not cluster mode), the program compiles, the plugins are in the
> "plugins" folder under $SPARK_HOME, and they are getting recognised.
> The same is not the case in cluster mode. It says the Extension point is
> not loaded. Please advise on how I can create a "plugins" folder which can
> be shared among the workers.
>
> PS: HDFS is not an option as we don't have a separate setup
>
> Thanks.
>
>
> *Regards*
>   Shashanka Balakuntala Srinivasa
>
>


Spark Structure Streaming | FileStreamSourceLog not deleting list of input files | Spark -2.4.0

2020-04-21 Thread Pappu Yadav
Hi Team,

While running Spark, below are some findings.

   1. FileStreamSourceLog is responsible for maintaining the input source
   file list.
   2. Spark Streaming deletes expired log files on the basis of
   *spark.sql.streaming.fileSource.log.deletion* and
   *spark.sql.streaming.minBatchesToRetain*.
   3. But while compacting logs, Spark Streaming writes the complete list of
   files the stream has seen so far into one single .compact file in HDFS.
   4. Over the course of time this compact file is consuming around
   2GB-5GB in HDFS, which delays creation of the compact file after every 10th
   batch and also increases the job restart time.
   5. Why is Spark Streaming logging files that are already deleted from the
   system? While creating the compact file there should be some configurable
   timeout so that Spark can skip writing the expired list of input files.

*Also kindly let me know if I missed something and whether there is some
configuration already present to handle this.*
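
For reference, these are the knobs named above; a sketch with illustrative
values (they control deletion of old metadata log files and the compaction
cadence, but, per the SPARK-17604 discussion, there is no retention setting
for entries inside the .compact file itself):

// Settings must be in place before the streaming query starts.
spark.conf.set("spark.sql.streaming.fileSource.log.deletion", "true")
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "100")
// Assumed companion setting controlling how often the metadata log is compacted:
spark.conf.set("spark.sql.streaming.fileSource.log.compactInterval", "10")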

Regards
Pappu Yadav


Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-21 Thread ZHANG Wei
https://github.com/apache/spark/pull/7334 may answer the question, as quoted below:

>  This patch preserves this optimization by treating logical Limit operators 
> specially when they appear as the terminal operator in a query plan: if a 
> Limit is the final operator, then we will plan a special CollectLimit 
> physical operator which implements the old take()-based logic.

For `spark.sql("select * from db.table limit 100").explain(false)`, `limit` 
is the final operator;
for `spark.sql("select * from db.table limit 
100").repartition(1000).explain(false)`, `repartition` is the final 
operator. If you add a `.limit()` operation after `repartition`, such as 
`spark.sql("select * from db.table limit 
100").repartition(1000).limit(1000).explain(false)`, the `CollectLimit` 
will show again.
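
A compact way to reproduce the three cases described above (the table name is
a placeholder):

spark.sql("select * from db.table limit 100").explain(false)
// CollectLimit 100 is the terminal operator -> old take()-based path

spark.sql("select * from db.table limit 100").repartition(1000).explain(false)
// Exchange RoundRobinPartitioning(1000) is now terminal -> GlobalLimit/LocalLimit instead

spark.sql("select * from db.table limit 100").repartition(1000).limit(1000).explain(false)
// CollectLimit appears again because limit() is terminal once more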

---
Cheers,
-z


From: Yeikel 
Sent: Wednesday, April 15, 2020 2:45
To: user@spark.apache.org
Subject: Re: What is the best way to take the top N entries from a hive 
table/data source?

Looking at the results of explain, I can see a CollectLimit step. Does that
work the same way as a regular .collect() ? (where all records are sent to
the driver?)


spark.sql("select * from db.table limit 100").explain(false)
== Physical Plan ==
CollectLimit 100
+- FileScan parquet ... 806 more fields] Batched: false, Format: Parquet,
Location: CatalogFileIndex[...], PartitionCount: 3, PartitionFilters: [],
PushedFilters: [], ReadSchema:.
db: Unit = ()

The number of partitions is 1 so that makes sense.

spark.sql("select * from db.table limit 100").rdd.partitions.size = 1

As a follow-up, I tried to repartition the resultant dataframe, and while I
can't see the CollectLimit step anymore, it did not make any difference in
the job. I still saw a big task at the end that ends up failing.

spark.sql("select * from db.table limit
100").repartition(1000).explain(false)

Exchange RoundRobinPartitioning(1000)
+- GlobalLimit 100
   +- Exchange SinglePartition
  +- LocalLimit 100  -> Is this a collect?








Using P4J Plugins with Spark

2020-04-21 Thread Shashanka Balakuntala
Hi users,
I'm a bit of a newbie to Spark infrastructure, and I have a small doubt.
I have a Maven project with plugins generated separately in a folder, and
the normal java command to run it is as follows:
`java -Dp4j.pluginsDir=./plugins -jar /path/to/jar`

Now, when I run this program locally with spark-submit against a standalone
cluster (not cluster mode), the program compiles, the plugins are in the
"plugins" folder under $SPARK_HOME, and they are getting recognised.
The same is not the case in cluster mode. It says the Extension point is
not loaded. Please advise on how I can create a "plugins" folder which can
be shared among the workers.

PS: HDFS is not an option as we don't have a separate setup

Thanks.


*Regards*
  Shashanka Balakuntala Srinivasa