Re: is RocksDB backend available in 3.0 preview?
Unfortunately, the short answer is no. Please refer to the last part of the discussion on the PR https://github.com/apache/spark/pull/24922 Unless we get a native implementation of this, I guess this project is the most widely known implementation of a RocksDB-backed state store - https://github.com/chermenin/spark-states On Wed, Apr 22, 2020 at 11:32 AM kant kodali wrote: > Hi All, > > 1. Is the RocksDB backend available in the 3.0 preview? > 2. If RocksDB can store intermediate results of a stream-stream join, can I > run streaming join queries forever - by forever I mean until I run out of > disk? Or, to put it another way, can I run stream-stream join queries for years > if necessary (imagine I have a lot of disk capacity but not a whole lot of > RAM)? > 3. Does it do incremental checkpointing to HDFS? > > Thanks! > >
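For anyone who wants to try the third-party backend mentioned above, here is a minimal sketch of wiring it into a session. The provider class name is taken from the chermenin/spark-states README and may differ between releases, so treat it as an assumption to verify:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("rocksdb-state-store-sketch")
      // Swap the default HDFS-backed state store for the RocksDB-backed one
      // provided by the spark-states library (assumed class name).
      .config("spark.sql.streaming.stateStore.providerClass",
        "ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
      .getOrCreate()

The library jar also has to be on both the driver and executor classpath, e.g. via spark-submit --packages.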
Re: What is the best way to take the top N entries from a hive table/data source?
Hi Zhang, thank you for your response. While your answer clarifies my confusion with `CollectLimit`, it still does not clarify the recommended way to extract large amounts of data (but not all the records) from a source while maintaining a high level of parallelism. For example, in some instances, when trying to extract 1 million records from a table with over 100M records, I see my cluster using 1-2 cores out of the hundreds that I have available.
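One commonly suggested workaround (not an official recommendation) is to avoid an exact limit altogether when an approximate row count is acceptable, since sampling keeps the scan parallel instead of funnelling everything through CollectLimit's single partition. A minimal sketch with placeholder table name and row counts:

    // ~100M-row source table (placeholder name)
    val table = spark.table("db.table")

    // Roughly one million rows, read with the scan's full parallelism.
    val aboutAMillion = table.sample(withReplacement = false, fraction = 1e6 / 1e8)

    println(aboutAMillion.rdd.getNumPartitions) // retains the source partitioning

If an exact count is required, the single-partition exchange described later in this digest is hard to avoid.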
Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting
No, that's not a thing to apologize for. It's just your call - less context would bring less reaction and interest. On Wed, Apr 22, 2020 at 11:50 AM Ruijing Li wrote: > I apologize, but I cannot share it, even if it is just typical spark > libraries. I definitely understand that limits debugging help, but wanted > to understand if anyone has encountered a similar issue. > > On Tue, Apr 21, 2020 at 7:12 PM Jungtaek Lim > wrote: > >> If there's no third party libraries in the dump then why not share the >> thread dump? (I mean, the output of jstack) >> >> stack trace would be more helpful to find which thing acquired lock and >> which other things are waiting for acquiring lock, if we suspect deadlock. >> >> On Wed, Apr 22, 2020 at 2:38 AM Ruijing Li wrote: >> >>> After refreshing a couple of times, I notice the lock is being swapped >>> between these 3. The other 2 will be blocked by whoever gets this lock, in >>> a cycle of 160 has lock -> 161 -> 159 -> 160 >>> >>> On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li >>> wrote: >>> In thread dump, I do see this - SparkUI-160- acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE | Monitor - SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | Blocked by Thread(Some(160)) Lock - SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | Blocked by Thread(Some(160)) Lock Could the fact that 160 has the monitor but is not running be causing a deadlock preventing the job from finishing? I do see my Finalizer and main method are waiting. I don’t see any other threads from 3rd party libraries or my code in the dump. I do see spark context cleaner has timed waiting. Thanks On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li wrote: > Strangely enough I found an old issue that is the exact same issue as > mine > https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343 > > However I’m using spark 2.4.4 so the issue should have been solved by > now. > > Like the user in the jira issue I am using mesos, but I am reading > from oracle instead of writing to Cassandra and S3. > > > On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei wrote: > >> The Thread dump result table of Spark UI can provide some clues to >> find out thread locks issue, such as: >> >> Thread ID | Thread Name | Thread State | Thread >> Locks >> 13| NonBlockingInputStreamThread | WAITING | Blocked >> by Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951 >> }) >> 48| Thread-16| RUNNABLE | >> Monitor(jline.internal.NonBlockingInputStream@103008951}) >> >> And echo thread row can show the call stacks after being clicked, >> then you can check the root cause of holding locks like this(Thread 48 of >> above): >> >> org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native >> Method) >> >> org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811) >> >> org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842) >> >> org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97) >> jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222) >> >> >> Hope it can help you. >> >> -- >> Cheers, >> -z >> >> On Thu, 16 Apr 2020 16:36:42 +0900 >> Jungtaek Lim wrote: >> >> > Do thread dump continuously, per specific period (like 1s) and see >> the >> > change of stack / lock for each thread. (This is not easy to be >> done in UI >> > so maybe doing manually would be the only option. Not sure Spark UI >> will >> > provide the same, haven't used at all.) 
>> > >> > It will tell which thread is being blocked (even it's shown as >> running) and >> > which point to look at. >> > >> > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li >> wrote: >> > >> > > Once I do. thread dump, what should I be looking for to tell >> where it is >> > > hanging? Seeing a lot of timed_waiting and waiting on driver. >> Driver is >> > > also being blocked by spark UI. If there are no tasks, is there a >> point to >> > > do thread dump of executors? >> > > >> > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi < >> gabor.g.somo...@gmail.com> >> > > wrote: >> > > >> > >> The simplest way is to do thread dump which doesn't require any >> fancy >> > >> tool (it's available on Spark UI). >> > >> Without thread dump it's hard to say anything... >> > >> >> > >> >> > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe >> >> > >> wrote: >> > >> >> > >>> Here a is another tool I use Logic Analyser 7:55 >> > >>>
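A minimal sketch of the continuous thread-dump approach Jungtaek suggests in the quoted history, run from inside the driver JVM using only standard JDK classes (running `jstack <pid>` from a shell once per second achieves the same thing):

    import java.util.concurrent.{Executors, TimeUnit}
    import scala.collection.JavaConverters._

    // Print every thread's name, state, and stack once per second - roughly
    // what repeated jstack invocations would show - so changes in lock state
    // can be diffed over time.
    val dumper = Executors.newSingleThreadScheduledExecutor()
    dumper.scheduleAtFixedRate(new Runnable {
      override def run(): Unit =
        Thread.getAllStackTraces.asScala.foreach { case (t, frames) =>
          println(s"${t.getName} | ${t.getState}")
          frames.foreach(f => println(s"    at $f"))
        }
    }, 0, 1, TimeUnit.SECONDS)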
Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting
I apologize, but I cannot share it, even if it is just typical spark libraries. I definitely understand that limits debugging help, but wanted to understand if anyone has encountered a similar issue. On Tue, Apr 21, 2020 at 7:12 PM Jungtaek Lim wrote: > If there's no third party libraries in the dump then why not share the > thread dump? (I mean, the output of jstack) > > stack trace would be more helpful to find which thing acquired lock and > which other things are waiting for acquiring lock, if we suspect deadlock. > > On Wed, Apr 22, 2020 at 2:38 AM Ruijing Li wrote: > >> After refreshing a couple of times, I notice the lock is being swapped >> between these 3. The other 2 will be blocked by whoever gets this lock, in >> a cycle of 160 has lock -> 161 -> 159 -> 160 >> >> On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li >> wrote: >> >>> In thread dump, I do see this >>> - SparkUI-160- acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE | >>> Monitor >>> - SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | >>> Blocked by Thread(Some(160)) Lock >>> - SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | >>> Blocked by Thread(Some(160)) Lock >>> >>> Could the fact that 160 has the monitor but is not running be causing a >>> deadlock preventing the job from finishing? >>> >>> I do see my Finalizer and main method are waiting. I don’t see any other >>> threads from 3rd party libraries or my code in the dump. I do see spark >>> context cleaner has timed waiting. >>> >>> Thanks >>> >>> >>> On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li >>> wrote: >>> Strangely enough I found an old issue that is the exact same issue as mine https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343 However I’m using spark 2.4.4 so the issue should have been solved by now. Like the user in the jira issue I am using mesos, but I am reading from oracle instead of writing to Cassandra and S3. On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei wrote: > The Thread dump result table of Spark UI can provide some clues to > find out thread locks issue, such as: > > Thread ID | Thread Name | Thread State | Thread > Locks > 13| NonBlockingInputStreamThread | WAITING | Blocked by > Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951}) > 48| Thread-16| RUNNABLE | > Monitor(jline.internal.NonBlockingInputStream@103008951}) > > And echo thread row can show the call stacks after being clicked, then > you can check the root cause of holding locks like this(Thread 48 of > above): > > org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native > Method) > > org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811) > > org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842) > > org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97) > jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222) > > > Hope it can help you. > > -- > Cheers, > -z > > On Thu, 16 Apr 2020 16:36:42 +0900 > Jungtaek Lim wrote: > > > Do thread dump continuously, per specific period (like 1s) and see > the > > change of stack / lock for each thread. (This is not easy to be done > in UI > > so maybe doing manually would be the only option. Not sure Spark UI > will > > provide the same, haven't used at all.) > > > > It will tell which thread is being blocked (even it's shown as > running) and > > which point to look at. > > > > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li > wrote: > > > > > Once I do. 
thread dump, what should I be looking for to tell where > it is > > > hanging? Seeing a lot of timed_waiting and waiting on driver. > Driver is > > > also being blocked by spark UI. If there are no tasks, is there a > point to > > > do thread dump of executors? > > > > > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi < > gabor.g.somo...@gmail.com> > > > wrote: > > > > > >> The simplest way is to do thread dump which doesn't require any > fancy > > >> tool (it's available on Spark UI). > > >> Without thread dump it's hard to say anything... > > >> > > >> > > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe > > > >> wrote: > > >> > > >>> Here a is another tool I use Logic Analyser 7:55 > > >>> https://youtu.be/LnzuMJLZRdU > > >>> > > >>> you could take some suggestions for improving performance > queries. > > >>> > https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1 > > >>> > > >>> > > >>> Jane thorpe > > >>> janethor...@aol.com > > >>> > > >>> > > >>>
is RocksDB backend available in 3.0 preview?
Hi All, 1. Is the RocksDB backend available in the 3.0 preview? 2. If RocksDB can store intermediate results of a stream-stream join, can I run streaming join queries forever - by forever I mean until I run out of disk? Or, to put it another way, can I run stream-stream join queries for years if necessary (imagine I have a lot of disk capacity but not a whole lot of RAM)? 3. Does it do incremental checkpointing to HDFS? Thanks!
Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting
If there's no third party libraries in the dump then why not share the thread dump? (I mean, the output of jstack) stack trace would be more helpful to find which thing acquired lock and which other things are waiting for acquiring lock, if we suspect deadlock. On Wed, Apr 22, 2020 at 2:38 AM Ruijing Li wrote: > After refreshing a couple of times, I notice the lock is being swapped > between these 3. The other 2 will be blocked by whoever gets this lock, in > a cycle of 160 has lock -> 161 -> 159 -> 160 > > On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li wrote: > >> In thread dump, I do see this >> - SparkUI-160- acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE | >> Monitor >> - SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | >> Blocked by Thread(Some(160)) Lock >> - SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | >> Blocked by Thread(Some(160)) Lock >> >> Could the fact that 160 has the monitor but is not running be causing a >> deadlock preventing the job from finishing? >> >> I do see my Finalizer and main method are waiting. I don’t see any other >> threads from 3rd party libraries or my code in the dump. I do see spark >> context cleaner has timed waiting. >> >> Thanks >> >> >> On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li wrote: >> >>> Strangely enough I found an old issue that is the exact same issue as >>> mine >>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343 >>> >>> However I’m using spark 2.4.4 so the issue should have been solved by >>> now. >>> >>> Like the user in the jira issue I am using mesos, but I am reading from >>> oracle instead of writing to Cassandra and S3. >>> >>> >>> On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei wrote: >>> The Thread dump result table of Spark UI can provide some clues to find out thread locks issue, such as: Thread ID | Thread Name | Thread State | Thread Locks 13| NonBlockingInputStreamThread | WAITING | Blocked by Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951}) 48| Thread-16| RUNNABLE | Monitor(jline.internal.NonBlockingInputStream@103008951}) And echo thread row can show the call stacks after being clicked, then you can check the root cause of holding locks like this(Thread 48 of above): org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native Method) org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811) org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842) org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97) jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222) Hope it can help you. -- Cheers, -z On Thu, 16 Apr 2020 16:36:42 +0900 Jungtaek Lim wrote: > Do thread dump continuously, per specific period (like 1s) and see the > change of stack / lock for each thread. (This is not easy to be done in UI > so maybe doing manually would be the only option. Not sure Spark UI will > provide the same, haven't used at all.) > > It will tell which thread is being blocked (even it's shown as running) and > which point to look at. > > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li wrote: > > > Once I do. thread dump, what should I be looking for to tell where it is > > hanging? Seeing a lot of timed_waiting and waiting on driver. Driver is > > also being blocked by spark UI. If there are no tasks, is there a point to > > do thread dump of executors? 
> > > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi < gabor.g.somo...@gmail.com> > > wrote: > > > >> The simplest way is to do thread dump which doesn't require any fancy > >> tool (it's available on Spark UI). > >> Without thread dump it's hard to say anything... > >> > >> > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe > >> wrote: > >> > >>> Here a is another tool I use Logic Analyser 7:55 > >>> https://youtu.be/LnzuMJLZRdU > >>> > >>> you could take some suggestions for improving performance queries. > >>> https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1 > >>> > >>> > >>> Jane thorpe > >>> janethor...@aol.com > >>> > >>> > >>> -Original Message- > >>> From: jane thorpe > >>> To: janethorpe1 ; mich.talebzadeh < > >>> mich.talebza...@gmail.com>; liruijing09 ; user < > >>> user@spark.apache.org> > >>> Sent: Mon, 13 Apr 2020 8:32 > >>> Subject: Re: Spark hangs while reading from jdbc - does nothing Removing > >>> Guess work from trouble shooting > >>> > >>> > >>> >
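Since the raw jstack output cannot be shared, a redacted lock summary can be produced in-process instead; this sketch uses only the standard JDK ThreadMXBean, so no proprietary stack frames need to leak into it:

    import java.lang.management.ManagementFactory

    val bean = ManagementFactory.getThreadMXBean
    // lockedMonitors/lockedSynchronizers = true so the owner of each lock is
    // reported alongside the threads waiting on it.
    bean.dumpAllThreads(true, true).foreach { info =>
      val owner = Option(info.getLockOwnerName)
        .map(o => s" | waiting on lock held by $o")
        .getOrElse("")
      println(s"${info.getThreadName} | ${info.getThreadState}$owner")
    }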
Spark MongoDB connector hangs indefinitely, not working on Amazon EMR
When running a PySpark application on my local machine I am able to save and retrieve from the MongoDB server using the MongoDB Spark connector. All works properly. When submitting the exact same application on my Amazon EMR cluster I can see that the package for the Spark driver is properly collected from Maven when the job is submitted. However, it is not working. From my instance of Amazon EMR I can communicate with the database using PyMongo without problems. I can load/save dataframes when using pyspark interactively from the driver, but when submitting jobs via spark-submit over the YARN cluster it hangs. The problem gives no error messages; it just shows 0 activity on the driver and executor. The PySpark application just stops until manually terminated. Has anyone else used the MongoDB Spark connector from Amazon EMR?
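For reference, a sketch of the connector setup the post implies. The coordinates, URIs, and the "mongo" format alias follow the 2.x connector documentation, so verify them against the version in use; the poster is on PySpark, but the option keys are identical (Scala shown to match the rest of this digest):

    // Submitted with something like (placeholder version):
    //   spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 ...
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("mongo-emr-sketch")
      .config("spark.mongodb.input.uri", "mongodb://host:27017/db.collection")  // placeholder
      .config("spark.mongodb.output.uri", "mongodb://host:27017/db.collection") // placeholder
      .getOrCreate()

    // If this hangs only under YARN, a common first check is whether the
    // *executor* nodes (not just the master) can reach the Mongo host -
    // security groups and VPC routing often differ per node role.
    val df = spark.read.format("mongo").load()
    df.show()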
Re: Spark Structured Streaming | FileStreamSourceLog not deleting list of input files | Spark 2.4.0
You're hitting an existing issue, https://issues.apache.org/jira/browse/SPARK-17604. While there's no active PR to address it, I've been planning to take a look sooner rather than later. By the way, you may also want to take a look at my previous mail - the topic of that thread was the file stream sink metadata growing bigger, but in fact it's basically the same issue, so you may get some information from there. (tl;dr: I have a bunch of PRs addressing multiple issues in the file stream source and sink; they're just lacking some love.) https://lists.apache.org/thread.html/rb4ebf1d20d13db0a78694e8d301e51c326f803cb86fc1a1f66f2ae7e%40%3Cuser.spark.apache.org%3E Thanks, Jungtaek Lim (HeartSaVioR) On Tue, Apr 21, 2020 at 8:23 PM Pappu Yadav wrote: > Hi Team, > > While running Spark, below are some findings. > >1. FileStreamSourceLog is responsible for maintaining the input source >file list. >2. Spark Streaming deletes expired log files on the basis of >*spark.sql.streaming.fileSource.log.deletion* and >*spark.sql.streaming.minBatchesToRetain*. >3. But while compacting logs, Spark Streaming writes the complete list >of files the stream has seen till now into one single .compact file in HDFS. >4. Over the course of time this compact file grows to around >2GB-5GB in HDFS, which delays creation of the compact file after every 10th >batch and also increases job restart time. >5. Why is Spark Streaming logging files that are >already deleted from the system? While creating the compact file there should be some configurable >timeout so that Spark can skip writing the expired list of input files. > > *Also kindly let me know if I missed something and there is some > configuration already present to handle this.* > > Regards > Pappu Yadav >
Re: Using startingOffsets latest - no data from structured streaming kafka query
Yes, we did. But for some reason latest does not show them. The count is always 0. On Sun, Apr 19, 2020 at 3:42 PM Jungtaek Lim wrote: > Did you provide more records to the topic "after" you started the query? > That's the only one I can imagine based on such information. > > On Fri, Apr 17, 2020 at 9:13 AM Ruijing Li wrote: > >> Hi all, >> >> Apologies if this has been asked before, but I could not find the answer >> to this question. We have a structured streaming job, but for some reason, >> if we use startingOffsets = latest with foreachBatch mode, it doesn't >> produce any data. >> >> Rather, in the logs I see it repeat the message “Fetcher [Consumer] >> Resetting offset for partition to offset” over and over again. >> >> However with startingOffsets=earliest, we don't get this issue. I'm >> wondering then how we can use startingOffsets=latest, as I wish to start >> from the latest offset available. >> -- >> Cheers, >> Ruijing Li >> > -- Cheers, Ruijing Li
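For context, a sketch of the setup under discussion (broker, topic, and checkpoint path are placeholders). One detail worth checking, per the Kafka integration guide: startingOffsets only applies when the query starts with no existing checkpoint - on restart, offsets come from the checkpoint instead, so a stale checkpoint directory can also make a "latest" query appear stuck:

    import org.apache.spark.sql.DataFrame

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder
      .option("startingOffsets", "latest")              // "earliest" reads the whole retained log
      .load()

    val query = stream.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // With latest, this only sees records produced after the query started.
        println(s"batch $batchId count=${batch.count()}")
      }
      .option("checkpointLocation", "/tmp/checkpoints/kafka-latest") // placeholder
      .start()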
Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting
In thread dump, I do see this - SparkUI-160- acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE | Monitor - SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | Blocked by Thread(Some(160)) Lock - SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | Blocked by Thread(Some(160)) Lock Could the fact that 160 has the monitor but is not running be causing a deadlock preventing the job from finishing? I do see my Finalizer and main method are waiting. I don’t see any other threads from 3rd party libraries or my code in the dump. I do see spark context cleaner has timed waiting. Thanks On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li wrote: > Strangely enough I found an old issue that is the exact same issue as mine > https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343 > > However I’m using spark 2.4.4 so the issue should have been solved by now. > > Like the user in the jira issue I am using mesos, but I am reading from > oracle instead of writing to Cassandra and S3. > > > On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei wrote: > >> The Thread dump result table of Spark UI can provide some clues to find >> out thread locks issue, such as: >> >> Thread ID | Thread Name | Thread State | Thread Locks >> 13| NonBlockingInputStreamThread | WAITING | Blocked by >> Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951}) >> 48| Thread-16| RUNNABLE | >> Monitor(jline.internal.NonBlockingInputStream@103008951}) >> >> And echo thread row can show the call stacks after being clicked, then >> you can check the root cause of holding locks like this(Thread 48 of above): >> >> org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native Method) >> >> org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811) >> >> org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842) >> >> org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97) >> jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222) >> >> >> Hope it can help you. >> >> -- >> Cheers, >> -z >> >> On Thu, 16 Apr 2020 16:36:42 +0900 >> Jungtaek Lim wrote: >> >> > Do thread dump continuously, per specific period (like 1s) and see the >> > change of stack / lock for each thread. (This is not easy to be done in >> UI >> > so maybe doing manually would be the only option. Not sure Spark UI will >> > provide the same, haven't used at all.) >> > >> > It will tell which thread is being blocked (even it's shown as running) >> and >> > which point to look at. >> > >> > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li >> wrote: >> > >> > > Once I do. thread dump, what should I be looking for to tell where it >> is >> > > hanging? Seeing a lot of timed_waiting and waiting on driver. Driver >> is >> > > also being blocked by spark UI. If there are no tasks, is there a >> point to >> > > do thread dump of executors? >> > > >> > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi < >> gabor.g.somo...@gmail.com> >> > > wrote: >> > > >> > >> The simplest way is to do thread dump which doesn't require any fancy >> > >> tool (it's available on Spark UI). >> > >> Without thread dump it's hard to say anything... >> > >> >> > >> >> > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe >> >> > >> wrote: >> > >> >> > >>> Here a is another tool I use Logic Analyser 7:55 >> > >>> https://youtu.be/LnzuMJLZRdU >> > >>> >> > >>> you could take some suggestions for improving performance queries. 
>> > >>> >> https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1 >> > >>> >> > >>> >> > >>> Jane thorpe >> > >>> janethor...@aol.com >> > >>> >> > >>> >> > >>> -Original Message- >> > >>> From: jane thorpe >> > >>> To: janethorpe1 ; mich.talebzadeh < >> > >>> mich.talebza...@gmail.com>; liruijing09 ; >> user < >> > >>> user@spark.apache.org> >> > >>> Sent: Mon, 13 Apr 2020 8:32 >> > >>> Subject: Re: Spark hangs while reading from jdbc - does nothing >> Removing >> > >>> Guess work from trouble shooting >> > >>> >> > >>> >> > >>> >> > >>> This tool may be useful for you to trouble shoot your problems away. >> > >>> >> > >>> >> > >>> >> https://www.javacodegeeks.com/2020/04/simplifying-apm-remove-the-guesswork-from-troubleshooting.html >> > >>> >> > >>> >> > >>> "APM tools typically use a waterfall-type view to show the blocking >> > >>> time of different components cascading through the control flow >> within an >> > >>> application. >> > >>> These types of visualizations are useful, and AppOptics has them, >> but >> > >>> they can be difficult to understand for those of us without a PhD." >> > >>> >> > >>> Especially helpful if you want to understand through visualisation >> and >> > >>> you do not have a phD. >> > >>> >> > >>> >> > >>> Jane thorpe >> > >>> janethor...@aol.com >> > >>> >> > >>> >> > >>> -Original Message- >> > >>> From: jane thorpe >> > >>> To:
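To answer the deadlock question directly: a thread that *holds* a monitor while others block on it is ordinary contention, and only becomes a deadlock if the holder is itself stuck waiting on one of the blocked threads - a cycle the JDK can detect for you. A minimal sketch using the standard ThreadMXBean:

    import java.lang.management.ManagementFactory

    val bean = ManagementFactory.getThreadMXBean
    // findDeadlockedThreads returns null when no lock cycle exists; a true
    // 160 -> 161 -> 159 -> 160 deadlock would show up here.
    Option(bean.findDeadlockedThreads()).foreach { ids =>
      bean.getThreadInfo(ids, true, true).foreach(info => println(info))
    }

If this prints nothing, the hang is more likely a thread waiting elsewhere than a Jetty acceptor monitor cycle.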
Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting
After refreshing a couple of times, I notice the lock is being swapped between these 3. The other 2 will be blocked by whoever gets this lock, in a cycle of 160 has lock -> 161 -> 159 -> 160 On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li wrote: > In thread dump, I do see this > - SparkUI-160- acceptor-id-ServerConnector@id(HTTP/1.1) | RUNNABLE | > Monitor > - SparkUI-161-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | > Blocked by Thread(Some(160)) Lock > - SparkUI-159-acceptor-id-ServerConnector@id(HTTP/1.1) | BLOCKED | > Blocked by Thread(Some(160)) Lock > > Could the fact that 160 has the monitor but is not running be causing a > deadlock preventing the job from finishing? > > I do see my Finalizer and main method are waiting. I don’t see any other > threads from 3rd party libraries or my code in the dump. I do see spark > context cleaner has timed waiting. > > Thanks > > > On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li wrote: > >> Strangely enough I found an old issue that is the exact same issue as >> mine >> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343 >> >> However I’m using spark 2.4.4 so the issue should have been solved by now. >> >> Like the user in the jira issue I am using mesos, but I am reading from >> oracle instead of writing to Cassandra and S3. >> >> >> On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei wrote: >> >>> The Thread dump result table of Spark UI can provide some clues to find >>> out thread locks issue, such as: >>> >>> Thread ID | Thread Name | Thread State | Thread Locks >>> 13| NonBlockingInputStreamThread | WAITING | Blocked by >>> Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951}) >>> 48| Thread-16| RUNNABLE | >>> Monitor(jline.internal.NonBlockingInputStream@103008951}) >>> >>> And echo thread row can show the call stacks after being clicked, then >>> you can check the root cause of holding locks like this(Thread 48 of above): >>> >>> org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native Method) >>> >>> org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811) >>> >>> org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842) >>> >>> org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97) >>> jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222) >>> >>> >>> Hope it can help you. >>> >>> -- >>> Cheers, >>> -z >>> >>> On Thu, 16 Apr 2020 16:36:42 +0900 >>> Jungtaek Lim wrote: >>> >>> > Do thread dump continuously, per specific period (like 1s) and see the >>> > change of stack / lock for each thread. (This is not easy to be done >>> in UI >>> > so maybe doing manually would be the only option. Not sure Spark UI >>> will >>> > provide the same, haven't used at all.) >>> > >>> > It will tell which thread is being blocked (even it's shown as >>> running) and >>> > which point to look at. >>> > >>> > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li >>> wrote: >>> > >>> > > Once I do. thread dump, what should I be looking for to tell where >>> it is >>> > > hanging? Seeing a lot of timed_waiting and waiting on driver. Driver >>> is >>> > > also being blocked by spark UI. If there are no tasks, is there a >>> point to >>> > > do thread dump of executors? >>> > > >>> > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi < >>> gabor.g.somo...@gmail.com> >>> > > wrote: >>> > > >>> > >> The simplest way is to do thread dump which doesn't require any >>> fancy >>> > >> tool (it's available on Spark UI). 
>>> > >> Without thread dump it's hard to say anything... >>> > >> >>> > >> >>> > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe >>> >>> > >> wrote: >>> > >> >>> > >>> Here a is another tool I use Logic Analyser 7:55 >>> > >>> https://youtu.be/LnzuMJLZRdU >>> > >>> >>> > >>> you could take some suggestions for improving performance queries. >>> > >>> >>> https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1 >>> > >>> >>> > >>> >>> > >>> Jane thorpe >>> > >>> janethor...@aol.com >>> > >>> >>> > >>> >>> > >>> -Original Message- >>> > >>> From: jane thorpe >>> > >>> To: janethorpe1 ; mich.talebzadeh < >>> > >>> mich.talebza...@gmail.com>; liruijing09 ; >>> user < >>> > >>> user@spark.apache.org> >>> > >>> Sent: Mon, 13 Apr 2020 8:32 >>> > >>> Subject: Re: Spark hangs while reading from jdbc - does nothing >>> Removing >>> > >>> Guess work from trouble shooting >>> > >>> >>> > >>> >>> > >>> >>> > >>> This tool may be useful for you to trouble shoot your problems >>> away. >>> > >>> >>> > >>> >>> > >>> >>> https://www.javacodegeeks.com/2020/04/simplifying-apm-remove-the-guesswork-from-troubleshooting.html >>> > >>> >>> > >>> >>> > >>> "APM tools typically use a waterfall-type view to show the blocking >>> > >>> time of different components cascading through the control flow >>> within an >>> > >>> application. >>> > >>> These types of visualizations are
Re: Spark hangs while reading from jdbc - does nothing / Removing guesswork from troubleshooting
Strangely enough I found an old issue that is the exact same issue as mine https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343 However I’m using spark 2.4.4 so the issue should have been solved by now. Like the user in the jira issue I am using mesos, but I am reading from oracle instead of writing to Cassandra and S3. On Thu, Apr 16, 2020 at 1:54 AM ZHANG Wei wrote: > The Thread dump result table of Spark UI can provide some clues to find > out thread locks issue, such as: > > Thread ID | Thread Name | Thread State | Thread Locks > 13| NonBlockingInputStreamThread | WAITING | Blocked by > Thread Some(48) Lock(jline.internal.NonBlockingInputStream@103008951}) > 48| Thread-16| RUNNABLE | > Monitor(jline.internal.NonBlockingInputStream@103008951}) > > And echo thread row can show the call stacks after being clicked, then you > can check the root cause of holding locks like this(Thread 48 of above): > > org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native Method) > > org.fusesource.jansi.internal.Kernel32.readConsoleInputHelper(Kernel32.java:811) > > org.fusesource.jansi.internal.Kernel32.readConsoleKeyInput(Kernel32.java:842) > > org.fusesource.jansi.internal.WindowsSupport.readConsoleInput(WindowsSupport.java:97) > jline.WindowsTerminal.readConsoleInput(WindowsTerminal.java:222) > > > Hope it can help you. > > -- > Cheers, > -z > > On Thu, 16 Apr 2020 16:36:42 +0900 > Jungtaek Lim wrote: > > > Do thread dump continuously, per specific period (like 1s) and see the > > change of stack / lock for each thread. (This is not easy to be done in > UI > > so maybe doing manually would be the only option. Not sure Spark UI will > > provide the same, haven't used at all.) > > > > It will tell which thread is being blocked (even it's shown as running) > and > > which point to look at. > > > > On Thu, Apr 16, 2020 at 4:29 PM Ruijing Li > wrote: > > > > > Once I do. thread dump, what should I be looking for to tell where it > is > > > hanging? Seeing a lot of timed_waiting and waiting on driver. Driver is > > > also being blocked by spark UI. If there are no tasks, is there a > point to > > > do thread dump of executors? > > > > > > On Tue, Apr 14, 2020 at 4:49 AM Gabor Somogyi < > gabor.g.somo...@gmail.com> > > > wrote: > > > > > >> The simplest way is to do thread dump which doesn't require any fancy > > >> tool (it's available on Spark UI). > > >> Without thread dump it's hard to say anything... > > >> > > >> > > >> On Tue, Apr 14, 2020 at 11:32 AM jane thorpe > > > >> wrote: > > >> > > >>> Here a is another tool I use Logic Analyser 7:55 > > >>> https://youtu.be/LnzuMJLZRdU > > >>> > > >>> you could take some suggestions for improving performance queries. > > >>> > https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1 > > >>> > > >>> > > >>> Jane thorpe > > >>> janethor...@aol.com > > >>> > > >>> > > >>> -Original Message- > > >>> From: jane thorpe > > >>> To: janethorpe1 ; mich.talebzadeh < > > >>> mich.talebza...@gmail.com>; liruijing09 ; > user < > > >>> user@spark.apache.org> > > >>> Sent: Mon, 13 Apr 2020 8:32 > > >>> Subject: Re: Spark hangs while reading from jdbc - does nothing > Removing > > >>> Guess work from trouble shooting > > >>> > > >>> > > >>> > > >>> This tool may be useful for you to trouble shoot your problems away. 
> > >>> > > >>> > > >>> > https://www.javacodegeeks.com/2020/04/simplifying-apm-remove-the-guesswork-from-troubleshooting.html > > >>> > > >>> > > >>> "APM tools typically use a waterfall-type view to show the blocking > > >>> time of different components cascading through the control flow > within an > > >>> application. > > >>> These types of visualizations are useful, and AppOptics has them, but > > >>> they can be difficult to understand for those of us without a PhD." > > >>> > > >>> Especially helpful if you want to understand through visualisation > and > > >>> you do not have a phD. > > >>> > > >>> > > >>> Jane thorpe > > >>> janethor...@aol.com > > >>> > > >>> > > >>> -Original Message- > > >>> From: jane thorpe > > >>> To: mich.talebzadeh ; liruijing09 < > > >>> liruijin...@gmail.com>; user > > >>> CC: user > > >>> Sent: Sun, 12 Apr 2020 4:35 > > >>> Subject: Re: Spark hangs while reading from jdbc - does nothing > > >>> > > >>> You seem to be implying the error is intermittent. > > >>> You seem to be implying data is being ingested via JDBC. So the > > >>> connection has proven itself to be working unless no data is > arriving from > > >>> the JDBC channel at all. If no data is arriving then one could say > it > > >>> could be the JDBC. > > >>> If the error is intermittent then it is likely a resource involved > in > > >>> processing is filling to capacity. > > >>> Try reducing the data ingestion volume and see if that completes, > then > > >>> increase the data ingested incrementally. > > >>> I assume you have
Re: Using P4J Plugins with Spark
You may want to make sure you include the jar of P4J and your plugins as part of the following so that both the driver and executors have access. If HDFS is out, then you could make a common mount point on each of the executor nodes so they have access to the classes. - spark-submit --jars /common/path/to/jars - spark.driver.extraClassPath, or its alias --driver-class-path, to set extra classpaths on the node running the driver. - spark.executor.extraClassPath to set the extra classpath on the worker nodes. A sketch of a full command is below. On Tue, Apr 21, 2020 at 1:13 AM Shashanka Balakuntala < shbalakunt...@gmail.com> wrote: > Hi users, > I'm a bit of a newbie to Spark infrastructure, and I have a small doubt. > I have a Maven project with plugins generated separately in a folder, and the > normal java command to run it is as follows: > `java -Dp4j.pluginsDir=./plugins -jar /path/to/jar` > > Now when I run this program locally with spark-submit on a standalone > cluster (not cluster mode), the program compiles, the plugins are in the "plugins" > folder in $SPARK_HOME, and they are recognised. > The same is not the case in cluster mode: it says the extension point is > not loaded. Please advise on how I can create a "plugins" folder that can be shared > among the workers. > > PS: HDFS is not an option as we don't have a separate setup > > Thanks. > > > *Regards* > Shashanka Balakuntala Srinivasa > >
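Putting those pieces together, a hedged sketch of the submit command. All paths and the main class are placeholders, and it assumes the plugins directory is mounted at the same path on every node; -Dp4j.pluginsDir mirrors the java flag from the original post:

    spark-submit \
      --class com.example.Main \
      --jars /common/mount/pf4j.jar,/common/mount/my-plugin.jar \
      --driver-class-path /common/mount/pf4j.jar \
      --conf spark.executor.extraClassPath=/common/mount/pf4j.jar \
      --conf spark.driver.extraJavaOptions=-Dp4j.pluginsDir=/common/mount/plugins \
      --conf spark.executor.extraJavaOptions=-Dp4j.pluginsDir=/common/mount/plugins \
      /path/to/app.jar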
Spark Structured Streaming | FileStreamSourceLog not deleting list of input files | Spark 2.4.0
Hi Team, While running Spark, below are some findings. 1. FileStreamSourceLog is responsible for maintaining the input source file list. 2. Spark Streaming deletes expired log files on the basis of *spark.sql.streaming.fileSource.log.deletion* and *spark.sql.streaming.minBatchesToRetain*. 3. But while compacting logs, Spark Streaming writes the complete list of files the stream has seen till now into one single .compact file in HDFS. 4. Over the course of time this compact file grows to around 2GB-5GB in HDFS, which delays creation of the compact file after every 10th batch and also increases job restart time. 5. Why is Spark Streaming logging files that are already deleted from the system? While creating the compact file there should be some configurable timeout so that Spark can skip writing the expired list of input files. *Also kindly let me know if I missed something and there is some configuration already present to handle this.* Regards Pappu Yadav
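For readers who want to reproduce or tune this, the settings named above (plus the compaction interval that produces the every-10th-batch behaviour described in point 4) can be set like this; the values shown are the defaults, not recommendations:

    // Whether expired entries may be deleted from the file source log at all.
    spark.conf.set("spark.sql.streaming.fileSource.log.deletion", "true")

    // How many batches of metadata must be kept around for recovery.
    spark.conf.set("spark.sql.streaming.minBatchesToRetain", "100")

    // Every N batches the log is rewritten into a single .compact file -
    // the file whose unbounded growth this post describes.
    spark.conf.set("spark.sql.streaming.fileSource.log.compactInterval", "10")

As Jungtaek's reply earlier in this digest notes, entries for already-processed files are still retained in the compact file; that retention is the open issue SPARK-17604 rather than a missing setting.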
Re: What is the best way to take the top N entries from a hive table/data source?
https://github.com/apache/spark/pull/7334 may explain the question, as below: > This patch preserves this optimization by treating logical Limit operators > specially when they appear as the terminal operator in a query plan: if a > Limit is the final operator, then we will plan a special CollectLimit > physical operator which implements the old take()-based logic. For `spark.sql("select * from db.table limit 100").explain(false)`, `limit` is the final operator; for `spark.sql("select * from db.table limit 100").repartition(1000).explain(false)`, `repartition` is the final operator. If you add a `.limit()` operation after `repartition`, such as `spark.sql("select * from db.table limit 100").repartition(1000).limit(1000).explain(false)`, the `CollectLimit` will show again. --- Cheers, -z From: Yeikel Sent: Wednesday, April 15, 2020 2:45 To: user@spark.apache.org Subject: Re: What is the best way to take the top N entries from a hive table/data source? Looking at the results of explain, I can see a CollectLimit step. Does that work the same way as a regular .collect()? (where all records are sent to the driver?) spark.sql("select * from db.table limit 100").explain(false) == Physical Plan == CollectLimit 100 +- FileScan parquet ... 806 more fields] Batched: false, Format: Parquet, Location: CatalogFileIndex[...], PartitionCount: 3, PartitionFilters: [], PushedFilters: [], ReadSchema: ... The number of partitions is 1, so that makes sense. spark.sql("select * from db.table limit 100").rdd.partitions.size = 1 As a follow up, I tried to repartition the resultant dataframe, and while I can't see the CollectLimit step anymore, it did not make any difference in the job. I still saw a big task at the end that ends up failing. spark.sql("select * from db.table limit 100").repartition(1000).explain(false) Exchange RoundRobinPartitioning(1000) +- GlobalLimit 100 +- Exchange SinglePartition +- LocalLimit 100 -> Is this a collect?
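A compact way to see the rule described above in one place (placeholder table name; plan output abbreviated from the thread):

    // Terminal limit: planned as CollectLimit, which funnels the rows
    // through a single partition using take()-based logic.
    spark.sql("select * from db.table limit 100").explain(false)
    // == Physical Plan ==
    // CollectLimit 100
    // +- FileScan parquet ...

    // Non-terminal limit: repartition is now the final operator, so the
    // plan shows GlobalLimit/LocalLimit plus the SinglePartition exchange.
    spark.sql("select * from db.table limit 100").repartition(1000).explain(false)
    // Exchange RoundRobinPartitioning(1000)
    // +- GlobalLimit 100
    //    +- Exchange SinglePartition
    //       +- LocalLimit 100

    // Appending another .limit() makes limit terminal again, so CollectLimit
    // reappears, exactly as described in the reply.
    spark.sql("select * from db.table limit 100").repartition(1000).limit(1000).explain(false)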
Using P4J Plugins with Spark
Hi users, I'm a bit of a newbie to Spark infrastructure, and I have a small doubt. I have a Maven project with plugins generated separately in a folder, and the normal java command to run it is as follows: `java -Dp4j.pluginsDir=./plugins -jar /path/to/jar` Now when I run this program locally with spark-submit on a standalone cluster (not cluster mode), the program compiles, the plugins are in the "plugins" folder in $SPARK_HOME, and they are recognised. The same is not the case in cluster mode: it says the extension point is not loaded. Please advise on how I can create a "plugins" folder that can be shared among the workers. PS: HDFS is not an option as we don't have a separate setup. Thanks. *Regards* Shashanka Balakuntala Srinivasa