Re: Does Spark have a plan to move away from sun.misc.Unsafe?
Here you go. The umbrella ticket: https://issues.apache.org/jira/browse/SPARK-24417 and the sun.misc.Unsafe one: https://issues.apache.org/jira/browse/SPARK-24421

On Wed, Oct 24, 2018 at 8:08 PM kant kodali wrote:
>
> Hi All,
>
> Does Spark have a plan to move away from sun.misc.Unsafe to VarHandles? I am
> trying to find a JIRA issue for this.
>
> Thanks!

--
Sent from my iPhone

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
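For context on what the migration tracked in those tickets involves: code in this style obtains sun.misc.Unsafe reflectively and uses it for raw off-heap memory access. Below is a minimal, hedged sketch of that pattern (illustrative only, not Spark's actual Platform code; the object name is made up). VarHandles (JEP 193) cover the field and array access side of Unsafe, but raw off-heap allocation like this has no direct VarHandle equivalent, which is likely part of why the work is tracked as an umbrella ticket.

```scala
import java.lang.reflect.Field
import sun.misc.Unsafe

object UnsafeDemo {
  // sun.misc.Unsafe cannot be constructed directly; the usual trick is to
  // read the private static "theUnsafe" field reflectively.
  private val unsafe: Unsafe = {
    val f: Field = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }

  // Write a long into freshly allocated off-heap memory and read it back.
  def roundTrip(value: Long): Long = {
    val addr = unsafe.allocateMemory(8)
    try {
      unsafe.putLong(addr, value)
      unsafe.getLong(addr)
    } finally unsafe.freeMemory(addr)
  }

  def main(args: Array[String]): Unit =
    println(roundTrip(42L))
}
```

Note the allocate/free pairing: unlike heap objects, this memory is invisible to the GC, which is the appeal (and danger) of Unsafe.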
Re: [Spark UI] Spark 2.3.1 UI no longer respects spark.ui.retainedJobs
Done: https://issues.apache.org/jira/browse/SPARK-25837

On Thu, Oct 25, 2018 at 10:21 AM Marcelo Vanzin wrote:
> Ah, that makes more sense. Could you file a bug with that information
> so we don't lose track of this?
>
> Thanks
>
> On Wed, Oct 24, 2018 at 6:13 PM Patrick Brown wrote:
> >
> > In my production application I am running ~200 jobs at once, but I continue
> > to submit jobs in this manner for sometimes ~1 hour.
> >
> > The reproduction code above generally only has about 4 jobs running at
> > once, and as you can see it runs through 50k jobs in this manner.
> >
> > I should clarify my statement above: the issue seems to appear when running
> > multiple jobs at once, as well as in sequence for a while, and may have
> > something to do with high master CPU usage (hence the collect in the code).
> > My rough guess would be that whatever manages clearing out completed jobs
> > gets overwhelmed (my master was a 4-core machine while running this, and
> > htop reported almost full CPU usage across all 4 cores).
> >
> > The attached screenshot shows the state of the web UI after running the
> > repro code; you can see the UI is displaying some 43k completed jobs (it
> > takes a long time to load). After a few minutes of inactivity this will
> > clear out; however, as my production application continues to submit jobs
> > every once in a while, the issue persists.
> >
> > On Wed, Oct 24, 2018 at 5:05 PM Marcelo Vanzin wrote:
> >>
> >> When you say many jobs at once, what ballpark are you talking about?
> >>
> >> The code in 2.3+ does try to keep data about all running jobs and
> >> stages regardless of the limit. If you're running into issues because
> >> of that, we may have to look again at whether that's the right thing
> >> to do.
> >>
> >> On Tue, Oct 23, 2018 at 10:02 AM Patrick Brown wrote:
> >> >
> >> > I believe I may be able to reproduce this now; it seems like it may
> >> > be something to do with many jobs at once:
> >> >
> >> > Spark 2.3.1
> >> >
> >> > > spark-shell --conf spark.ui.retainedJobs=1
> >> >
> >> > scala> import scala.concurrent._
> >> > scala> import scala.concurrent.ExecutionContext.Implicits.global
> >> > scala> for (i <- 0 until 50000) { Future { println(sc.parallelize(0 until i).collect.length) } }
> >> >
> >> > On Mon, Oct 22, 2018 at 11:25 AM Marcelo Vanzin wrote:
> >> >>
> >> >> Just tried on 2.3.2 and it worked fine for me. The UI had a single
> >> >> job and a single stage (+ the tasks related to that single stage),
> >> >> same thing in memory (checked with jvisualvm).
> >> >>
> >> >> On Sat, Oct 20, 2018 at 6:45 PM Marcelo Vanzin wrote:
> >> >> >
> >> >> > On Tue, Oct 16, 2018 at 9:34 AM Patrick Brown wrote:
> >> >> > > I recently upgraded to Spark 2.3.1. I have had these same
> >> >> > > settings in my spark-submit script, which worked on 2.0.2, and
> >> >> > > according to the documentation they appear not to have changed:
> >> >> > >
> >> >> > > spark.ui.retainedTasks=1
> >> >> > > spark.ui.retainedStages=1
> >> >> > > spark.ui.retainedJobs=1
> >> >> >
> >> >> > I tried that locally on the current master and it seems to be
> >> >> > working. I don't have 2.3 easily in front of me right now, but
> >> >> > will take a look Monday.
> >> >> >
> >> >> > --
> >> >> > Marcelo
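The behavior Marcelo describes (the 2.3+ store keeps data for all running jobs and applies the retained limit only to completed ones) can be illustrated with a small sketch. This is purely illustrative, not Spark's actual AppStatusListener code; the class and method names are made up:

```scala
import scala.collection.mutable

// Illustrative sketch of bounded job retention: running jobs are always
// tracked; completed jobs are evicted FIFO once the retained limit is hit.
final class JobStore(retainedJobs: Int) {
  private val completed = mutable.Queue.empty[Int]
  private val running   = mutable.Set.empty[Int]

  def jobStarted(id: Int): Unit = running += id

  def jobFinished(id: Int): Unit = {
    running -= id
    completed.enqueue(id)
    // Evict oldest completed jobs beyond the limit.
    while (completed.size > retainedJobs) completed.dequeue()
  }

  def trackedJobs: Int = running.size + completed.size
}
```

If eviction like the `while` loop above ever falls behind the submission rate (for example because the listener thread is starved on a CPU-saturated master), completed entries would accumulate, which matches the symptom reported in this thread.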
Spark SQL Error
Hi all,

I am getting the following error message in one of my Spark SQL queries. I realize this may be related to the Spark version or a configuration change, but I want to understand the details and find a resolution. Thanks.

    spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but
    current version of codegened fast hashmap does not support this aggregate
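Without the full query plan the root cause can't be confirmed, but assuming the flag behaves as in open-source Spark, one workaround to try while debugging is to disable the two-level fast hashmap for the affected session. Note this is a hypothetical workaround that trades the optimization away rather than fixing the underlying issue:

```sql
-- Hypothetical workaround (config fragment): fall back to the single-level
-- codegen hashmap for aggregates in this session.
SET spark.sql.codegen.aggregate.map.twolevel.enabled=false;
```

If the error disappears with the flag off, that at least confirms the aggregate in question is one the fast hashmap path does not support.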
Re: Watermarking without aggregation with Structured Streaming
Hello peay-2,

Were you able to find a solution to your problem? Were you able to get the watermark timestamp made available through a function?

Regards,
Sanjay

peay-2 wrote
> Thanks for the pointers. I guess right now the only workaround would be to
> apply a "dummy" aggregation (e.g., group by the timestamp itself) only to
> have the stateful processing logic kick in and apply the filtering?
>
> For my purposes, an alternative solution to pushing it out to the source
> would be to make the watermark timestamp available through a function so
> that it can be used in a regular filter clause. Based on my experiments,
> the timestamp is computed and updated even when no stateful computations
> occur. I am not sure how easy that would be to contribute, though; maybe
> someone can suggest a starting point?

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
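The "filter by watermark without aggregation" behavior being discussed can be stated in a few lines of plain Scala. This is an illustration of the semantics only, not Structured Streaming's API or implementation: track the maximum event time seen so far, and drop any record older than that maximum minus the allowed delay.

```scala
// Illustrative sketch of watermark-based late-record filtering (the names
// and the single-threaded design are assumptions, not Spark code).
final class WatermarkFilter(delayMs: Long) {
  private var maxEventTime = Long.MinValue

  // Returns true if the record is on time relative to the current watermark
  // (max event time seen, minus the allowed delay); also advances the max.
  def accept(eventTimeMs: Long): Boolean = {
    maxEventTime = math.max(maxEventTime, eventTimeMs)
    eventTimeMs >= maxEventTime - delayMs
  }
}
```

The "dummy aggregation" workaround quoted above exists because, in Structured Streaming, logic like this is applied only by stateful operators, even though (as the poster observed) the watermark timestamp itself is maintained regardless.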
Re: [External Sender] Having access to spark results
Femi,

We have a solution that needs to be both on-prem and in the cloud; not sure how that impacts anything. What we want is to run an analytical query on a large dataset (ours is over Cassandra) -- so batch in that sense, but think on-demand -- and then have the result be entirely (not just the first x rows) available for a web application to access. Web applications work over a REST API, so while the query can be submitted through something like Livy or the Thrift server, the concern is how we get the final result back in a useful form.

I could think of two ways of doing that. A global temp table would work, but that goes back to my first point: it seems a bit involved. My point was: has someone solved this problem and run through all the steps?

- Affan

On Thu, Oct 25, 2018 at 12:39 PM Femi Anthony <olufemi.anth...@capitalone.com> wrote:
> What sort of environment are you running Spark on: in the cloud, or on
> premise? Is it a real-time or batch-oriented application?
> Please provide more details.
>
> Femi
>
> On Thu, Oct 25, 2018 at 3:29 AM Affan Syed wrote:
>>
>> Spark users,
>>
>> We really would want to get input here about how the results from a Spark
>> query will be accessible to a web application. Given Spark is well used in
>> the industry, I would have thought this part would have lots of
>> answers/tutorials, but I didn't find anything.
>>
>> Here are a few options that come to mind:
>>
>> 1) Spark results are saved in another DB (perhaps a traditional one), and
>> a request for the query returns the new table name for access through a
>> paginated query. That seems doable, although a bit convoluted, as we need
>> to handle the completion of the query.
>>
>> 2) Spark results are pumped into a messaging queue from which a socket
>> server-like connection is made.
>>
>> What confuses me is that other connectors to Spark, like those for
>> Tableau, using something like JDBC, should have all the data (not the top
>> 500 rows we typically get via Livy or other REST interfaces to Spark).
>> How do those connectors get all the data through a single connection?
>>
>> Can someone with expertise help in bringing clarity?
>>
>> Thank you.
>>
>> Affan
>
> --
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
Fwd: Having access to spark results
What about using cache() or saving as a global temp table for subsequent access?

Sent using Zoho Mail

---------- Forwarded message ----------
From: Affan Syed
To: "spark users"
Date: Thu, 25 Oct 2018 10:58:43 +0330
Subject: Having access to spark results

Spark users,

We really would want to get input here about how the results from a Spark query will be accessible to a web application. Given Spark is well used in the industry, I would have thought this part would have lots of answers/tutorials, but I didn't find anything.

Here are a few options that come to mind:

1) Spark results are saved in another DB (perhaps a traditional one), and a request for the query returns the new table name for access through a paginated query. That seems doable, although a bit convoluted, as we need to handle the completion of the query.

2) Spark results are pumped into a messaging queue from which a socket server-like connection is made.

What confuses me is that other connectors to Spark, like those for Tableau, using something like JDBC, should have all the data (not the top 500 rows we typically get via Livy or other REST interfaces to Spark). How do those connectors get all the data through a single connection?

Can someone with expertise help in bringing clarity?

Thank you.

Affan
Having access to spark results
Spark users,

We really would want to get input here about how the results from a Spark query will be accessible to a web application. Given Spark is well used in the industry, I would have thought this part would have lots of answers/tutorials, but I didn't find anything.

Here are a few options that come to mind:

1) Spark results are saved in another DB (perhaps a traditional one), and a request for the query returns the new table name for access through a paginated query. That seems doable, although a bit convoluted, as we need to handle the completion of the query.

2) Spark results are pumped into a messaging queue from which a socket server-like connection is made.

What confuses me is that other connectors to Spark, like those for Tableau, using something like JDBC, should have all the data (not the top 500 rows we typically get via Livy or other REST interfaces to Spark). How do those connectors get all the data through a single connection?

Can someone with expertise help in bringing clarity?

Thank you.

Affan
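For option 1 above, once the full result set has been materialized somewhere the web tier can reach, the REST layer's remaining job is pagination. A trivial sketch of that serving side (illustrative names only, nothing Spark-specific):

```scala
// Illustrative sketch: a fully materialized result set handed out in pages,
// as a REST endpoint backing a paginated query would do.
final class ResultPager[A](rows: Vector[A]) {
  // Zero-based page number; a page past the end yields an empty Vector.
  def page(pageNo: Int, pageSize: Int): Vector[A] =
    rows.slice(pageNo * pageSize, (pageNo + 1) * pageSize)

  def total: Int = rows.size
}
```

The hard part the thread is really asking about sits before this sketch: getting the complete result out of Spark and into `rows` (or a serving store standing in for it) and knowing when the query has finished, which is why option 1 feels convoluted.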