Re: spark optimized pagination

2018-06-11 Thread Teemu Heikkilä
So you are now providing the data on-demand through Spark? I suggest you change your API to query from Cassandra and store the results from Spark back there; that way you only have to process the whole dataset once, and Cassandra is suitable for that kind of workload. -T > On 10 Jun
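
A minimal PySpark sketch of that suggestion, assuming the DataStax spark-cassandra-connector is on the classpath; the keyspace, table, and source path are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("precompute-pages").getOrCreate()

    # process the whole dataset once ...
    events = spark.read.parquet("hdfs:///data/events")   # illustrative source
    results = events.groupBy("user_id").count()

    # ... and persist the results to Cassandra, so the API can page through
    # them with ordinary Cassandra queries instead of re-running Spark
    (results.write
        .format("org.apache.spark.sql.cassandra")
        .options(table="user_counts", keyspace="analytics")
        .mode("append")
        .save())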

[Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Aakash Basu
Hi, I have submitted a job on a 4-node cluster, where I see most of the operations happening at one of the worker nodes while the other two are simply chilling out. The picture below sheds light on that - how do I properly distribute the load? My cluster conf (4 node cluster [1 driver; 3 slaves]) -

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-11 Thread thomas lavocat
Thank you very much for your answer. Since I don't have dependent jobs, I will continue to use this functionality. On 05/06/2018 13:52, Saisai Shao wrote: "dependent" I mean this batch's job relies on the previous batch's result, so this batch should wait for the previous batch to finish, if
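
For reference, spark.streaming.concurrentJobs is set on the application's SparkConf and is read by the driver's job scheduler, so it acts as an application-wide value rather than a per-node one. A minimal sketch with illustrative values:

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("concurrent-batches")
            .set("spark.streaming.concurrentJobs", "4"))   # illustrative value

    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=5)            # 5-second batches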

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Jörn Franke
What is your code? Maybe this one does an operation which is bound to a single host, or your data volume is too small for multiple hosts. > On 11. Jun 2018, at 11:13, Aakash Basu wrote: > > Hi, > > I have submitted a job on a 4 node cluster, where I see most of the operations > happening at

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread akshay naidu
Try --num-executors 3 --executor-cores 4 --executor-memory 2G --conf spark.scheduler.mode=FAIR On Mon, Jun 11, 2018 at 2:43 PM, Aakash Basu wrote: > Hi, > > I have submitted a job on a 4 node cluster, where I see, most of the > operations happening at one of the worker nodes and other two are
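
The same settings expressed programmatically in PySpark (the values are the ones suggested above; whether they fit depends on the actual cluster and resource manager):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("balanced-load")
             .config("spark.executor.instances", "3")   # --num-executors 3 (YARN)
             .config("spark.executor.cores", "4")       # --executor-cores 4
             .config("spark.executor.memory", "2g")     # --executor-memory 2G
             .config("spark.scheduler.mode", "FAIR")
             .getOrCreate())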

Launch a pyspark Job From UI

2018-06-11 Thread srungarapu vamsi
Hi, I am looking for applications where we can trigger Spark jobs from a UI. Are there any such applications available? I have checked Spark-jobserver, using which we can expose an API to submit a Spark application. Are there any other alternatives using which I can submit PySpark jobs from a UI?

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Jörn Franke
If it is in kB then Spark will always schedule it to one node. As soon as it gets bigger you will see usage of more nodes. Hence increase your testing dataset. > On 11. Jun 2018, at 12:22, Aakash Basu wrote: > > Jorn - The code is a series of feature engineering and model tuning >
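
If the test dataset has to stay small, an alternative way to still see tasks land on more than one executor is to repartition it explicitly; a minimal sketch (partition count and path are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spread-small-data").getOrCreate()

    df = spark.read.csv("hdfs:///tmp/small_test.csv", header=True)
    df = df.repartition(12)              # more partitions than one node's cores
    print(df.rdd.getNumPartitions())     # verify the new partition count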

Re: Launch a pyspark Job From UI

2018-06-11 Thread hemant singh
You can explore Livy https://dzone.com/articles/quick-start-with-apache-livy On Mon, Jun 11, 2018 at 3:35 PM, srungarapu vamsi wrote: > Hi, > > I am looking for applications where we can trigger spark jobs from UI. > Are there any such applications available? > > I have checked Spark-jobserver
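
A minimal sketch of submitting a PySpark script as a Livy batch over its REST API (host, port, and paths are illustrative; assumes a running Livy server). A UI then only needs to issue these HTTP calls and poll the returned batch for its state:

    import json
    import requests

    livy_url = "http://livy-host:8998/batches"      # Livy's default port is 8998
    payload = {
        "file": "hdfs:///apps/my_job.py",           # the PySpark script to run
        "args": ["2018-06-11"],
        "conf": {"spark.executor.memory": "2g"},
    }

    resp = requests.post(livy_url,
                         data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    print(resp.json())                              # contains the batch id and state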

Re: Launch a pyspark Job From UI

2018-06-11 Thread Sathishkumar Manimoorthy
You can use Zeppelin as well https://zeppelin.apache.org/docs/latest/interpreter/spark.html Thanks, Sathish On Mon, Jun 11, 2018 at 4:25 PM, hemant singh wrote: > You can explore Livy https://dzone.com/articles/quick-start-with-apache-livy > > On Mon, Jun 11, 2018 at 3:35 PM, srungarapu

Re: Launch a pyspark Job From UI

2018-06-11 Thread uğur sopaoğlu
Dear Hemant, I have built a Spark cluster using Docker containers. Can I use Apache Livy to submit a job to the master node? hemant singh wrote (11 Jun 2018 13:55): > You can explore Livy https://dzone.com/articles/quick-start-with-apache-livy > >> On Mon, Jun 11, 2018 at 3:35 PM,

Visual PySpark Programming

2018-06-11 Thread srungarapu vamsi
Hi, I have the following use case and I did not find a suitable tool which can serve my purpose. Use case: Steps 1, 2 and 3 are UI-driven. Step 1) A user should be able to choose a data source (for example HDFS) and should be able to configure it so that it points to a file. Step 2) A user should be

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Aakash Basu
Hi Jorn/Others, Thanks for your help. Now the data is being distributed in a proper way, but the challenge is that after a certain point I'm getting this error, after which everything stops moving ahead - 2018-06-11 18:14:56 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 192.168.49.39: Remote RPC

Re: spark optimized pagination

2018-06-11 Thread vaquar khan
Spark is a processing engine, not storage or a cache; you can dump your results back to Cassandra, and if you see latency then you can use a cache for the Spark results. In short, the answer is NO, Spark doesn't support or give any API that gives you cache-like storage. Directly reading from a dataset millions
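
To illustrate the distinction, cache()/persist() only keeps data for the lifetime of the Spark application that created it, which is why it is not a substitute for a store like Cassandra; a minimal sketch (source path is illustrative):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    df = spark.read.parquet("hdfs:///data/events")    # illustrative source
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()                                         # materialize the cache

    df.filter("user_id = 42").show()                   # served from the cached data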

re: streaming - kafka partition transition time (from state change logger)

2018-06-11 Thread Peter Liu
Hi there, Working on the streaming processing latency based on timestamps from Kafka, I have two quick general questions triggered by looking at the Kafka state change log file: (a) the partition state change from the OfflineReplica state to the OnlinePartition state seems to take more than 20

Re: Exception when closing SparkContext in Spark 2.3

2018-06-11 Thread umayr_nuna
My bad, it's EMR 5.14.0

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Vamshi Talla
Aakash, Like Jorn suggested, did you increase your test data set? If so, did you also update your executor-memory setting? It seems like you might be exceeding the executor memory threshold. Thanks, Vamshi Talla On Jun 11, 2018, at 8:54 AM, Aakash Basu
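
A minimal sketch of the memory knobs being discussed (values are illustrative; spark.executor.memoryOverhead applies on YARN and Kubernetes):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("bigger-executors")
             .config("spark.executor.memory", "4g")           # heap per executor
             .config("spark.executor.memoryOverhead", "1g")   # off-heap headroom (YARN/K8s)
             .getOrCreate())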

[Spark Streaming]: How do I apply window before filter?

2018-06-11 Thread Tejas Manohar
Hey friends, We're trying to make some batched computations run against an OLAP DB closer to "realtime". One of our more complex computations is a trigger that fires when event A occurs but event B does not within a given time period. Our experience with Spark is limited, but since Spark 2.3.0 just introduced
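
A minimal Structured Streaming sketch of windowing first and filtering the windowed aggregate afterwards (broker, topic, and column names are illustrative, the spark-sql-kafka-0-10 package is assumed to be available, and this does not implement the full "A but not B" trigger):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window

    spark = SparkSession.builder.appName("window-before-filter").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(value AS STRING) AS event_type", "timestamp"))

    # aggregate per 5-minute window first ...
    windowed = (events
                .withWatermark("timestamp", "10 minutes")
                .groupBy(window(col("timestamp"), "5 minutes"), col("event_type"))
                .count())

    # ... then filter on the windowed result rather than on the raw stream
    alerts = windowed.filter(col("event_type") == "A")

    query = (alerts.writeStream
             .outputMode("update")
             .format("console")
             .start())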

Re: GC- Yarn vs Standalone K8

2018-06-11 Thread Keith Chapman
Spark on EMR is configured to use the CMS GC, specifically the following flags: spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled
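
The flags quoted above, set programmatically as a sketch (the same string can equally go into spark-defaults.conf); whether CMS beats the default collector depends on heap size and workload:

    from pyspark.sql import SparkSession

    gc_flags = ("-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps "
                "-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 "
                "-XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled")

    spark = (SparkSession.builder
             .appName("cms-gc")
             .config("spark.executor.extraJavaOptions", gc_flags)
             .getOrCreate())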

[ANNOUNCE] Announcing Apache Spark 2.3.1

2018-06-11 Thread Marcelo Vanzin
We are happy to announce the availability of Spark 2.3.1! Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3 maintenance branch of Spark. We strongly recommend all 2.3.x users to upgrade to this stable release. To download Spark 2.3.1, head over to the download page:

Exception when closing SparkContext in Spark 2.3

2018-06-11 Thread umayr_nuna
I'm running a Scala application in EMR 5.12.0 (S3, HDFS) with the following properties: --master yarn --deploy-mode cluster --driver-memory 13g --executor-memory 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf spark.dynamicAllocation.enabled=true --conf