RE: OS killing Executor due to high (possibly off heap) memory usage

2016-11-24 Thread Shreya Agarwal
I don’t think it’s just memory overhead. It might be better to use an executor with less heap space (30GB?). 46 GB would mean more data loaded into memory and more GC, which can cause issues. Also, have you tried to persist data in any way? If so, then that might be causing an issue. Lastly, I
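
For reference, a minimal sketch of what a smaller-executor configuration could look like (Spark 2.x on YARN is assumed; the values are placeholders, not tuned recommendations):

    import org.apache.spark.sql.SparkSession

    // Sketch only: smaller executor heap plus explicit off-heap headroom,
    // instead of one large 46 GB heap. Values are placeholders.
    val spark = SparkSession.builder()
      .appName("smaller-executors")
      .config("spark.executor.memory", "30g")               // JVM heap per executor
      .config("spark.yarn.executor.memoryOverhead", "4096") // off-heap headroom in MB (YARN)
      .getOrCreate()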

RE: How to expose Spark-Shell in the production?

2016-11-23 Thread Shreya Agarwal
Use Livy or a job server to execute spark-shell commands remotely Sent from my Windows 10 phone From: kant kodali Sent: Saturday, November 19, 2016 12:57 AM To: user @spark Subject: How to expose Spark-Shell in the production? How to
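
As a rough illustration of the Livy route (a sketch only; the host, port, and session id are placeholders, and it assumes a Livy session has already been created via POST /sessions):

    import java.io.OutputStreamWriter
    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    // Post a snippet of code to an existing Livy session over its REST API.
    def postJson(urlStr: String, json: String): String = {
      val conn = new URL(urlStr).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      conn.setDoOutput(true)
      val out = new OutputStreamWriter(conn.getOutputStream)
      out.write(json)
      out.close()
      Source.fromInputStream(conn.getInputStream).mkString
    }

    // Runs "sc.version" remotely in Livy session 0.
    val response = postJson(
      "http://livy-host:8998/sessions/0/statements",
      """{"code": "sc.version"}"""
    )
    println(response)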

RE: Join Query

2016-11-20 Thread Shreya Agarwal
Replication join = broadcast join. Look for that term on Google. Many examples. Semi join can be done on DataFrames/Datasets by passing the join type (“leftsemi”) as the third parameter of the join/joinWith function. Not sure about the other two. Sent from my Windows 10 phone From: Aakash
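
A minimal sketch of both joins (the DataFrames and the key column "id" are invented for illustration; assumes spark-shell, where spark.implicits._ is already in scope):

    import org.apache.spark.sql.functions.broadcast

    val df1 = Seq((1, "a"), (2, "b")).toDF("id", "left")
    val df2 = Seq((1, "x")).toDF("id", "right")

    // Replication/broadcast join: the small side is shipped to every executor.
    val replicated = df1.join(broadcast(df2), "id")

    // Semi join: the join type goes in as the third argument.
    val semi = df1.join(df2, Seq("id"), "leftsemi")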

RE: HDPCD SPARK Certification Queries

2016-11-20 Thread Shreya Agarwal
Replication join = broadcast join. Look for that term on Google. Many examples. Semi join can be done on DataFrames/Datasets by passing the join type (“leftsemi”) as the third parameter of the join/joinWith function. Not sure about the other two. Sent from my Windows 10 phone From: Aakash

RE: Spark UI shows Jobs are processing, but the files are already written to S3

2016-11-16 Thread Shreya Agarwal
I think that is a bug. I have seen that a lot, especially with long-running jobs where Spark skips a lot of stages because it has pre-computed results. Some of these are never marked as completed, even though in reality they are. I figured this out because I was using the interactive shell

RE: AVRO File size when caching in-memory

2016-11-16 Thread Shreya Agarwal
Ah, yes. Nested schemas should be avoided if you want the best memory usage. Sent from my Windows 10 phone From: Prithish<mailto:prith...@gmail.com> Sent: Wednesday, November 16, 2016 12:48 AM To: Takeshi Yamamuro<mailto:linguin@gmail.com> Cc: Shreya Agarwal<mailto:shrey..
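
A sketch of what flattening could look like before caching (the DataFrame and column names below are invented for illustration):

    import org.apache.spark.sql.functions.col

    // nested: a placeholder DataFrame read from Avro with a struct column "payload".
    val flat = nested.select(
      col("id"),
      col("payload.name").as("payload_name"),
      col("payload.value").as("payload_value")
    )
    flat.cache()   // flat columns tend to compress better in the in-memory columnar cache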

RE: what is the optimized way to combine multiple dataframes into one dataframe ?

2016-11-15 Thread Shreya Agarwal
If you are reading all these datasets from files in persistent storage, functions like sc.textFile can take folders/patterns as input and read all of the matching files into the same RDD. Then you can convert it to a DataFrame. When you say it is time-consuming with union, how are you measuring
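
For example (the paths are placeholders; assumes sc/spark from spark-shell):

    // Both calls read every matching file into a single RDD/DataFrame, so no explicit union is needed.
    val rdd = sc.textFile("hdfs:///data/logs/2016/*/*.txt")
    val df  = spark.read.json("hdfs:///data/events/part-*")   // or .text / .parquet, etc.
    // Convert the RDD to a DataFrame afterwards if needed (e.g. via a case class and toDF).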

RE: AVRO File size when caching in-memory

2016-11-15 Thread Shreya Agarwal
, Shreya Sent from my Windows 10 phone From: Prithish<mailto:prith...@gmail.com> Sent: Tuesday, November 15, 2016 11:04 PM To: Shreya Agarwal<mailto:shrey...@microsoft.com> Subject: Re: AVRO File size when caching in-memory I did another test and noting my observations here. The

RE: Strongly Connected Components

2016-11-11 Thread Shreya Agarwal
Thanks for the detailed response ☺ I will try the things you mentioned! From: Daniel Darabos [mailto:daniel.dara...@lynxanalytics.com] Sent: Friday, November 11, 2016 4:59 PM To: Shreya Agarwal <shrey...@microsoft.com> Cc: Felix Cheung <felixcheun...@hotmail.com>; user@spark.apache.or

RE: Strongly Connected Components

2016-11-11 Thread Shreya Agarwal
) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) From: Shreya Agarwal Sent: Thursday, November 10, 2016 8:16 PM To: 'Felix Cheung' <felixcheun...@hotmail.com>; user@spark.apache.org Subje

RE: Dataset API | Setting number of partitions during join/groupBy

2016-11-11 Thread Shreya Agarwal
Curious – why do you want to repartition? Is there a subsequent step which fails because the number of partitions is too low? Or do you want to do it for a perf gain? Also, what were your initial Dataset partitions and how many did you have for the result of the join? From: Aniket Bhatnagar
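
Two common knobs, sketched with placeholder values (left/right stand in for the Datasets being joined; Spark 2.x spark-shell assumed):

    // Number of partitions produced by shuffles (joins, groupBy); the Spark SQL default is 200.
    spark.conf.set("spark.sql.shuffle.partitions", "400")
    val joined = left.join(right, "id")

    // Or repartition explicitly afterwards if a later step needs a specific count.
    val resized = joined.repartition(400)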

RE: Strongly Connected Components

2016-11-10 Thread Shreya Agarwal
; Shreya Agarwal <shrey...@microsoft.com> Subject: Re: Strongly Connected Components It is possible it is dead. Could you check the Spark UI to see if there is any progress? _ From: Shreya Agarwal <shrey...@microsoft.com<mailto:shrey...@microsoft.com>&

RE: Strongly Connected Components

2016-11-10 Thread Shreya Agarwal
Bump. Anyone? It's been running for 10 hours now. No results. From: Shreya Agarwal Sent: Tuesday, November 8, 2016 9:05 PM To: user@spark.apache.org Subject: Strongly Connected Components Hi, I am running this on a graph with >5B edges and >3B vertices and have 2 questions - 1.

RE: Re:RE: how to merge dataframe write output files

2016-11-10 Thread Shreya Agarwal
y I don't know the answer to this, but pretty sure there should be a way to work with fragmented files too. From: lk_spark [mailto:lk_sp...@163.com] Sent: Thursday, November 10, 2016 12:20 AM To: Shreya Agarwal <shrey...@microsoft.com> Cc: user.spark <user@spark.apache.org> Subj

RE: how to merge dataframe write output files

2016-11-09 Thread Shreya Agarwal
Is there a reason you want to merge the files? The reason you are getting errors (afaik) is that when you coalesce and then write, you are forcing all the content to reside on one executor, and the size of the data is exceeding the memory you have for storage in your executor, hence
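
If the goal is just fewer output files, a sketch of a middle ground (the target partition count and output path are placeholders):

    // Coalescing to a handful of partitions keeps the write spread across several
    // executors while still reducing the number of output files; coalesce(1)
    // funnels everything through a single task.
    df.coalesce(8)
      .write
      .parquet("hdfs:///out/merged")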

Strongly Connected Components

2016-11-08 Thread Shreya Agarwal
Hi, I am running this on a graph with >5B edges and >3B vertices and have 2 questions - 1. What is the optimal number of iterations? 2. I am running it for 1 iteration right now on a beefy 100-node cluster, with 300 executors each having 30GB RAM and 5 cores. I have persisted the graph to
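
For context, a minimal GraphX sketch (the edge-list path and the iteration count are placeholders; assumes spark-shell):

    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.storage.StorageLevel

    val graph = GraphLoader.edgeListFile(
      sc, "hdfs:///graphs/edges.txt",
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

    // Each vertex ends up labelled with the smallest vertex id in its strongly
    // connected component; the argument bounds how many iterations are run.
    val scc = graph.stronglyConnectedComponents(5)
    scc.vertices.take(10).foreach(println)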

RE: Anomalous Spark RDD persistence behavior

2016-11-07 Thread Shreya Agarwal
I don’t think this is correct. Unless you are serializing when caching to memory but not serializing when persisting to disk. Can you check? Also, I have seen the behavior where, if I have a 100 GB in-memory cache and I use 60 GB to persist something (MEMORY_AND_DISK), then try to persist another
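
The storage levels in question, as a sketch (rdd is a placeholder):

    import org.apache.spark.storage.StorageLevel

    // MEMORY_ONLY_SER keeps a serialized (smaller, but CPU-heavier) copy in memory;
    // MEMORY_AND_DISK keeps deserialized objects in memory and spills whole
    // partitions to disk when storage memory runs out.
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)
    // rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)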