Spark shuffle: FileNotFound exception

2016-12-03 Thread Swapnil Shinde
Hello all, I am facing a FileNotFoundException for the shuffle index file when running a job with large data. The same job runs fine with smaller datasets. These are my cluster specifications: No of nodes - 19, Total cores - 380, Memory per executor - 32G, Spark 1.6 MapR version

Re: Unsubscribe

2016-12-03 Thread kote rao
unsubscribe From: S Malligarjunan Sent: Saturday, December 3, 2016 11:55:41 AM To: user@spark.apache.org Subject: Re: Unsubscribe Unsubscribe Thanks and Regards, Malligarjunan S. On Saturday, 3 December 2016, 20:42,

Unsubscribe

2016-12-03 Thread S Malligarjunan
Unsubscribe Thanks and Regards, Malligarjunan S.

Re: Unsubscribe

2016-12-03 Thread S Malligarjunan
Unsubscribe Thanks and Regards, Malligarjunan S. On Saturday, 3 December 2016, 20:42, Sivakumar S wrote: Unsubscribe

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
Ephemeral storage on SSD will be very painful to maintain, especially with large datasets. We will pretty soon have somewhere in the PB range. I am thinking of leveraging something like the link below, but I am not sure how much performance gain we could get out of it. https://github.com/stec-inc/EnhanceIO On Sat, Dec

Design patterns for Spark implementation

2016-12-03 Thread Vasu Gourabathina
Hi, I know this is a broad question. If this is not the right forum, I would appreciate it if you could point me to other sites/areas that may be helpful. Before posing this question, I did use our friend Google, but filtering the query results down to what I need hasn't been easy. Who I am: - Have done

Re: What benefits do we really get out of colocation?

2016-12-03 Thread vincent gromakowski
What about ephemeral storage on SSD? If performance is required, it's generally for production, so the cluster would never be stopped. Then a Spark job to back up/restore on S3 allows the cluster to be shut down completely. On 3 Dec 2016 1:28 PM, "David Mitchell" wrote:

Unsubscribe

2016-12-03 Thread Sivakumar S
Unsubscribe

Re: What benefits do we really get out of colocation?

2016-12-03 Thread David Mitchell
To get a node local read from Spark to Cassandra, one has to use a read consistency level of LOCAL_ONE. For some use cases, this is not an option. For example, if you need to use a read consistency level of LOCAL_QUORUM, as many use cases demand, then one is not going to get a node local read.
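A minimal sketch of how the consistency-level choice described above might be expressed as Spark configuration. It assumes the DataStax spark-cassandra-connector and its `spark.cassandra.input.consistency.level` property; the helper names are hypothetical, and the rule that only LOCAL_ONE permits a node-local read is taken from the post above rather than verified independently.

```python
# Sketch only: helper names are hypothetical, not part of any library.
# The property key is assumed to be the spark-cassandra-connector read setting.
READ_CONSISTENCY_KEY = "spark.cassandra.input.consistency.level"

# Per the post above, only LOCAL_ONE yields a node-local read.
NODE_LOCAL_LEVELS = {"LOCAL_ONE"}


def cassandra_read_conf(consistency_level):
    """Return Spark conf entries selecting the given read consistency level."""
    return {READ_CONSISTENCY_KEY: consistency_level.upper()}


def allows_node_local_read(consistency_level):
    """True if, according to the post, this level permits a node-local read."""
    return consistency_level.upper() in NODE_LOCAL_LEVELS


if __name__ == "__main__":
    for level in ("LOCAL_ONE", "LOCAL_QUORUM"):
        print(cassandra_read_conf(level), allows_node_local_read(level))
```

The resulting dictionary could be passed entry by entry to a Spark configuration object; a job that needs LOCAL_QUORUM would, by the argument above, give up the locality benefit of colocation.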

Parquet timestamp storage in Hive and possible use case of spark instead of impala

2016-12-03 Thread Mich Talebzadeh
Guys, this is my suggestion: use Spark SQL instead of Impala on Hive tables to get correct timestamp values all the time. The situation is explained below. I have come across a situation where a multi-tenant cluster is being used to read and write Parquet files. This causes some issues as

Re: What benefits do we really get out of colocation?

2016-12-03 Thread Steve Loughran
On 3 Dec 2016, at 09:16, Manish Malhotra wrote: Thanks for sharing the numbers as well! Nowadays even the network can have very high throughput and might outperform the disk, but as Sean mentioned, data on the network will have

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
Hmm, GCE pretty much seems to follow the same model as AWS. On Sat, Dec 3, 2016 at 1:22 AM, kant kodali wrote: > GCE seems to have better options. Has anyone had any experience with GCE? > > On Sat, Dec 3, 2016 at 1:16 AM, Manish Malhotra < > manish.malhotra.w...@gmail.com>

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
GCE seems to have better options. Has anyone had any experience with GCE? On Sat, Dec 3, 2016 at 1:16 AM, Manish Malhotra < manish.malhotra.w...@gmail.com> wrote: > Thanks for sharing the numbers as well! > > Nowadays even the network can have very high throughput and might > outperform the disk,

Re: What benefits do we really get out of colocation?

2016-12-03 Thread Manish Malhotra
Thanks for sharing the numbers as well! Nowadays even the network can have very high throughput and might outperform the disk, but as Sean mentioned, data on the network will have other dependencies such as network hops, for example if it crosses racks, which can have a switch in between. But yes people are

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
Forgot to mention that my entire cluster is in one DC. So if it is across multiple DCs, then colocating does make sense in theory as well. On Sat, Dec 3, 2016 at 1:12 AM, kant kodali wrote: > Thanks Sean! Just for the record I am currently seeing 95 MB/s RX (Receive > throughput

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
Thanks Sean! Just for the record, I am currently seeing 95 MB/s RX (receive throughput) on my Spark worker machine when I run `sudo iftop -B`. The problem with instance store on AWS is that it is all ephemeral, so placing Cassandra on top doesn't make a lot of sense. So, in short, AWS doesn't seem
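As a rough sanity check on the 95 MB/s figure reported above, the conversion to link utilization can be sketched as follows. This assumes decimal megabytes (10^6 bytes) and, purely for illustration, a hypothetical 1 Gbit/s network interface; the poster does not state the actual link speed.

```python
# Sketch only: the 1 Gbit/s link speed is an assumption, not from the post.
BITS_PER_BYTE = 8


def mbytes_to_mbits(mbytes_per_sec):
    """Convert a throughput in MB/s to Mbit/s (decimal units assumed)."""
    return mbytes_per_sec * BITS_PER_BYTE


def link_utilization(mbytes_per_sec, link_mbits_per_sec):
    """Fraction of the link capacity used by the observed throughput."""
    return mbytes_to_mbits(mbytes_per_sec) / link_mbits_per_sec


if __name__ == "__main__":
    observed = 95          # MB/s, as reported via `sudo iftop -B`
    assumed_link = 1000    # Mbit/s, hypothetical 1 Gbit/s NIC
    print(mbytes_to_mbits(observed))                 # 760 Mbit/s
    print(link_utilization(observed, assumed_link))  # 0.76
```

Under that assumption the observed traffic would occupy about three quarters of the link; on a faster interface the same figure would leave far more headroom.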

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
Wait, how is that a benefit? Isn't that a bad thing if you are saying colocating leads to more latency and overall execution time is longer? On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski < vincent.gromakow...@gmail.com> wrote: > You get more latency on reads so overall execution time is

Re: What benefits do we really get out of colocation?

2016-12-03 Thread vincent gromakowski
You get more latency on reads, so overall execution time is longer. On 3 Dec 2016 7:39 AM, "kant kodali" wrote: > > I wonder what benefits I really get if I colocate my Spark worker > process and Cassandra server process on each node? > > I understand the concept of