Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram
Thanks, Vadim! That helps and makes sense. I don't think we have a number of keys so large that we have to worry about it. If we do, I think I would go with an approach similar to what you suggested. Thanks again, Subhash Sent from my iPhone > On Mar 8, 2018, at 11:56 AM, Vadim Semenov wrote

Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Vadim Semenov
You need to put the randomness at the beginning of the key; if you put it anywhere other than the beginning, you're not guaranteed good performance. The way we achieved this is by writing to HDFS first, and then having a custom DistCp implemented using Spark that copies parquet f
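
A minimal sketch of the prefix idea in Scala; the helper name and key layout are illustrative, not from the thread:

    import scala.util.Random

    // Prepend a short random hex bucket so sequential writes fan out
    // across S3's internal key-range partitions instead of one hot range.
    def randomizedKey(key: String, fanout: Int = 16): String = {
      val prefix = Random.nextInt(fanout).toHexString
      s"$prefix/$key"
    }

    randomizedKey("events/2018/03/08/part-00000.parquet")
    // e.g. "a/events/2018/03/08/part-00000.parquet"

The trade-off is that listing keys by date now requires scanning every prefix, which is one reason to apply it in a separate copy step out of HDFS as described above.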

Re: Spark <--> S3 flakiness

2017-05-18 Thread Steve Loughran
On 18 May 2017, at 05:29, lucas.g...@gmail.com wrote: Steve, just to clarify: "FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, especially if you are working with column data and can set the fs.s3a.ex

Re: Spark <--> S3 flakiness

2017-05-17 Thread lucas.g...@gmail.com
Steve, just to clarify: "FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, especially if you are working with column data and can set the fs.s3a.experimental.fadvise=random option. " Are you talking about the hadoop-aws lib or hadoop
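
For reference, a sketch of wiring that option through Spark, assuming Hadoop 2.8+ with the hadoop-aws module on the classpath (bucket name is illustrative):

    import org.apache.spark.sql.SparkSession

    // spark.hadoop.* settings are forwarded into the Hadoop Configuration,
    // which is where the S3A client reads fs.s3a.experimental.fadvise.
    val spark = SparkSession.builder()
      .appName("s3a-random-reads")
      .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
      .getOrCreate()

    val df = spark.read.parquet("s3a://my-bucket/tables/events/")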

Re: Spark <--> S3 flakiness

2017-05-17 Thread Steve Loughran
On 17 May 2017, at 06:00, lucas.g...@gmail.com wrote: Steve, thanks for the reply. Digging through all the documentation now. Much appreciated! FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, espe

Re: Spark <--> S3 flakiness

2017-05-16 Thread lucas.g...@gmail.com
Steve, thanks for the reply. Digging through all the documentation now. Much appreciated! On 16 May 2017 at 10:10, Steve Loughran wrote: > > On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote: > > Hi users, we have a bunch of pyspark jobs that are using S3 for loading / > intermediate step

Re: Spark <--> S3 flakiness

2017-05-16 Thread Steve Loughran
On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote: Hi users, we have a bunch of pyspark jobs that are using S3 for loading / intermediate steps and final output of parquet files. Please don't, not without a committer specially written to work against S3 in the

Re: Spark <--> S3 flakiness

2017-05-14 Thread Gourav Sengupta
Are you running EMR? On Sun, May 14, 2017 at 4:59 AM, Miguel Morales wrote: > Some things just didn't work as i had first expected it. For example, > when writing from a spark collection to an alluxio destination didn't > persist them to s3 automatically. > > I remember having to use the alluxi

Re: Spark <--> S3 flakiness

2017-05-13 Thread Miguel Morales
Some things just didn't work as I had first expected. For example, writing from a Spark collection to an Alluxio destination didn't persist the files to S3 automatically. I remember having to use the Alluxio library directly to force the files to persist to S3 after Spark finished writing to a
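
One way to avoid forcing persistence by hand is Alluxio's CACHE_THROUGH write type, which writes through to the under store (here S3) as Spark writes. A sketch, with illustrative host and paths, passing the property as a JVM option per Alluxio's Spark docs:

    // spark-submit \
    //   --conf spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH \
    //   --conf spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("alluxio-write-through").getOrCreate()
    val df = spark.read.parquet("alluxio://alluxio-master:19998/input/")
    df.write.parquet("alluxio://alluxio-master:19998/output/") // persisted to S3 via the UFS mount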

Re: Spark <--> S3 flakiness

2017-05-12 Thread Gene Pang
Hi, Yes, you can use Alluxio with Spark to read/write to S3. Here is a blog post on Spark + Alluxio + S3, and here is some documentation for configuring Alluxio + S3

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Interesting, the links here: http://spark.apache.org/community.html point to: http://apache-spark-user-list.1001560.n3.nabble.com/ On 11 May 2017 at 12:35, Vadim Semenov wrote: > Use the official mailing list archive > > http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/% > 3ccaj

Re: Spark <--> S3 flakiness

2017-05-11 Thread Vadim Semenov
Use the official mailing list archive http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/%3ccajyeq0gh1fbhbajb9gghognhqouogydba28lnn262hfzzgf...@mail.gmail.com%3e On Thu, May 11, 2017 at 2:50 PM, lucas.g...@gmail.com wrote: > Also, and this is unrelated to the actual question... Why

Re: Spark <--> S3 flakiness

2017-05-11 Thread Miguel Morales
Might want to try using gzip as opposed to parquet. The only way I ever reliably got parquet to work on S3 is by using Alluxio as a buffer, but it's a decent amount of work. On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com wrote: > Also, and this is unrelated to the actual question... Why
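
If you do fall back from Parquet to gzipped text, the codec can be passed straight to saveAsTextFile; a sketch, with illustrative paths and sc as provided by spark-shell:

    import org.apache.hadoop.io.compress.GzipCodec

    // Each output partition becomes one gzip-compressed text part file.
    val lines = sc.textFile("s3n://my-bucket/input/")
    lines.saveAsTextFile("s3n://my-bucket/output/", classOf[GzipCodec])

Bear in mind gzip parts aren't splittable, so each .gz file is read back by a single task.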

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Also, and this is unrelated to the actual question... Why don't these messages show up in the archive? http://apache-spark-user-list.1001560.n3.nabble.com/ Ideally I'd want to post a link to our internal wiki for these questions, but can't find them in the archive. On 11 May 2017 at 07:16, lucas

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Looks like this isn't viable in Spark 2.0.0 (and greater, I presume). I'm pretty sure I came across this blog and ignored it because of that. Any other thoughts? The linked tickets in: https://issues.apache.org/jira/browse/SPARK-10063 https://issues.apache.org/jira/browse/HADOOP-13786 https://issues.

Re: Spark <--> S3 flakiness

2017-05-10 Thread Miguel Morales
Try using the DirectParquetOutputCommitter: http://dev.sortable.com/spark-directparquetoutputcommitter/ On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com wrote: > Hi users, we have a bunch of pyspark jobs that are using S3 for loading / > intermediate steps and final output of parquet files.
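
On Spark 1.x this was enabled with a single conf; a sketch (the committer's package moved between 1.x releases, so check your version, and note SPARK-10063 elsewhere in the thread for its removal in 2.0):

    import org.apache.spark.{SparkConf, SparkContext}

    // Spark 1.x only: write Parquet output directly instead of using the
    // rename-based commit, which is slow and non-atomic on S3.
    val conf = new SparkConf()
      .setAppName("direct-committer")
      .set("spark.sql.parquet.output.committer.class",
           "org.apache.spark.sql.parquet.DirectParquetOutputCommitter") // 1.4/1.5 package
    val sc = new SparkContext(conf)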

Re: Spark S3

2016-10-11 Thread Abhinay Mehta
Hi Selvam, Is your 35GB parquet file split up into multiple S3 objects or just one big Parquet file? If it's just one big file, then I believe only one executor will be able to work on it until some job action partitions the data into smaller chunks. On 11 October 2016 at 06:03, Selvam Raman wr
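
A sketch of forcing that partitioning up front rather than waiting for a shuffle to happen naturally (path and partition count are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("spread-the-read").getOrCreate()
    val df = spark.read.parquet("s3a://my-bucket/big-table.parquet")
    // One explicit shuffle so later stages run across the whole cluster.
    val balanced = df.repartition(200)
    balanced.count()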

Re: Spark S3

2016-10-10 Thread Selvam Raman
I mentioned parquet as the input format. On Oct 10, 2016 11:06 PM, "ayan guha" wrote: > It really depends on the input format used. > On 11 Oct 2016 08:46, "Selvam Raman" wrote: > >> Hi, >> >> How does Spark read data from S3 and run parallel tasks? >> >> Assume I have an S3 bucket with a 35 GB (parquet

Re: Spark S3

2016-10-10 Thread ayan guha
It really depends on the input format used. On 11 Oct 2016 08:46, "Selvam Raman" wrote: > Hi, > > How does Spark read data from S3 and run parallel tasks? > > Assume I have an S3 bucket with a 35 GB (parquet file). > > How will the SparkSession read the data and process it in parallel? How > it sp

Re: Spark S3 Performance

2014-11-24 Thread Daniil Osipov
Can you verify that it's reading the entire file on each worker using network monitoring stats? If it does, that would be a bug in my opinion. On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe wrote: > Andrei, Ashish, > > To be clear, I don't think it's *counting* the entire file. It just seems > from

Re: Spark S3 Performance

2014-11-24 Thread Nitay Joffe
Andrei, Ashish, To be clear, I don't think it's *counting* the entire file. It just seems from the logging and the timing that it does a GET of the entire file, then figures out it only needs certain blocks, and does another GET of only the specific blocks. Regarding # partitions - I think I

Re: Spark S3 Performance

2014-11-22 Thread Andrei
Concerning your second question, I believe you're trying to set the number of partitions with something like this: rdd = sc.textFile(..., 8) but things like `textFile()` don't actually take a fixed number of partitions. Instead, they expect a *minimum* number of partitions. Since in your file you have 21 b
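
In other words, the second argument is only a floor; a sketch, with an illustrative path and sc as in spark-shell:

    // Ask for at least 8 partitions; with ~21 input blocks Spark still
    // creates roughly one partition per block, so expect ~21 here, not 8.
    val rdd = sc.textFile("s3n://my-bucket/data.txt", 8)
    println(rdd.partitions.length)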

Re: Spark S3 Performance

2014-11-22 Thread Ashish Rangole
What makes you think that each executor is reading the whole file? If that were the case, then the count value returned to the driver would be actual × NumOfExecutors. Is that the case when compared with the actual lines in the input file? If the count returned is the same as the actual, then you probably don't hav

Re: Spark S3 Performance

2014-11-22 Thread Andrei
Not that I'm a professional user of Amazon services, but I have a guess about your performance issues. From [1], there are two different filesystems over S3: - native, which behaves just like regular files (scheme: s3n) - block-based, which looks more like HDFS (scheme: s3). Since you use "s3n" in you
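
Concretely, only the URI scheme selects between the two clients; a sketch, with an illustrative bucket and sc as in spark-shell:

    // Native filesystem: objects are plain files other S3 tools can read.
    val native = sc.textFile("s3n://my-bucket/data.txt")
    // Block-based filesystem: HDFS-like blocks layered over the bucket.
    val blocked = sc.textFile("s3://my-bucket/data.txt")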

Re: Spark S3 Performance

2014-11-22 Thread Nitay Joffe
Err I meant #1 :) - Nitay Founder & CTO On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe wrote: > Anyone have any thoughts on this? Trying to understand especially #2 if > it's a legit bug or something I'm doing wrong. > > - Nitay > Founder & CTO > > > On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joff

Re: Spark S3 Performance

2014-11-22 Thread Nitay Joffe
Anyone have any thoughts on this? Trying to understand especially #2 if it's a legit bug or something I'm doing wrong. - Nitay Founder & CTO On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe wrote: > I have a simple S3 job to read a text file and do a line count. > Specifically I'm doing *sc.textF

Re: SPARK S3 LZO input; worker stuck

2014-07-13 Thread hassan
Interestingly, if I don't cache the data it works. However, as I need to re-use the data to apply different kinds of filtering, it really slows down the job since it needs to read from S3 again and again.
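
A sketch of the reuse pattern in question; persisting with a level that can spill to disk is one hedge against the memory-only default recomputing (and re-reading S3) under pressure. Paths and filters are illustrative, sc as in spark-shell:

    import org.apache.spark.storage.StorageLevel

    // Read once from S3, keep it on the cluster, then run several filters
    // without re-reading. MEMORY_AND_DISK spills rather than recomputes.
    val data = sc.textFile("s3n://my-bucket/logs/")
      .persist(StorageLevel.MEMORY_AND_DISK)
    val errors = data.filter(_.contains("ERROR")).count()
    val warnings = data.filter(_.contains("WARN")).count()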