Re: very slow parquet file write

2016-09-16 Thread tosaigan...@gmail.com
Hi,

Try this configuration:


val sc = new SparkContext(conf)
// skip generating the _metadata/_common_metadata summary files, which can
// dominate write time when a job produces many output part files
sc.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)
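
If the job is PySpark rather than Scala, a rough equivalent is to set the same
key on the underlying Hadoop configuration (a sketch, assuming an existing
SparkContext named sc; note that _jsc is an internal handle):

# sketch: same setting from PySpark, via the JVM Hadoop configuration
# (skips writing the _metadata/_common_metadata summary files)
sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")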


Regards,
Sai Ganesh

On Thu, Sep 15, 2016 at 11:34 PM, gaurav24 [via Apache Spark User List] <
ml-node+s1001560n27738...@n3.nabble.com> wrote:

> Hi Rok,
>
> I'm facing a similar issue with streaming, where I append to parquet data every
> hour. Writing seems to slow down each time it writes; it has gone from
> 17 minutes to 40 minutes in a month.
>





Re: very slow parquet file write

2015-11-14 Thread Sabarish Sasidharan
How are you writing it out? Can you post some code?

Regards
Sab
On 14-Nov-2015 5:21 am, "Rok Roskar"  wrote:

> I'm not sure what you mean? I didn't do anything specifically to partition
> the columns
> On Nov 14, 2015 00:38, "Davies Liu"  wrote:
>
>> Do you have partitioned columns?
>>
>> On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar  wrote:
>> > I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions
>> into a
>> > parquet file on HDFS. I've got a few hundred nodes in the cluster, so
>> for
>> > the size of file this is way over-provisioned (I've tried it with fewer
>> > partitions and fewer nodes, no obvious effect). I was expecting the
>> dump to
>> > disk to be very fast -- the DataFrame is cached in memory and contains
>> just
>> > 14 columns (13 are floats and one is a string). When I write it out in
>> json
>> > format, this is indeed reasonably fast (though it still takes a few
>> minutes,
>> > which is longer than I would expect).
>> >
>> > However, when I try to write a parquet file it takes way longer -- the
>> first
>> > set of tasks finishes in a few minutes, but the subsequent tasks take
>> more
>> > than twice as long or longer. In the end it takes over half an hour to
>> write
>> > the file. I've looked at the disk I/O and cpu usage on the compute
>> nodes and
>> > it looks like the processors are fully loaded while the disk I/O is
>> > essentially zero for long periods of time. I don't see any obvious
>> garbage
>> > collection issues and there are no problems with memory.
>> >
>> > Any ideas on how to debug/fix this?
>> >
>> > Thanks!
>> >
>> >
>>
>


Re: very slow parquet file write

2015-11-13 Thread Davies Liu
Did you use any partitioned columns when writing as json or parquet?
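
(For reference, "partitioned columns" here means a write that goes through
partitionBy. A quick PySpark sketch of the difference, assuming df is the
DataFrame in question and using a made-up column name and output paths:)

# plain write: a single directory of part files
df.write.parquet("hdfs:///tmp/out_plain")

# partitioned write: one sub-directory per distinct value of the column,
# which means many more (smaller) files and more metadata to generate
df.write.partitionBy("date").parquet("hdfs:///tmp/out_partitioned")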

On Fri, Nov 6, 2015 at 6:53 AM, Rok Roskar  wrote:
> yes I was expecting that too because of all the metadata generation and
> compression. But I have not seen performance this bad for other parquet
> files I’ve written and was wondering if there could be something obvious
> (and wrong) to do with how I’ve specified the schema etc. It’s a very simple
> schema consisting of a StructType with a few StructField floats and a
> string. I’m using all the spark defaults for io compression.
>
> I'll see what I can do about running a profiler -- can you point me to a
> resource/example?
>
> Thanks,
>
> Rok
>
> ps: my post on the mailing list is still listed as not accepted by the
> mailing list:
> http://apache-spark-user-list.1001560.n3.nabble.com/very-slow-parquet-file-write-td25295.html
> -- none of your responses are there either. I am definitely subscribed to
> the list though (I get daily digests). Any clue how to fix it?
>
>
>
>
> On Nov 6, 2015, at 9:26 AM, Cheng Lian  wrote:
>
> I'd expect writing Parquet files slower than writing JSON files since
> Parquet involves more complicated encoders, but maybe not that slow. Would
> you mind to try to profile one Spark executor using tools like YJP to see
> what's the hotspot?
>
> Cheng
>
> On 11/6/15 7:34 AM, rok wrote:
>
> Apologies if this appears a second time!
>
> I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a
> parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
> the size of file this is way over-provisioned (I've tried it with fewer
> partitions and fewer nodes, no obvious effect). I was expecting the dump to
> disk to be very fast -- the DataFrame is cached in memory and contains just
> 14 columns (13 are floats and one is a string). When I write it out in json
> format, this is indeed reasonably fast (though it still takes a few minutes,
> which is longer than I would expect).
>
> However, when I try to write a parquet file it takes way longer -- the first
> set of tasks finishes in a few minutes, but the subsequent tasks take more
> than twice as long or longer. In the end it takes over half an hour to write
> the file. I've looked at the disk I/O and cpu usage on the compute nodes and
> it looks like the processors are fully loaded while the disk I/O is
> essentially zero for long periods of time. I don't see any obvious garbage
> collection issues and there are no problems with memory.
>
> Any ideas on how to debug/fix this?
>
> Thanks!
>
>
>




Re: very slow parquet file write

2015-11-13 Thread Rok Roskar
I'm not sure what you mean; I didn't do anything specifically to partition
the columns.
On Nov 14, 2015 00:38, "Davies Liu"  wrote:

> Do you have partitioned columns?
>
> On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar  wrote:
> > I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions
> into a
> > parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
> > the size of file this is way over-provisioned (I've tried it with fewer
> > partitions and fewer nodes, no obvious effect). I was expecting the dump
> to
> > disk to be very fast -- the DataFrame is cached in memory and contains
> just
> > 14 columns (13 are floats and one is a string). When I write it out in
> json
> > format, this is indeed reasonably fast (though it still takes a few
> minutes,
> > which is longer than I would expect).
> >
> > However, when I try to write a parquet file it takes way longer -- the
> first
> > set of tasks finishes in a few minutes, but the subsequent tasks take
> more
> > than twice as long or longer. In the end it takes over half an hour to
> write
> > the file. I've looked at the disk I/O and cpu usage on the compute nodes
> and
> > it looks like the processors are fully loaded while the disk I/O is
> > essentially zero for long periods of time. I don't see any obvious
> garbage
> > collection issues and there are no problems with memory.
> >
> > Any ideas on how to debug/fix this?
> >
> > Thanks!
> >
> >
>


Re: very slow parquet file write

2015-11-13 Thread Davies Liu
Do you have partitioned columns?

On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar  wrote:
> I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a
> parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
> the size of file this is way over-provisioned (I've tried it with fewer
> partitions and fewer nodes, no obvious effect). I was expecting the dump to
> disk to be very fast -- the DataFrame is cached in memory and contains just
> 14 columns (13 are floats and one is a string). When I write it out in json
> format, this is indeed reasonably fast (though it still takes a few minutes,
> which is longer than I would expect).
>
> However, when I try to write a parquet file it takes way longer -- the first
> set of tasks finishes in a few minutes, but the subsequent tasks take more
> than twice as long or longer. In the end it takes over half an hour to write
> the file. I've looked at the disk I/O and cpu usage on the compute nodes and
> it looks like the processors are fully loaded while the disk I/O is
> essentially zero for long periods of time. I don't see any obvious garbage
> collection issues and there are no problems with memory.
>
> Any ideas on how to debug/fix this?
>
> Thanks!
>
>




Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
I'd expect writing Parquet files to be slower than writing JSON files, since
Parquet involves more complicated encoders, but maybe not that slow.
Would you mind trying to profile one Spark executor with a tool like YJP
to see where the hotspot is?


Cheng

On 11/6/15 7:34 AM, rok wrote:

Apologies if this appears a second time!

I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a
parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
the size of file this is way over-provisioned (I've tried it with fewer
partitions and fewer nodes, no obvious effect). I was expecting the dump to
disk to be very fast -- the DataFrame is cached in memory and contains just
14 columns (13 are floats and one is a string). When I write it out in json
format, this is indeed reasonably fast (though it still takes a few minutes,
which is longer than I would expect).

However, when I try to write a parquet file it takes way longer -- the first
set of tasks finishes in a few minutes, but the subsequent tasks take more
than twice as long or longer. In the end it takes over half an hour to write
the file. I've looked at the disk I/O and cpu usage on the compute nodes and
it looks like the processors are fully loaded while the disk I/O is
essentially zero for long periods of time. I don't see any obvious garbage
collection issues and there are no problems with memory.

Any ideas on how to debug/fix this?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/very-slow-parquet-file-write-tp25295.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org








Re: very slow parquet file write

2015-11-06 Thread Jörn Franke
Are you using compression? Maybe some is activated by default in your
Hadoop environment?
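
(If it helps, the codec Spark uses for Parquet can be pinned explicitly to rule
compression in or out; a sketch, assuming the usual sqlContext, that df is the
DataFrame being written, and a made-up output path:)

# sketch: set the Parquet codec explicitly and compare write times
# (valid values include "uncompressed", "snappy", "gzip", "lzo")
sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
df.write.parquet("hdfs:///tmp/parquet_codec_test")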

> On 06 Nov 2015, at 00:34, rok  wrote:
> 
> Apologies if this appears a second time! 
> 
> I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a
> parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
> the size of file this is way over-provisioned (I've tried it with fewer
> partitions and fewer nodes, no obvious effect). I was expecting the dump to
> disk to be very fast -- the DataFrame is cached in memory and contains just
> 14 columns (13 are floats and one is a string). When I write it out in json
> format, this is indeed reasonably fast (though it still takes a few minutes,
> which is longer than I would expect). 
> 
> However, when I try to write a parquet file it takes way longer -- the first
> set of tasks finishes in a few minutes, but the subsequent tasks take more
> than twice as long or longer. In the end it takes over half an hour to write
> the file. I've looked at the disk I/O and cpu usage on the compute nodes and
> it looks like the processors are fully loaded while the disk I/O is
> essentially zero for long periods of time. I don't see any obvious garbage
> collection issues and there are no problems with memory. 
> 
> Any ideas on how to debug/fix this? 
> 
> Thanks!
> 
> 
> 




Re: very slow parquet file write

2015-11-06 Thread Rok Roskar
Yes, I was expecting that too because of all the metadata generation and
compression. But I haven't seen performance this bad for other parquet
files I've written, and was wondering if there could be something obvious
(and wrong) in how I've specified the schema etc. It's a very simple
schema consisting of a StructType with a few StructField floats and
a string. I'm using all the spark defaults for io compression.
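
(For concreteness, a sketch of a schema of that shape -- the field names here
are made up:)

from pyspark.sql.types import StructType, StructField, FloatType, StringType

# 13 float columns plus one string column, as described above
schema = StructType(
    [StructField("f%d" % i, FloatType()) for i in range(13)]
    + [StructField("label", StringType())]
)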

I'll see what I can do about running a profiler -- can you point me to a
resource/example?

Thanks,

Rok

ps: my post on the mailing list is still listed as not accepted by the
mailing list:
http://apache-spark-user-list.1001560.n3.nabble.com/very-slow-parquet-file-write-td25295.html
-- none of your responses are there either. I am definitely subscribed to
the list though (I get daily digests). Any clue how to fix it?




On Nov 6, 2015, at 9:26 AM, Cheng Lian  wrote:

I'd expect writing Parquet files slower than writing JSON files since
Parquet involves more complicated encoders, but maybe not that slow. Would
you mind to try to profile one Spark executor using tools like YJP to see
what's the hotspot?

Cheng

On 11/6/15 7:34 AM, rok wrote:

Apologies if this appears a second time!

I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a
parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
the size of file this is way over-provisioned (I've tried it with fewer
partitions and fewer nodes, no obvious effect). I was expecting the dump to
disk to be very fast -- the DataFrame is cached in memory and contains just
14 columns (13 are floats and one is a string). When I write it out in json
format, this is indeed reasonably fast (though it still takes a few minutes,
which is longer than I would expect).

However, when I try to write a parquet file it takes way longer -- the first
set of tasks finishes in a few minutes, but the subsequent tasks take more
than twice as long or longer. In the end it takes over half an hour to write
the file. I've looked at the disk I/O and cpu usage on the compute nodes and
it looks like the processors are fully loaded while the disk I/O is
essentially zero for long periods of time. I don't see any obvious garbage
collection issues and there are no problems with memory.

Any ideas on how to debug/fix this?

Thanks!





Re: very slow parquet file write

2015-11-06 Thread Cheng Lian



On 11/6/15 10:53 PM, Rok Roskar wrote:
yes I was expecting that too because of all the metadata generation 
and compression. But I have not seen performance this bad for other 
parquet files I’ve written and was wondering if there could be 
something obvious (and wrong) to do with how I’ve specified the schema 
etc. It’s a very simple schema consisting of a StructType with a few 
StructField floats and a string. I’m using all the spark defaults for 
io compression.


I'll see what I can do about running a profiler -- can you point me to 
a resource/example?
This link is probably helpful: 
https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit
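
(For reference, a minimal way to hook the agent into the executors is through
the executor JVM options; the install path and agent options below are
assumptions, so check them against the wiki page above:)

from pyspark import SparkConf

# sketch: load the YourKit agent in every executor JVM
# (adjust the agentpath to wherever YourKit is installed on the worker nodes)
conf = SparkConf().set(
    "spark.executor.extraJavaOptions",
    "-agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so=sampling")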


Thanks,

Rok

ps: my post on the mailing list is still listed as not accepted by the 
mailing list: 
http://apache-spark-user-list.1001560.n3.nabble.com/very-slow-parquet-file-write-td25295.html 
-- none of your responses are there either. I am definitely subscribed 
to the list though (I get daily digests). Any clue how to fix it?

Sorry, no idea :-/





On Nov 6, 2015, at 9:26 AM, Cheng Lian wrote:


I'd expect writing Parquet files slower than writing JSON files since 
Parquet involves more complicated encoders, but maybe not that slow. 
Would you mind to try to profile one Spark executor using tools like 
YJP to see what's the hotspot?


Cheng

On 11/6/15 7:34 AM, rok wrote:

Apologies if this appears a second time!

I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a
parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
the size of file this is way over-provisioned (I've tried it with fewer
partitions and fewer nodes, no obvious effect). I was expecting the dump to
disk to be very fast -- the DataFrame is cached in memory and contains just
14 columns (13 are floats and one is a string). When I write it out in json
format, this is indeed reasonably fast (though it still takes a few minutes,
which is longer than I would expect).

However, when I try to write a parquet file it takes way longer -- the first
set of tasks finishes in a few minutes, but the subsequent tasks take more
than twice as long or longer. In the end it takes over half an hour to write
the file. I've looked at the disk I/O and cpu usage on the compute nodes and
it looks like the processors are fully loaded while the disk I/O is
essentially zero for long periods of time. I don't see any obvious garbage
collection issues and there are no problems with memory.

Any ideas on how to debug/fix this?

Thanks!


