Re: Upgrade to parquet 1.6.0

2015-06-12 Thread Eric Eijkelenboom
…upgraded Parquet to 1.7.0 (which is exactly the same as 1.6.0 with the package name renamed from com.twitter to org.apache.parquet) on the master branch recently.

Cheng

> On 6/12/15 6:16 PM, Eric Eijkelenboom wrote:
>> Hi
>> What is the reason that Spark…

Upgrade to parquet 1.6.0

2015-06-12 Thread Eric Eijkelenboom
Hi

What is the reason that Spark still ships with Parquet 1.6.0rc3? Newer Parquet versions are available (e.g. 1.6.0). Upgrading would fix problems with ‘spark.sql.parquet.filterPushdown’, which is currently disabled by default because of a bug in Parquet 1.6.0rc3.

Thanks! Eric
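For context, the setting the thread refers to can be flipped on explicitly once a fixed Parquet version is on the classpath. A minimal sketch against the Spark 1.x API (the app name, path, and `status` column are hypothetical, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("pushdown-example")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Disabled by default in Spark 1.x because of the Parquet 1.6.0rc3 bug
// mentioned above; opt in when your Parquet version is known to be safe.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

val df = sqlContext.parquetFile("/path/to/data") // hypothetical path
// With pushdown enabled, this predicate can be evaluated inside the
// Parquet reader, skipping row groups whose statistics rule it out.
df.filter(df("status") === "error").count()
```

With pushdown disabled, Spark reads every row group and filters afterwards; with it enabled, row-group statistics let Parquet skip data entirely.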

Re: Parquet number of partitions

2015-05-07 Thread Eric Eijkelenboom
…that Spark generates as many partitions as there are parquet files in the path.

> Q2: To reduce the number of partitions you can use rdd.repartition(x), where x = the desired number of partitions. Depending on your case, repartition could be a heavy task.
>
> Regards, Migue…
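The advice above can be sketched as follows; note that `coalesce` (used later in this thread) avoids the full shuffle that `repartition` triggers, which is why it is the cheaper choice when only *reducing* the partition count (the path is hypothetical):

```scala
// Assuming an existing sqlContext, as in the thread.
val df = sqlContext.parquetFile("/path/to/parquet") // hypothetical path

// Roughly one partition per input Parquet file:
println(df.rdd.partitions.length)

// coalesce merges existing partitions without a shuffle: cheap,
// but partition sizes may end up uneven.
val fewer = df.rdd.coalesce(200)

// repartition does a full shuffle: heavier, but yields evenly
// sized partitions.
val rebalanced = df.rdd.repartition(200)
```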

Parquet number of partitions

2015-05-05 Thread Eric Eijkelenboom
Hello guys

Q1: How does Spark determine the number of partitions when reading a Parquet file? val df = sqlContext.parquetFile(path) Is it in some way related to the number of Parquet row groups in my input?

Q2: How can I reduce this number of partitions? Doing this: df.rdd.coalesce(200).count f…

Re: Opening many Parquet files = slow

2015-04-13 Thread Eric Eijkelenboom
…(or any other action, I presume). 3. Run Spark 1.3.1-rc2: sqlContext.load() took about 30 minutes for 5000 Parquet files on S3, the same as 1.3.0. Any help would be greatly appreciated! Thanks a lot. Eric

> On 10 Apr 2015, at 16:46, Eric Eijkelenboom wrote: …

Opening many Parquet files = slow

2015-04-08 Thread Eric Eijkelenboom
…It looks like Spark is opening each file before it actually does any work. This means a delay of 25 minutes when working with Parquet files. Previously, we used LZO files and did not experience this problem. Bonus info: this also happens when I use automatic partition discovery (i.e. sqlContext.parquetFile("/path/to/logsroot/")). What can I do to avoid this? Thanks in advance! Eric Eijkelenboom
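A likely culprit for the per-file opening is schema merging, which reads every file's footer up front. A sketch of the mitigation available in later Spark versions than the 1.3.x used in this thread (the `mergeSchema` option exists on the DataFrame reader from roughly Spark 1.4/1.5 onward; this is an assumption about a newer API, not what the thread itself ran):

```scala
// Globally, assuming an existing sqlContext:
sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")

// Or per read, via the DataFrame reader API:
val df = sqlContext.read
  .option("mergeSchema", "false")
  .parquet("/path/to/logsroot/") // path from the thread
```

When all files share one schema, disabling merging lets Spark read a single footer instead of thousands, which matters especially on S3 where each open is a remote request.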