> We upgraded Parquet to 1.7.0 (which is exactly the same as 1.6.0 with
> package name renamed from com.twitter to org.apache.parquet) on master branch
> recently.
>
> Cheng
>
> On 6/12/15 6:16 PM, Eric Eijkelenboom wrote:
Hi
What is the reason that Spark still comes with Parquet 1.6.0rc3? It seems like
newer Parquet versions are available (e.g. 1.6.0). This would fix problems with
‘spark.sql.parquet.filterPushdown’, which is currently disabled by default
because of a bug in Parquet 1.6.0rc3.
Thanks!
Eric
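
For anyone who wants to experiment with pushdown anyway, a minimal sketch for the Scala shell; the path and the status column are made up:

// Enable Parquet filter pushdown for this SQLContext (off by default here).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.parquetFile("/hypothetical/path/to/logs")
// With pushdown enabled, this filter can be evaluated inside the Parquet reader.
df.filter(df("status") === 200).count()
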
> Q1: Spark generates as many partitions as there are Parquet files in the path.
>
> Q2:
> To reduce the number of partitions you can use rdd.repartition(x), where x is the desired number
> of partitions. Depending on your case, repartition could be a heavy task.
>
>
> Regards.
> Migue
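
A minimal sketch of the two options Miguel describes, assuming a DataFrame read from Parquet (the path and the target of 200 partitions are placeholders):

val df = sqlContext.parquetFile("/hypothetical/path/to/parquet")
// repartition(x) performs a full shuffle and produces exactly x partitions.
val shuffled = df.rdd.repartition(200)
// coalesce(x) merges existing partitions without a full shuffle, so it is
// usually cheaper when you only want to reduce the partition count.
val merged = df.rdd.coalesce(200)
merged.count()  // any action triggers the job
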
Hello guys
Q1: How does Spark determine the number of partitions when reading a Parquet
file?
val df = sqlContext.parquetFile(path)
Is it some way related to the number of Parquet row groups in my input?
Q2: How can I reduce this number of partitions? Doing this:
df.rdd.coalesce(200).count
(or any other action, I presume).
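
A quick way to check what a given read actually produces, using the df defined above (this only reports the count, it does not explain how the splits are computed):

// Number of partitions Spark created for this read; compare it with the
// number of Parquet part-files (or row groups) under `path`.
println(df.rdd.partitions.length)
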
3. Run Spark 1.3.1-rc2.
sqlContext.load() took about 30 minutes for 5000 Parquet files on S3, the
same as 1.3.0.
Any help would be greatly appreciated!
Thanks a lot.
Eric
> On 10 Apr 2015, at 16:46, Eric Eijkelenboom wrote:
It looks like Spark is opening each file before it actually does any work.
This means a delay of 25 minutes when working with Parquet files. Previously,
we used LZO files and did not experience this problem.
Bonus info:
This also happens when I use auto partition discovery (i.e.
sqlContext.parquetFile("/path/to/logsroot/")).
What can I do to avoid this?
Thanks in advance!
Eric Eijkelenboom
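
For reference, a sketch of the directory layout that auto partition discovery expects; the date column and file names here are invented:

// Partition directories are named key=value:
//   /path/to/logsroot/date=2015-04-01/part-00000.parquet
//   /path/to/logsroot/date=2015-04-02/part-00000.parquet
// Reading the root discovers `date` as a partition column:
val logs = sqlContext.parquetFile("/path/to/logsroot/")
// Pointing at a single partition directory limits the files (and footers)
// Spark has to open to that directory:
val oneDay = sqlContext.parquetFile("/path/to/logsroot/date=2015-04-01/")
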