Xudong and Rex,
Can you try 1.3.1? With PR 5339 http://github.com/apache/spark/pull/5339 ,
after we get a Hive Parquet table from the metastore and convert it to our
native Parquet code path, we will cache the converted relation. For now, the
first access to that Hive Parquet table reads all of the footers.
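The caching described above can be sketched generically: the expensive conversion (which reads every footer) runs once, and later lookups reuse the cached relation. A minimal, purely illustrative Python sketch; the names `convert_to_parquet_relation` and `get_relation` are hypothetical, not Spark APIs:

```python
# Illustrative sketch of caching a converted relation, keyed by table name.
# This is NOT Spark code; all names here are hypothetical.

_converted_cache = {}

def convert_to_parquet_relation(table_name, read_footers):
    """Simulate the expensive conversion: reads all footers once."""
    footers = read_footers(table_name)  # costly on first access
    return {"table": table_name, "num_footers": len(footers)}

def get_relation(table_name, read_footers):
    """Return the cached converted relation, converting on first access only."""
    if table_name not in _converted_cache:
        _converted_cache[table_name] = convert_to_parquet_relation(
            table_name, read_footers)
    return _converted_cache[table_name]
```

The first call pays the footer-reading cost; subsequent calls return the cached object without touching the files again.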
Yin,
Thanks for your reply.
We have already applied this PR to our 1.3.0 build.
As Xudong mentioned, we have thousands of Parquet files, so the first read
is very slow, and another app adds more files and refreshes the table
regularly.
Cheng Lian's PR 5334 seems able to resolve this issue; it will skip reading all
We have a similar issue with massive numbers of Parquet files. Cheng Lian,
could you have a look?
2015-04-08 15:47 GMT+08:00 Zheng, Xudong dong...@gmail.com:
Hi Cheng,
I tried both of these patches, but they still do not seem to resolve my issue.
I found that most of the time is spent on this line in newParquet.scala:
ParquetFileReader.readAllFootersInParallel(
sparkContext.hadoopConfiguration, seqAsJavaList(leaves), taskSideMetaData)
which needs to read all of the files.
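readAllFootersInParallel is linear in the number of files: every footer must still be opened, and parallelism only divides the wall-clock time by the pool size. A self-contained Python sketch of that pattern (read_footer here is a stand-in, not the real Parquet API):

```python
from concurrent.futures import ThreadPoolExecutor

def read_footer(path):
    # Stand-in for opening one Parquet file footer; the real reader
    # seeks to the file tail and parses the metadata block there.
    return {"path": path, "num_rows": 0}

def read_all_footers_in_parallel(paths, max_workers=8):
    # Every path is still visited once, so with thousands of files
    # this remains expensive even with a large thread pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_footer, paths))
```

With thousands of files the per-file open/seek dominates, which is why skipping the footer scan entirely (rather than parallelizing it) is the real fix.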
Hey Xudong,
We have been digging into this issue for a while, and believe PR 5339
http://github.com/apache/spark/pull/5339 and PR 5334
http://github.com/apache/spark/pull/5334 should fix it.
There are two problems:
1. Normally we cache Parquet table metadata for better performance, but
when