How large are they? A lot of small files will cause a significant delay before 
the job makes progress; try to merge as many as possible into fewer, larger files.
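If rewriting the data is an option, a minimal compaction sketch (assuming a running SparkSession named `spark`; the output path `hdfs://test1-compacted` is a hypothetical placeholder):

```scala
// Read the many small ORC files and rewrite them as fewer, larger files.
// "hdfs://test1" follows the path from the question below.
val df = spark.read.orc("hdfs://test1")

// coalesce() reduces the number of output files without a full shuffle;
// pick a target count so each output file lands roughly in the
// 128 MB - 1 GB range for your dataset.
df.coalesce(100)
  .write
  .mode("overwrite")
  .orc("hdfs://test1-compacted")
```

With fewer, larger files the driver spends far less time listing files and computing splits, which is typically where that silent "no tasks running" phase goes.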

Can you please share the full source code for both Hive and Spark, as well as 
the versions you are using?
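On the Hive side, the predicate-pushdown properties from your message can be set per session before running the query (or globally in hive-site.xml); a sketch:

```sql
-- Enable predicate pushdown for this Hive session
-- (these can also be set once in hive-site.xml)
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
```

Note these are session/site-level Hive properties, so setting them as Spark configs will not affect queries run through Hive itself.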

> Am 31.10.2018 um 18:23 schrieb gpatcham <gpatc...@gmail.com>:
> 
> 
> 
> When reading a large number of ORC files from HDFS under a directory, Spark
> doesn't launch any tasks for some amount of time, and I don't see any tasks
> running during that period. I'm using the command below to read the ORC
> files, with these spark.sql configs.
> 
> What is Spark doing under the hood when spark.read.orc is issued?
> 
> spark.read.schema(schema1).orc("hdfs://test1").filter("date >= 20181001")
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true"
> 
> Also, instead of reading the ORC files directly, I tried running a Hive query
> on the same dataset, but I was not able to push down the filter predicate.
> Where should I set the configs below?
> "hive.optimize.ppd": "true",
> "hive.optimize.ppd.storage": "true"
> 
> What is the best way to read ORC files from HDFS, and which parameters
> should be tuned?
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
