Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread gpatcham
When I run spark.read.orc("hdfs://test").filter("conv_date = 20181025").count with "spark.sql.orc.filterPushdown=true" I see below in executors logs. Predicate push down is happening 18/11/01 17:31:17 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (IS_NULL conv_date) leaf-1 = (EQUALS

Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread Jörn Franke
A lot of small files is very inefficient itself and predicate push down will not help you much there unless you merge them into one large file (one large file can be much more efficiently processed). How did you validate that predicate pushdown did not work on Hive? You Hive Version is also

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham
spark version 2.2.0 Hive version 1.1.0 There are lot of small files Spark code : "spark.sql.orc.enabled": "true", "spark.sql.orc.filterPushdown": "true val logs =spark.read.schema(schema).orc("hdfs://test/date=201810").filter("date > 20181003") Hive: "spark.sql.orc.enabled": "true",

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread Jörn Franke
How large are they? A lot of (small) files will cause significant delay in progressing - try to merge as much as possible into one file. Can you please share full source code in Hive and Spark as well as the versions you are using? > Am 31.10.2018 um 18:23 schrieb gpatcham : > > > > When

Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham
When reading large number of orc files from HDFS under a directory spark doesn't launch any tasks until some amount of time and I don't see any tasks running during that time. I'm using below command to read orc and spark.sql configs. What spark is doing under hoods when spark.read.orc is