Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread gpatcham

When I run spark.read.orc("hdfs://test").filter("conv_date = 20181025").count
with "spark.sql.orc.filterPushdown=true", I see the lines below in the executor
logs. Predicate pushdown is happening:

18/11/01 17:31:17 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (IS_NULL conv_date)
leaf-1 = (EQUALS conv_date 20181025)
expr = (and (not leaf-0) leaf-1)
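
For readers unfamiliar with what such a predicate buys: ORC keeps min/max
statistics for each column, and a reader can use them to skip data that cannot
match. The following is a minimal sketch of that idea in plain Scala. It is a
simplified model, not ORC's or Spark's actual reader code, and the names
StripeStats, mightMatch and stripesToRead are hypothetical.

```scala
// Simplified model of min/max based stripe skipping for a predicate shaped like
//   expr = (and (not leaf-0) leaf-1)   with leaf-1 = (EQUALS column literal).
// Hypothetical names; this is an illustration, not ORC's implementation.

// minMax is None when the stripe holds no non-null values for the column.
final case class StripeStats(minMax: Option[(Long, Long)])

object StripeSkipping {
  // A stripe can only contain a non-null value equal to `literal`
  // if it has non-null values at all and `literal` lies inside [min, max].
  def mightMatch(stats: StripeStats, literal: Long): Boolean =
    stats.minMax.exists { case (lo, hi) => lo <= literal && literal <= hi }

  // Indices of the stripes that still have to be read; the rest can be skipped.
  def stripesToRead(stripes: Seq[StripeStats], literal: Long): Seq[Int] =
    stripes.zipWithIndex.collect { case (s, i) if mightMatch(s, literal) => i }
}
```

With three stripes covering 20181001..20181010, only nulls, and
20181020..20181031, a lookup for 20181025 would keep only the third stripe.
With many tiny files each file tends to hold a single small stripe, so there is
little left to skip inside a file.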


But when I run the Hive query in Spark, I see the logs below:

Hive table: Hive

spark.sql("select * from test where conv_date = 20181025").count

18/11/01 17:37:57 INFO HadoopRDD: Input split: hdfs://test/test1.orc:0+34568
18/11/01 17:37:57 INFO OrcRawRecordMerger: min key = null, max key = null
18/11/01 17:37:57 INFO ReaderImpl: Reading ORC rows from
hdfs://test/test1.orc with {include: [true, false, false, false, true,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false], offset: 0, length: 9223372036854775807}
18/11/01 17:37:57 INFO Executor: Finished task 224.0 in stage 0.0 (TID 33).
1662 bytes result sent to driver
18/11/01 17:37:57 INFO CoarseGrainedExecutorBackend: Got assigned task 40
18/11/01 17:37:57 INFO Executor: Running task 956.0 in stage 0.0 (TID 40)
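
The HadoopRDD and OrcRawRecordMerger lines above suggest that this query was
read through the Hive SerDe path rather than through Spark's own ORC data
source, which is where the "ORC pushdown predicate" line comes from. One thing
that may be worth trying is asking Spark to convert metastore ORC tables to its
own data source. This is a sketch under assumptions that nothing in this thread
confirms: that the setting spark.sql.hive.convertMetastoreOrc is available and
off by default in the Spark version in use, and that the table is a plain
(non-transactional) ORC table.

```scala
// spark-shell sketch; `spark` is the SparkSession that the shell provides.
// Assumption: the conversion is disabled by default in this Spark version.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

// Print the plans instead of running the job. A file-based ORC scan normally
// lists PushedFilters in the physical plan; a Hive table scan does not.
spark.sql("select * from test where conv_date = 20181025").explain(true)
```

If the physical plan still shows a Hive table scan, the conversion did not
apply to this table and the filter is evaluated after the rows are read.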





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread Jörn Franke

Having a lot of small files is very inefficient in itself, and predicate pushdown
will not help you much there unless you merge them into one large file (one large
file can be processed much more efficiently).
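
As a rough illustration of the merging step, the sketch below computes how many
output files a compaction job might aim for. The helper name targetFileCount
and the 256 MB default are assumptions made for this example, not values taken
from this thread; the right target depends on the cluster's HDFS block size.

```scala
// Hypothetical helper for planning a compaction of many small files.
object Compaction {
  // Assumed target size per output file (256 MB); tune for the cluster.
  val DefaultTargetBytes: Long = 256L * 1024 * 1024

  // Number of output files needed so that each is at most about `targetBytes`.
  def targetFileCount(fileSizes: Seq[Long],
                      targetBytes: Long = DefaultTargetBytes): Int = {
    require(targetBytes > 0, "targetBytes must be positive")
    val total = fileSizes.sum
    // Ceiling division, with at least one output file.
    math.max(1L, (total + targetBytes - 1) / targetBytes).toInt
  }
}
```

In spark-shell the result n could then be used roughly as
spark.read.orc(src).coalesce(n).write.orc(dst), where src and dst are
placeholder paths. A thousand files of about 34 KB each would collapse into a
single output file under this target.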

How did you validate that predicate pushdown did not work on Hive? Your Hive
version is also very old - consider upgrading to at least Hive 2.x.

> On 31.10.2018 at 20:35, gpatcham wrote:
> 
> Spark version 2.2.0
> Hive version 1.1.0
> 
> There are a lot of small files.
> 
> Spark code:
> 
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true"
> 
> val logs = spark.read.schema(schema).orc("hdfs://test/date=201810").filter("date > 20181003")
> 
> Hive:
> 
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true"
> 
> The test table in Hive points to hdfs://test/ and is partitioned on date.
> 
> val sqlStr = s"select * from test where date > 20181001"
> val logs = spark.sql(sqlStr)
> 
> With the Hive query I don't see filter pushdown happening. I tried setting
> these configs both in hive-site.xml and via spark.sqlContext.setConf:
> 
> "hive.optimize.ppd":"true",
> "hive.optimize.ppd.storage":"true"

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham

Spark version 2.2.0
Hive version 1.1.0

There are a lot of small files.

Spark code:

"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true"

val logs = spark.read.schema(schema).orc("hdfs://test/date=201810").filter("date > 20181003")

Hive:

"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true"

The test table in Hive points to hdfs://test/ and is partitioned on date.

val sqlStr = s"select * from test where date > 20181001"
val logs = spark.sql(sqlStr)

With the Hive query I don't see filter pushdown happening. I tried setting
these configs both in hive-site.xml and via spark.sqlContext.setConf:

"hive.optimize.ppd":"true",
"hive.optimize.ppd.storage":"true"




Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread Jörn Franke

How large are they? A lot of (small) files will cause a significant delay in
processing - try to merge as much as possible into one file.

Can you please share the full source code in Hive and Spark, as well as the
versions you are using?

> On 31.10.2018 at 18:23, gpatcham wrote:
> 
> When reading a large number of ORC files from a directory on HDFS, Spark
> doesn't launch any tasks for some time, and I don't see any tasks running
> during that period. I'm using the command and spark.sql configs below to
> read the ORC files.
> 
> What is Spark doing under the hood when spark.read.orc is issued?
> 
> spark.read.schema(schame1).orc("hdfs://test1").filter("date >= 20181001")
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true"
> 
> Also, instead of reading the ORC files directly, I tried running a Hive
> query on the same dataset, but I was not able to push down the filter
> predicate. Where should I set the configs below?
> "hive.optimize.ppd":"true",
> "hive.optimize.ppd.storage":"true"
> 
> Please suggest the best way to read ORC files from HDFS, and which
> parameters to tune.

Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham




When reading a large number of ORC files from a directory on HDFS, Spark
doesn't launch any tasks for some time, and I don't see any tasks running
during that period. I'm using the command and spark.sql configs below to
read the ORC files.

What is Spark doing under the hood when spark.read.orc is issued?

spark.read.schema(schame1).orc("hdfs://test1").filter("date >= 20181001")
"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true"

Also, instead of reading the ORC files directly, I tried running a Hive
query on the same dataset, but I was not able to push down the filter
predicate. Where should I set the configs below?
"hive.optimize.ppd":"true",
"hive.optimize.ppd.storage":"true"

Please suggest the best way to read ORC files from HDFS, and which
parameters to tune.




