No, you don't want to design your ORC files so that they never cross
block boundaries. The engines in Hadoop (MapReduce, Tez, etc.) are all
built to handle the fact that files tend to cross blocks, and hence
nodes. There is value in lining up the stripe size and the HDFS block
size so that your stripes don't straddle blocks, but that has been on by
default since at least Hive 0.13.
Alan.
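
For what it's worth, here is a minimal sketch of how stripe size, block
size, and padding can be set explicitly when writing a file. It uses the
standalone ORC Java writer API (org.apache.orc); the output path and the
64MB/256MB sizes are illustrative assumptions, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class AlignedOrcWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    TypeDescription schema =
        TypeDescription.fromString("struct<id:bigint,name:string>");

    // A stripe size that divides the block size, plus block padding,
    // keeps every stripe inside a single HDFS block (padding trades a
    // little wasted space for never straddling a block boundary).
    Writer writer = OrcFile.createWriter(new Path("/tmp/aligned.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(schema)
            .stripeSize(64L * 1024 * 1024)   // 64MB stripes (illustrative)
            .blockSize(256L * 1024 * 1024)   // 256MB blocks (illustrative)
            .blockPadding(true)              // pad instead of straddling
            .compress(CompressionKind.ZLIB));
    writer.close();
  }
}

In Hive itself the equivalent knobs are table properties such as
orc.stripe.size and orc.block.padding, with padding on by default as
noted above.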
Demai Ni <nid...@gmail.com>
April 24, 2015 at 14:45
Hi, guys,
I am working on directly reading ORC files from an HDFS cluster, and
hope to leverage HDFS short-circuit local reads
(http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html)
as much as possible.
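
Short-circuit reads are a client/DataNode configuration rather than
anything ORC-specific. A minimal sketch of the two client-side keys from
the page linked above, assuming the DataNodes are configured with the
same domain socket path and libhadoop is available on both sides; the
socket path here is just an example value:

import org.apache.hadoop.conf.Configuration;

public class ShortCircuitClientConf {
  // Client-side keys for HDFS short-circuit local reads; the DataNodes
  // must share the same dfs.domain.socket.path.
  public static Configuration shortCircuitConf() {
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path",
        "/var/lib/hadoop-hdfs/dn_socket"); // example path
    return conf;
  }
}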
According to the ORC design, each ORC file usually contains several
stripes, and each stripe defaults to 250MB for efficient reads from
HDFS. With that, an ORC file can easily reach the GB level, made up of
several HDFS blocks. There is a good chance that
1) an ORC file spans several HDFS data nodes, and
2) a stripe spans two HDFS blocks and lands on two different physical
nodes.
With this in mind, should I design my ORC files to
1) contain only one stripe?
2) ensure (either by a larger HDFS block size or a smaller stripe size)
that each ORC file occupies only one HDFS block?
Does that look reasonable? Thanks
Demai
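
To check empirically how an existing file lays out before redesigning
anything, here is a rough sketch that cross-references each stripe's
byte range against the file's HDFS block size. It uses the standalone
ORC reader API (org.apache.orc); the command-line path argument is
hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StripeInformation;

public class StripeBlockCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]); // e.g. an ORC file on HDFS

    FileSystem fs = path.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(path);
    long blockSize = status.getBlockSize();

    Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(conf));
    for (StripeInformation stripe : reader.getStripes()) {
      long first = stripe.getOffset();
      long last = first + stripe.getLength() - 1;
      // A stripe straddles a block boundary when its first and last
      // bytes fall into different fixed-size block slots.
      boolean straddles = (first / blockSize) != (last / blockSize);
      System.out.printf("stripe @%d len=%d straddles=%b%n",
          first, stripe.getLength(), straddles);
    }
  }
}

If every stripe prints straddles=false, the writer's block padding has
already achieved what the single-stripe-per-block designs above were
trying to guarantee.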