The stripe size is too low. ORC maintains multiple buffers in memory. ORC's memory manager flushes a stripe when the in-memory data size (which includes the buffers in memory) is greater than the specified stripe size. This check happens after every 5000 rows.
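As a rough illustration of the flush check described above, here is a minimal Python sketch. It is a simplified model, not ORC's actual memory manager: the function names are hypothetical, the 5000-row check interval is taken from the description above, and the column, stream, and buffer figures are the ones discussed for the table in this thread. It deliberately ignores the growth of the row data itself and the extra internal buffers for string columns.

```python
ROWS_BETWEEN_CHECKS = 5000  # check interval stated in this thread


def estimated_buffer_bytes(num_columns, streams_per_column, buffer_bytes):
    """Lower bound on writer memory: every stream holds one buffer."""
    return num_columns * streams_per_column * buffer_bytes


def count_stripes(total_rows, stripe_size, buffered_bytes):
    """Count stripes in a simplified model of the flush check.

    Memory is compared with stripe_size only after every
    ROWS_BETWEEN_CHECKS rows; rows still pending at the end are
    written as one final stripe when the writer closes.
    """
    stripes = 0
    pending = 0
    remaining = total_rows
    while remaining > 0:
        batch = min(ROWS_BETWEEN_CHECKS, remaining)
        pending += batch
        remaining -= batch
        # The check only fires once a full interval of rows was added.
        if batch == ROWS_BETWEEN_CHECKS and buffered_bytes > stripe_size:
            stripes += 1
            pending = 0
    if pending:
        stripes += 1  # final flush on close
    return stripes


# Figures from this thread: 4 string columns, 5 streams each, 256KB buffers.
buffers = estimated_buffer_bytes(4, 5, 256 * 1024)  # 5242880 bytes
stripe_size = 1048576                               # orc.stripe.size = 1MB

# Hypothetical row count, chosen only to show the effect.
print(count_stripes(20000, stripe_size, buffers))   # 4
```

Because the buffers alone already exceed the 1MB stripe size, every check triggers a flush in this model, so the stripe count is simply the row count divided by 5000 (rounded up).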
This is what is happening in this case: there are 4 string columns. Each column has 5 streams, and each stream has a 256KB buffer. So even a single column takes more than 1MB (5 streams * 256KB buffer), which already exceeds the 1MB stripe size. There are additional internal buffers for the string type that also need to be taken into account. Therefore a stripe is flushed out after every 5000 rows. In this case #stripes = total_rows/5000.

I would recommend keeping the defaults for stripe size.

- Prasanth

On Tue, Dec 2, 2014 at 6:37 PM, Jim Green <openkbi...@gmail.com> wrote:

> Hi Team,
>
> I am creating this table:
> CREATE TABLE IF NOT EXISTS orctest2 (
>   id string,
>   id2 string,
>   id3 string,
>   id4 string
> )
> STORED AS ORC tblproperties
> ("orc.stripe.size"="1048576","orc.row.index.stride"="3333");
>
> The stripe size is set to 1MB.
> After loading data, the table file is about 60MB:
> -rwxr-xr-x 1 root root 61335650 Dec 2 18:08 000000_0
>
> However, it generated 1492 stripes and each stripe is about 40KB.
> Stripes:
>   Stripe: offset: 3 data: 39124 rows: 5000 tail: 68 index: 292
>     Stream: column 0 section ROW_INDEX start: 3 length 16
>     Stream: column 1 section ROW_INDEX start: 19 length 69
>     Stream: column 2 section ROW_INDEX start: 88 length 69
>     Stream: column 3 section ROW_INDEX start: 157 length 69
>     Stream: column 4 section ROW_INDEX start: 226 length 69
>     Stream: column 1 section DATA start: 295 length 9762
>     Stream: column 1 section LENGTH start: 10057 length 19
>     Stream: column 2 section DATA start: 10076 length 9762
>     Stream: column 2 section LENGTH start: 19838 length 19
>     Stream: column 3 section DATA start: 19857 length 9762
>     Stream: column 3 section LENGTH start: 29619 length 19
>     Stream: column 4 section DATA start: 29638 length 9762
>     Stream: column 4 section LENGTH start: 39400 length 19
>     Encoding column 0: DIRECT
>     Encoding column 1: DIRECT_V2
>     Encoding column 2: DIRECT_V2
>     Encoding column 3: DIRECT_V2
>     Encoding column 4: DIRECT_V2
>
> Does anybody know how ORC stripes are defined?
> Thanks.
> --
> Thanks,
> www.openkb.info
> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)