[ 
https://issues.apache.org/jira/browse/HIVE-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617407#comment-14617407
 ] 

Kevin Wilfong commented on HIVE-11188:
--------------------------------------

We did something like this in DWRF
https://github.com/facebook/hive-dwrf/commit/c9205c3894cb04453a790de28270d8118a87101d
https://github.com/facebook/hive-dwrf/commit/1e920729a0b3a6887194a54a070e2471fac947d2

> Make ORCFile's String Dictionary more efficient
> -----------------------------------------------
>
>                 Key: HIVE-11188
>                 URL: https://issues.apache.org/jira/browse/HIVE-11188
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 1.2.0, 1.1.0
>            Reporter: Zheng Shao
>
> Currently, ORCFile's String Dictionary uses StringRedBlackTree for 
> adding/finding/sorting duplicate strings.  When there are a large number of 
> unique strings (let's say over 16K) and a large number of rows (let's say 
> 1M), the binary search will take O(1M * log(16K)) time which can be very long.
> Alternatively, ORCFile's String Dictionary can use HashMap for adding/finding 
> duplicate strings, and use quicksort at the end to produce a sorted order.  
> In the same case above, the total time spent will be O(1M + 16K * log(16K)) 
> which is much faster.
> When the number of unique string is close to the number of rows (let's say, 
> both around 1M), ORC will automatically disable the dictionary encoding.  In 
> the old approach will take O(1M * log(1M)), and our new approach will take 
> O(1M) since we can skip the final quicksort if the dictionary encoding is 
> disabled.
> So in either case, the new approach should be a win.
> Here is an PMP output based on ~600 traces (so 126 means 126/600 ~= 21% of 
> total time). It's a query like "INSERT OVERWRITE TABLE target SELECT * FROM 
> src" using hive-1.1.0-cdh-5.4.1. target TABLE is STORED AS ORC, and src TABLE 
> is STORED AS RCFILE.
>     126  
> org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.compareValue(StringRedBlackTree.java:67)
>      35  java.util.zip.Deflater.deflateBytes(Native Method)
>      26  
> org.apache.hadoop.hive.ql.io.orc.SerializationUtils.findClosestNumBits(SerializationUtils.java:218)
>      24  
> org.apache.hadoop.hive.serde2.lazy.LazyNonPrimitive.isNull(LazyNonPrimitive.java:63)
>      22  org.apache.hadoop.hive.serde2.lazy.LazyMap.parse(LazyMap.java:204)
>      22  
> org.apache.hadoop.hive.serde2.lazy.LazyLong.parseLong(LazyLong.java:116)
>      21  
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:111)
>      19  
> org.apache.hadoop.hive.serde2.lazy.LazyPrimitive.hashCode(LazyPrimitive.java:57)
>      18  
> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.getRight(RedBlackTree.java:99)
>      16  
> org.apache.hadoop.hive.ql.io.RCFile$Reader.getCurrentRow(RCFile.java:1932)
>      15  
> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native 
> Method)
>      15  
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$IntegerTreeWriter.write(WriterImpl.java:929)
>      12  
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:1607)
>      12  org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
>      11  
> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.getLeft(RedBlackTree.java:92)
>      11  
> org.apache.hadoop.hive.ql.io.orc.DynamicIntArray.add(DynamicIntArray.java:105)
>      10  org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
>      ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to