[jira] [Commented] (HIVE-12450) OrcFileMergeOperator does not use correct compression buffer size
[ https://issues.apache.org/jira/browse/HIVE-12450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765887#comment-15765887 ]

roncenzhao commented on HIVE-12450:
-----------------------------------

Hi, [~prasanth_j], could you tell me which tool you used to get the result in 'zlib-hang.png'? Thanks~

> OrcFileMergeOperator does not use correct compression buffer size
> -----------------------------------------------------------------
>
>                 Key: HIVE-12450
>                 URL: https://issues.apache.org/jira/browse/HIVE-12450
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 1.2.0, 1.3.0, 1.2.1, 2.0.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Critical
>             Fix For: 1.3.0, 2.0.0
>
>         Attachments: HIVE-12450.1.patch, HIVE-12450.2.patch, HIVE-12450.3.patch, HIVE-12450.4.patch, zlib-hang.png
>
>
> OrcFileMergeOperator checks for compatibility before merging ORC files. This
> compatibility check includes checking the compression buffer size. But the
> output file that is created does not honor the compression buffer size and
> always defaults to 256KB. This will not be a problem when reading the ORC file
> but can create unwanted memory pressure because of wasted space within the
> compression buffer.
> This issue can also make the merged file unreadable in certain cases. For
> example, if the original compression buffer size is 8KB and
> hive.exec.orc.default.buffer.size is set to 4KB, the merge file operator will
> use 4KB instead of the actual 8KB, which can cause the ORC reader to hang (more
> specifically, ZlibCodec will wait for more compression buffers).
> {code:title=jstack output for hanging issue}
> "main" prio=5 tid=0x7fc07300 nid=0x1703 runnable [0x70218000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.zip.Inflater.inflateBytes(Native Method)
>         at java.util.zip.Inflater.inflate(Inflater.java:259)
>         - locked <0x0007f5d5fdc8> (a java.util.zip.ZStreamRef)
>         at org.apache.hadoop.hive.ql.io.orc.ZlibCodec.decompress(ZlibCodec.java:94)
>         at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:238)
>         at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:262)
>         at java.io.InputStream.read(InputStream.java:101)
>         at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
>         at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
>         at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10661)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10625)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10730)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10725)
>         at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
>         at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
>         at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:10958)
>         at org.apache.hadoop.hive.ql.io.orc.MetadataReaderImpl.readStripeFooter(MetadataReaderImpl.java:114)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:240)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:847)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:818)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1033)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1068)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:217)
>         at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:638)
>         at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rows(ReaderImpl.java:625)
>         at org.apache.hadoop.hive.ql.io.orc.FileDump.printMetaData(FileDump.java:162)
>         at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:110)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>         at
[jira] [Commented] (HIVE-7847) query orc partitioned table fail when table column type change
[ https://issues.apache.org/jira/browse/HIVE-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728391#comment-15728391 ]

roncenzhao commented on HIVE-7847:
----------------------------------

Hi, [~wzc1989], has this problem been solved or not? Thanks~

> query orc partitioned table fail when table column type change
> --------------------------------------------------------------
>
>                 Key: HIVE-7847
>                 URL: https://issues.apache.org/jira/browse/HIVE-7847
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats
>    Affects Versions: 0.11.0, 0.12.0, 0.13.0
>            Reporter: Zhichun Wu
>            Assignee: Zhichun Wu
>         Attachments: HIVE-7847.1.patch, vector_alter_partition_change_col.q
>
>
> I use the following script to test an ORC column type change with a partitioned
> table on branch-0.13:
> {code}
> use test;
> DROP TABLE if exists orc_change_type_staging;
> DROP TABLE if exists orc_change_type;
> CREATE TABLE orc_change_type_staging (
>     id int
> );
> CREATE TABLE orc_change_type (
>     id int
> ) PARTITIONED BY (`dt` string)
> stored as orc;
> --- load staging table
> LOAD DATA LOCAL INPATH '../hive/examples/files/int.txt' OVERWRITE INTO TABLE orc_change_type_staging;
> --- populate orc hive table
> INSERT OVERWRITE TABLE orc_change_type partition(dt='20140718') select * FROM orc_change_type_staging limit 1;
> --- change column id from int to bigint
> ALTER TABLE orc_change_type CHANGE id id bigint;
> INSERT OVERWRITE TABLE orc_change_type partition(dt='20140719') select * FROM orc_change_type_staging limit 1;
> SELECT id FROM orc_change_type where dt between '20140718' and '20140719';
> {code}
> It fails in the last query "SELECT id FROM orc_change_type where dt between
> '20140718' and '20140719';" with the exception:
> {code}
> Error: java.io.IOException: java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.LongWritable
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:256)
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:171)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:197)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:183)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
> Caused by: java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.LongWritable
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:344)
>     at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
>     at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:122)
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:254)
>     ... 11 more
> Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.LongWritable
>     at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$LongTreeReader.next(RecordReaderImpl.java:717)
>     at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1788)
>     at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2997)
>     at
[jira] [Updated] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roncenzhao updated HIVE-14797:
------------------------------
    Attachment: HIVE-14797.4.patch

Resolves the problem with running on Spark/Tez.

> reducer number estimating may lead to data skew
> -----------------------------------------------
>
>                 Key: HIVE-14797
>                 URL: https://issues.apache.org/jira/browse/HIVE-14797
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: roncenzhao
>            Assignee: roncenzhao
>         Attachments: HIVE-14797.2.patch, HIVE-14797.3.patch, HIVE-14797.4.patch, HIVE-14797.patch
>
>
> HiveKey's hash code is generated by multiplying by 31, key by key, which is
> implemented in the method `ObjectInspectorUtils.getBucketHashCode()`:
>     for (int i = 0; i < bucketFields.length; i++) {
>       int fieldHash = ObjectInspectorUtils.hashCode(bucketFields[i], bucketFieldInspectors[i]);
>       hashCode = 31 * hashCode + fieldHash;
>     }
> The following example will lead to data skew:
> I have two tables called tbl1 and tbl2, and they have the same columns: a int, b
> string. The values of column 'a' in both tables are not skewed, but the values
> of column 'b' in both tables are skewed.
> When my SQL is "select * from tbl1 join tbl2 on tbl1.a=tbl2.a and
> tbl1.b=tbl2.b" and the estimated reducer number is 31, it will lead to data
> skew.
> As we know, the HiveKey's hash code is generated by `hash(a)*31 + hash(b)`.
> When the reducer number is 31, the reducer No. of each row is `hash(b)%31`,
> because `hash(a)*31` is a multiple of 31 and, barring integer overflow, drops
> out of the modulo. As a result, the job will be skewed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
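The arithmetic behind the skew reported in HIVE-14797 can be checked with a small sketch. The class and method names (`SkewDemo`, `combinedHash`, `reducerFor`) are hypothetical and are not Hive code. The routing rule `(hash & Integer.MAX_VALUE) % numReducers` is an assumption modeled on Hadoop's `HashPartitioner`, and the hash values are kept small so that integer overflow does not come into play.

```java
// Hypothetical demo, not Hive code: shows why a multiplier of 31 combined
// with 31 reducers makes the first join key irrelevant for routing.
final class SkewDemo {
    // Same shape as the loop quoted in the issue:
    // hashCode = 31 * hashCode + fieldHash.
    static int combinedHash(int[] fieldHashes) {
        int hashCode = 0;
        for (int fieldHash : fieldHashes) {
            hashCode = 31 * hashCode + fieldHash;
        }
        return hashCode;
    }

    // Assumed routing rule (HashPartitioner style): clear the sign bit,
    // then take the remainder by the number of reducers.
    static int reducerFor(int hashCode, int numReducers) {
        return (hashCode & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int hashB = 7; // stands in for one hot value of column b
        for (int hashA = 0; hashA < 5; hashA++) {
            int h = combinedHash(new int[] {hashA, hashB});
            System.out.println("hash(a)=" + hashA
                + " -> reducer with 31: " + reducerFor(h, 31)
                + ", reducer with 32: " + reducerFor(h, 32));
        }
    }
}
```

With 31 reducers every row in this run lands on reducer 7 regardless of `hash(a)`, while with 32 reducers the same rows spread over reducers 7, 6, 5, 4 and 3.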
[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584262#comment-15584262 ]

roncenzhao commented on HIVE-14797:
-----------------------------------

Hi, [~lirui], I have resolved this problem in the new patch. Please check it. Thanks~
[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559171#comment-15559171 ]

roncenzhao commented on HIVE-14797:
-----------------------------------

Is there anyone who can review this patch? thanks~
[jira] [Updated] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roncenzhao updated HIVE-14797:
------------------------------
    Status: Patch Available  (was: In Progress)
[jira] [Updated] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roncenzhao updated HIVE-14797:
------------------------------
    Status: In Progress  (was: Patch Available)
[jira] [Comment Edited] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511843#comment-15511843 ]

roncenzhao edited comment on HIVE-14797 at 9/23/16 8:04 AM:
------------------------------------------------------------

I don't think they are related to my patch. The failing test cases ran successfully on my own machine.

was (Author: roncenzhao):
I think they are not related to my patch. The failing test cases ran successfully on my own machine.
[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511843#comment-15511843 ]

roncenzhao commented on HIVE-14797:
-----------------------------------

I think they are not related to my patch. The failing test cases ran successfully on my own machine.
[jira] [Updated] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roncenzhao updated HIVE-14797:
------------------------------
    Attachment: HIVE-14797.3.patch

Remove some code duplication
[jira] [Updated] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roncenzhao updated HIVE-14797:
------------------------------
    Attachment: HIVE-14797.2.patch

Let the seed have two options, 31 and 131, with 31 as the default. In `ReduceSinkOperator` we get the reducer number from `hconf` and set the seed to 131 if `reduceNum` is equal to 31.
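A minimal sketch of the seed selection described in this update, assuming the seed is simply the multiplier used in the hash loop. The names `HashSeedChooser`, `chooseSeed` and `combinedHash` are hypothetical and do not come from the patch, and the reducer routing `(hash & Integer.MAX_VALUE) % reduceNum` is an assumption modeled on Hadoop's `HashPartitioner`.

```java
// Hypothetical sketch of the seed choice described above; not the patch's code.
final class HashSeedChooser {
    static final int DEFAULT_SEED = 31;
    static final int ALTERNATE_SEED = 131;

    // Switch to 131 only when the reducer count equals the default seed.
    static int chooseSeed(int reduceNum) {
        return reduceNum == DEFAULT_SEED ? ALTERNATE_SEED : DEFAULT_SEED;
    }

    // Same loop shape as the quoted getBucketHashCode() snippet,
    // with the multiplier passed in instead of hard-coded.
    static int combinedHash(int[] fieldHashes, int seed) {
        int hashCode = 0;
        for (int fieldHash : fieldHashes) {
            hashCode = seed * hashCode + fieldHash;
        }
        return hashCode;
    }

    public static void main(String[] args) {
        int reduceNum = 31;
        int seed = chooseSeed(reduceNum);
        for (int hashA = 0; hashA < 4; hashA++) {
            int h = combinedHash(new int[] {hashA, 7}, seed);
            System.out.println("hash(a)=" + hashA + " -> reducer "
                + (h & Integer.MAX_VALUE) % reduceNum);
        }
    }
}
```

With 31 reducers and the seed switched to 131, the four sample rows go to reducers 7, 14, 21 and 28 instead of all landing on reducer 7. The sketch only covers a reducer count of exactly 31, as the update describes.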
[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508756#comment-15508756 ]

roncenzhao commented on HIVE-14797:
-----------------------------------

Alternatively, we can do the following: let the seed have two options, 31 and 131. In `ReduceSinkOperator` we can get the reducer number, named `reduceNum`, and then choose the other value if `reduceNum` is equal to 31 or 131. Is that OK?
[jira] [Updated] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roncenzhao updated HIVE-14797:
------------------------------
    Status: Patch Available  (was: Open)
[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508405#comment-15508405 ]

roncenzhao commented on HIVE-14797:
-----------------------------------

Yes, we cannot hard-code the number (31). But we cannot know which number to set before the end of the job. So I think we can solve it easily in the following way: in the method "Utilities.estimateReducers(xxx)", when the `reducers` value is divisible by 31, we add 1 to it.
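The adjustment proposed in this comment could look roughly like the sketch below. `ReducerEstimateAdjuster` and `adjust` are hypothetical names; the sketch is not the body of `Utilities.estimateReducers` and only shows the final "add 1 when divisible by 31" step applied to an already computed estimate.

```java
// Hypothetical helper, not Hive code: nudges a reducer estimate
// off multiples of the hash multiplier 31.
final class ReducerEstimateAdjuster {
    static final int HASH_MULTIPLIER = 31;

    // Returns reducers + 1 when the estimate is a positive multiple of 31,
    // otherwise returns the estimate unchanged.
    static int adjust(int reducers) {
        if (reducers > 0 && reducers % HASH_MULTIPLIER == 0) {
            return reducers + 1;
        }
        return reducers;
    }

    public static void main(String[] args) {
        for (int estimate : new int[] {30, 31, 62, 100}) {
            System.out.println(estimate + " -> " + adjust(estimate));
        }
    }
}
```

Unlike a check for exactly 31, this also moves estimates such as 62 or 93 off a multiple of 31. The sketch ignores any configured upper bound on the reducer count, so a real change would need to decide what to do when the bumped value exceeds that bound.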
[jira] [Updated] (HIVE-14797) reducer number estimating may lead to data skew
[ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roncenzhao updated HIVE-14797:
------------------------------
    Attachment: HIVE-14797.patch