[jira] [Created] (HIVE-2182) Avoid null pointer exception when executing UDF
Avoid null pointer exception when executing UDF
-----------------------------------------------

Key: HIVE-2182
URL: https://issues.apache.org/jira/browse/HIVE-2182
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.5.0, 0.8.0
Environment: Hadoop 0.20.1, Hive 0.8.0 and SUSE Linux Enterprise Server 10 SP2 (i586) - Kernel 2.6.16.60-0.21-smp (5)
Reporter: Chinna Rao Lalam
Assignee: Chinna Rao Lalam

To use a UDF, the following steps were executed:

{noformat}
add jar /home/udf/udf.jar;
create temporary function grade as 'udf.Grade';
select m.userid,m.name,grade(m.maths,m.physics,m.chemistry) from marks m;
{noformat}

If the first step (add jar) is skipped and only the remaining steps are executed:

{noformat}
create temporary function grade as 'udf.Grade';
select m.userid,m.name,grade(m.maths,m.physics,m.chemistry) from marks m;
{noformat}

the tasktracker throws this exception:

{noformat}
Caused by: java.lang.RuntimeException: Map operator initialization failed
	at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121)
	... 18 more
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.initialize(GenericUDFBridge.java:126)
	at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:133)
	at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
	at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
	at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
	at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
	at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)
	... 18 more
Caused by: java.lang.NullPointerException
	at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:107)
	... 31 more
{noformat}

Instead of a null pointer exception, a meaningful exception should be thrown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
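The NPE happens because the UDF class never resolves when the jar was not added, and the null class is passed straight to reflection. A minimal sketch of the kind of guard the issue asks for (the class and method names here are illustrative, not Hive's actual code): fail fast with a message that names the missing class and hints at the forgotten `add jar` step.

```java
// Hypothetical sketch (not Hive's actual code): guard the UDF class lookup
// so a missing jar produces a descriptive error instead of an NPE deep
// inside ReflectionUtils.newInstance().
public class UdfClassGuard {

    /** Resolves a UDF class by name, failing with a meaningful message. */
    public static Class<?> resolveUdfClass(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            // Surface the likely cause to the user instead of letting a
            // null class propagate into reflection code.
            throw new RuntimeException("UDF class " + className
                + " was not found on the classpath. Did you forget to run"
                + " 'add jar' before 'create temporary function'?", e);
        }
    }

    public static void main(String[] args) {
        try {
            resolveUdfClass("udf.Grade"); // jar was never added
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a guard like this, the tasktracker log would point directly at the missing jar rather than at ConcurrentHashMap.get().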
Re: Review Request: HIVE-2147 : Add api to send / receive message to metastore
On 2011-05-25 03:43:30, Carl Steinbach wrote:
> trunk/metastore/if/hive_metastore.thrift, line 347
> <https://reviews.apache.org/r/738/diff/1/?file=18685#file18685line347>
>
> Having separate calls for sending request and response messages looks unnecessary. A sendMessage() function with separate request and response message types should work just as well, and will help to avoid confusion -- otherwise I think people will assume that receiveMessage is a polling call.
>
> This is starting to look like a general purpose messaging/rpc framework. Is that the intent?

"A sendMessage() function with separate request and response message types should work just as well."

That is correct, but semantically they are different. With sendMessage() the user is just notifying the Metastore of an event and does not care about the return value. With recvMessage() the user is asking for a response to his message. This distinction is further enforced by the return types. We could have a single sendMessage() API for both, as you suggested, but having distinct APIs for sending and receiving makes it easier for the client to understand the semantics.

"This is starting to look like a general purpose messaging/rpc framework."

A general purpose RPC framework would be much more sophisticated. I am not aiming for that.

On 2011-05-25 03:43:30, Carl Steinbach wrote:
> trunk/metastore/if/hive_metastore.thrift, line 348
> <https://reviews.apache.org/r/738/diff/1/?file=18685#file18685line348>
>
> Identifying the message type using an integer seems brittle. This won't work if you have more than one application that is firing events at the metastore.

There are two other alternatives I considered before settling on this one:

1) Add specific APIs for different message types. This would have made this generic API redundant, but it would result in application-specific APIs in the metastore. E.g., in HCatalog we want to send a message for a set of partitions telling the Metastore to mark them as done. What would finalizePartition() mean in the metastore API when the Metastore itself is not aware of this application-specific concept? This would be confusing.

2) Use enums instead of integers. This results in a similar problem, though on a smaller scale. Enums give compile-time safety, so we would have to define them in the Metastore code. Defining application-specific enums there doesn't look like a good idea, for similar reasons.

On 2011-05-25 03:43:30, Carl Steinbach wrote:
> trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java, line 3126
> <https://reviews.apache.org/r/738/diff/1/?file=18694#file18694line3126>
>
> So the event model is that each event may be handled by at most one event handler?

Yes.

On 2011-05-25 03:43:30, Carl Steinbach wrote:
> trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java, line 3134
> <https://reviews.apache.org/r/738/diff/1/?file=18694#file18694line3134>
>
> Please add some DEBUG or TRACE level logging here that indicates which handler consumed a particular event, or if an event was unserviceable.

Will add logging.

On 2011-05-25 03:43:30, Carl Steinbach wrote:
> trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java, line 3149
> <https://reviews.apache.org/r/738/diff/1/?file=18694#file18694line3149>
>
> Semantically this function looks more like sendRequest than receiveMessage (and sendMessage looks more like fireEvent).

Same as my very first comment.

On 2011-05-25 03:43:30, Carl Steinbach wrote:
> trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java, line 3151
> <https://reviews.apache.org/r/738/diff/1/?file=18694#file18694line3151>
>
> Checkstyle: you need a space between control flow tokens and open parens.

Will roll this in.

On 2011-05-25 03:43:30, Carl Steinbach wrote:
> trunk/metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java, line 640
> <https://reviews.apache.org/r/738/diff/1/?file=18696#file18696line640>
>
> Nice to have: javadoc.

Will add.

On 2011-05-25 03:43:30, Carl Steinbach wrote:
> trunk/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreEventListener.java, line 86
> <https://reviews.apache.org/r/738/diff/1/?file=18697#file18697line86>
>
> canProcessSendMessage() looks like a redundant call. Is there any reason that this can't be rolled into processSendMessage()?

The event model is that every event is handled by at most one handler. If we roll this into processSendMessage(), we have to make that method return a boolean indicating whether the event was serviced by this handler or not. But then how would it communicate back the actual return value? For sendMessage() this is fine, but recvMessage() returns a valid value which then needs to be returned to the client. So we first ask the handler whether it can handle the message, and only then expect a valid return value from processRecvMessage().
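The probe-then-process dispatch discussed above can be sketched as follows. This is an illustrative sketch only, assuming a simplified listener contract; the interface, method signatures, and message payload type here are hypothetical, not Hive's actual MetaStoreEventListener API.

```java
import java.util.List;

// Hypothetical sketch of the dispatch described in the review: each message
// is offered to listeners in order, and at most one listener services it.
// The probe (canProcessRecvMessage) is separate from the processing call so
// the processing call's return value stays free for the client's response.
public class MessageDispatch {

    /** Illustrative listener contract: probe first, then process. */
    interface Listener {
        boolean canProcessRecvMessage(int messageType);
        String processRecvMessage(int messageType, String payload);
    }

    /** Returns the first willing listener's response, or null if unserviced. */
    static String receiveMessage(List<Listener> listeners, int type, String payload) {
        for (Listener l : listeners) {
            if (l.canProcessRecvMessage(type)) {
                // at-most-one-handler semantics: stop at the first taker
                return l.processRecvMessage(type, payload);
            }
        }
        return null; // unserviceable event
    }

    public static void main(String[] args) {
        Listener markDone = new Listener() {
            public boolean canProcessRecvMessage(int t) { return t == 1; }
            public String processRecvMessage(int t, String p) { return "done:" + p; }
        };
        System.out.println(receiveMessage(List.of(markDone), 1, "part=2011-05-25"));
    }
}
```

The split avoids overloading one return value with both "was this handled?" and "what is the response?", which is exactly the objection raised against merging canProcessSendMessage() into processSendMessage().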
Build failed in Jenkins: Hive-branch-0.7.1-h0.21 #4
See https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/4/

--
[...truncated 27403 lines...]
[junit] POSTHOOK: query: create table testhivedrivertable (num int)
[junit] POSTHOOK: type: CREATETABLE
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] PREHOOK: query: load data local inpath 'https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt' into table testhivedrivertable
[junit] PREHOOK: type: LOAD
[junit] Copying data from https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt
[junit] Loading data to table default.testhivedrivertable
[junit] POSTHOOK: query: load data local inpath 'https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt' into table testhivedrivertable
[junit] POSTHOOK: type: LOAD
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] PREHOOK: query: select count(1) as cnt from testhivedrivertable
[junit] PREHOOK: type: QUERY
[junit] PREHOOK: Input: default@testhivedrivertable
[junit] PREHOOK: Output: file:/tmp/hudson/hive_2011-05-25_12-09-41_516_5538876000417618343/-mr-1
[junit] Total MapReduce jobs = 1
[junit] Launching Job 1 out of 1
[junit] Number of reduce tasks determined at compile time: 1
[junit] In order to change the average load for a reducer (in bytes):
[junit]   set hive.exec.reducers.bytes.per.reducer=number
[junit] In order to limit the maximum number of reducers:
[junit]   set hive.exec.reducers.max=number
[junit] In order to set a constant number of reducers:
[junit]   set mapred.reduce.tasks=number
[junit] Job running in-process (local Hadoop)
[junit] 2011-05-25 12:09:44,599 null map = 100%, reduce = 100%
[junit] Ended Job = job_local_0001
[junit] POSTHOOK: query: select count(1) as cnt from testhivedrivertable
[junit] POSTHOOK: type: QUERY
[junit] POSTHOOK: Input: default@testhivedrivertable
[junit] POSTHOOK: Output: file:/tmp/hudson/hive_2011-05-25_12-09-41_516_5538876000417618343/-mr-1
[junit] OK
[junit] PREHOOK: query: drop table testhivedrivertable
[junit] PREHOOK: type: DROPTABLE
[junit] PREHOOK: Input: default@testhivedrivertable
[junit] PREHOOK: Output: default@testhivedrivertable
[junit] POSTHOOK: query: drop table testhivedrivertable
[junit] POSTHOOK: type: DROPTABLE
[junit] POSTHOOK: Input: default@testhivedrivertable
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] Hive history file=https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/build/service/tmp/hive_job_log_hudson_201105251209_1058215749.txt
[junit] PREHOOK: query: drop table testhivedrivertable
[junit] PREHOOK: type: DROPTABLE
[junit] POSTHOOK: query: drop table testhivedrivertable
[junit] POSTHOOK: type: DROPTABLE
[junit] OK
[junit] PREHOOK: query: create table testhivedrivertable (num int)
[junit] PREHOOK: type: CREATETABLE
[junit] POSTHOOK: query: create table testhivedrivertable (num int)
[junit] POSTHOOK: type: CREATETABLE
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] PREHOOK: query: load data local inpath 'https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt' into table testhivedrivertable
[junit] PREHOOK: type: LOAD
[junit] Copying data from https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt
[junit] Loading data to table default.testhivedrivertable
[junit] POSTHOOK: query: load data local inpath 'https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt' into table testhivedrivertable
[junit] POSTHOOK: type: LOAD
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] PREHOOK: query: select * from testhivedrivertable limit 10
[junit] PREHOOK: type: QUERY
[junit] PREHOOK: Input: default@testhivedrivertable
[junit] PREHOOK: Output: file:/tmp/hudson/hive_2011-05-25_12-09-46_073_8776542399968628601/-mr-1
[junit] POSTHOOK: query: select * from testhivedrivertable limit 10
[junit] POSTHOOK: type: QUERY
[junit] POSTHOOK: Input: default@testhivedrivertable
[junit] POSTHOOK: Output: file:/tmp/hudson/hive_2011-05-25_12-09-46_073_8776542399968628601/-mr-1
[junit] OK
[junit] PREHOOK: query: drop table testhivedrivertable
[junit] PREHOOK: type: DROPTABLE
[junit] PREHOOK: Input: default@testhivedrivertable
[junit] PREHOOK: Output: default@testhivedrivertable
[junit] POSTHOOK: query: drop table testhivedrivertable
[junit] POSTHOOK: type: DROPTABLE
[junit] POSTHOOK: Input: default@testhivedrivertable
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] Hive history
Build failed in Jenkins: Hive-trunk-h0.21 #749
See https://builds.apache.org/hudson/job/Hive-trunk-h0.21/749/

--
[...truncated 32138 lines...]
[echo] Writing POM to https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build/jdbc/pom.xml
No ivy:settings found for the default reference 'ivy.instance'. A default instance will be used
no settings file found, using default...
:: loading settings :: url = jar:file:/home/hudson/.ant/lib/ivy-2.0.0-rc2.jar!/org/apache/ivy/core/settings/ivysettings.xml

ivy-init-dirs:

ivy-download:
[get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
[get] To: https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build/ivy/lib/ivy-2.1.0.jar
[get] Not modified - so not downloaded

ivy-probe-antlib:

ivy-init-antlib:

ivy-init:

check-ivy:

create-dirs:

compile-ant-tasks:

create-dirs:

init:

compile:
[echo] Compiling: anttasks
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/ant/build.xml:40: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

deploy-ant-tasks:

create-dirs:

init:

compile:
[echo] Compiling: anttasks
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/ant/build.xml:40: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

jar:

init:

install-hadoopcore:

install-hadoopcore-default:

ivy-init-dirs:

ivy-download:
[get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
[get] To: https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build/ivy/lib/ivy-2.1.0.jar
[get] Not modified - so not downloaded

ivy-probe-antlib:

ivy-init-antlib:

ivy-init:

ivy-retrieve-hadoop-source:
:: loading settings :: file = https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/ivy/ivysettings.xml
[ivy:retrieve] :: resolving dependencies :: org.apache.hive#hive-hwi;0.8.0-SNAPSHOT
[ivy:retrieve] 	confs: [default]
[ivy:retrieve] 	found hadoop#core;0.20.1 in hadoop-source
[ivy:retrieve] :: resolution report :: resolve 661ms :: artifacts dl 1ms
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
	---------------------------------------------------------------------
[ivy:retrieve] :: retrieving :: org.apache.hive#hive-hwi
[ivy:retrieve] 	confs: [default]
[ivy:retrieve] 	0 artifacts copied, 1 already retrieved (0kB/1ms)

install-hadoopcore-internal:

setup:

war:

compile:
[echo] Compiling: hwi
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/hwi/build.xml:71: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

jar:
[echo] Jar: hwi

make-pom:
[echo] Writing POM to https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build/hwi/pom.xml
No ivy:settings found for the default reference 'ivy.instance'. A default instance will be used
no settings file found, using default...
:: loading settings :: url = jar:file:/home/hudson/.ant/lib/ivy-2.0.0-rc2.jar!/org/apache/ivy/core/settings/ivysettings.xml

ivy-init-dirs:

ivy-download:
[get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
[get] To: https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build/ivy/lib/ivy-2.1.0.jar
[get] Not modified - so not downloaded

ivy-probe-antlib:

ivy-init-antlib:

ivy-init:

check-ivy:

create-dirs:

compile-ant-tasks:

create-dirs:

init:

compile:
[echo] Compiling: anttasks
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/ant/build.xml:40: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

deploy-ant-tasks:

create-dirs:

init:

compile:
[echo] Compiling: anttasks
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/ant/build.xml:40: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

jar:

init:

setup:

compile:
[echo] Compiling: hbase-handler
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build-common.xml:299: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[copy] Warning: https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/hbase-handler/src/java/conf does not exist.

jar:
[echo] Jar: hbase-handler

make-pom:
[echo] Writing POM to
[jira] [Created] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
----------------------------------------------------------------------------------------------------------------------------

Key: HIVE-2185
URL: https://issues.apache.org/jira/browse/HIVE-2185
Project: Hive
Issue Type: New Feature
Components: Serializers/Deserializers, Statistics
Reporter: Tomasz Nykiel
Assignee: Tomasz Nykiel

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands, we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.

Currently, a large part of the statistics collection mechanism is hardcoded and not easily extensible for other statistics. On top of adding the new statistic, it would be desirable to extend the collection mechanism so that any new statistics could be added easily.
[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
[ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039493#comment-13039493 ]

jirapos...@reviews.apache.org commented on HIVE-2185:
-----------------------------------------------------

This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/

Review request for hive.

Summary
-------

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands, we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. We introduce a new feature for collecting information about the sizes of uncompressed data, to be able to determine the efficiency of compression. On top of adding the new statistic, this patch extends the stats collection mechanism so that any new statistics can be added easily.

1. Serializer/deserializer classes are amended to accommodate collecting the sizes of uncompressed data when serializing/deserializing objects. We support ColumnarSerDe, LazySimpleSerDe, and LazyBinarySerDe. For other SerDe classes the uncompressed size will be 0.
2. The StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection, for both JDBC and HBase.
3. For INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator respectively are extended to support multi-stats collection. (2) and (3) enable easy extension for other types of statistics.
4. Collecting the uncompressed size can be disabled by setting: hive.stats.collect.uncompressedsize = false

This addresses bug HIVE-2185.
https://issues.apache.org/jira/browse/HIVE-2185

Diffs
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1127756
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1127756
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1127756
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1127756
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1127756
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1127756
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1127756
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1127756
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1127756
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1127756
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1127756
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1127756
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1127756
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1127756
[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
[ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomasz Nykiel updated HIVE-2185:
--------------------------------
    Attachment: HIVE-2185.patch

extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
----------------------------------------------------------------------------------------------------------------------------

Key: HIVE-2185
URL: https://issues.apache.org/jira/browse/HIVE-2185
Project: Hive
Issue Type: New Feature
Components: Serializers/Deserializers, Statistics
Reporter: Tomasz Nykiel
Assignee: Tomasz Nykiel
Attachments: HIVE-2185.patch

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands, we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression.

Currently, a large part of the statistics collection mechanism is hardcoded and not easily extensible for other statistics. On top of adding the new statistic, it would be desirable to extend the collection mechanism so that any new statistics could be added easily.
Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
---

Review request for hive.

Summary
-------

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands, we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. We introduce a new feature for collecting information about the sizes of uncompressed data, to be able to determine the efficiency of compression. On top of adding the new statistic, this patch extends the stats collection mechanism so that any new statistics can be added easily.

1. Serializer/deserializer classes are amended to accommodate collecting the sizes of uncompressed data when serializing/deserializing objects. We support ColumnarSerDe, LazySimpleSerDe, and LazyBinarySerDe. For other SerDe classes the uncompressed size will be 0.
2. The StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection, for both JDBC and HBase.
3. For INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator respectively are extended to support multi-stats collection. (2) and (3) enable easy extension for other types of statistics.
4. Collecting the uncompressed size can be disabled by setting: hive.stats.collect.uncompressedsize = false

This addresses bug HIVE-2185.
https://issues.apache.org/jira/browse/HIVE-2185

Diffs
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1127756
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1127756
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1127756
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1127756
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1127756
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1127756
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1127756
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1127756
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1127756
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1127756
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1127756
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1127756
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1127756
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1127756
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1127756
  trunk/ql/src/test/results/clientpositive/combine2.q.out 1127756
  trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1127756
  trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1127756
  trunk/ql/src/test/results/clientpositive/merge3.q.out 1127756
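The multi-stats idea in the summary can be illustrated with a small sketch: statistics are collected as a name-to-value map rather than a single hardcoded row count, so a new statistic slots in without changing the interfaces. This is an illustrative sketch only; the class, method, and stat-key names below are hypothetical, and only `hive.stats.collect.uncompressedsize` and the notions of row count and uncompressed size come from the review summary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of multi-stats collection (not Hive's actual code):
// publishing a map of named statistics instead of a single row count means
// adding a new statistic only adds a map entry, not an interface change.
public class MultiStatsSketch {

    /** Collects named statistics; the uncompressed size is optional. */
    static Map<String, Long> collect(long rows, long uncompressedBytes,
                                     boolean collectUncompressedSize) {
        Map<String, Long> stats = new HashMap<>();
        stats.put("numRows", rows);
        // mirrors the hive.stats.collect.uncompressedsize = false switch
        if (collectUncompressedSize) {
            stats.put("rawDataSize", uncompressedBytes);
        }
        return stats;
    }

    public static void main(String[] args) {
        System.out.println(collect(500, 5812, true));
        System.out.println(collect(500, 5812, false));
    }
}
```

A publisher consuming such a map can iterate the entries generically, which is what lets (2) and (3) in the summary support future statistics without further API changes.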