[jira] [Created] (HIVE-2182) Avoid null pointer exception when executing UDF

2011-05-25 Thread Chinna Rao Lalam (JIRA)
Avoid null pointer exception when executing UDF
---

 Key: HIVE-2182
 URL: https://issues.apache.org/jira/browse/HIVE-2182
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 0.5.0, 0.8.0
 Environment: Hadoop 0.20.1, Hive 0.8.0, and SUSE Linux Enterprise Server 10 SP2 (i586) - Kernel 2.6.16.60-0.21-smp (5)
Reporter: Chinna Rao Lalam
Assignee: Chinna Rao Lalam


To use a UDF, the following steps are executed:

{noformat}
add jar /home/udf/udf.jar;
create temporary function grade as 'udf.Grade';
select m.userid,m.name,grade(m.maths,m.physics,m.chemistry) from marks m;
{noformat}

But if we miss the first step (add jar) and execute only the remaining steps:

{noformat}
create temporary function grade as 'udf.Grade';
select m.userid,m.name,grade(m.maths,m.physics,m.chemistry) from marks m;
{noformat}

The TaskTracker then throws this exception:
{noformat}
Caused by: java.lang.RuntimeException: Map operator initialization failed
    at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121)
    ... 18 more
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.initialize(GenericUDFBridge.java:126)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:133)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)
    ... 18 more
Caused by: java.lang.NullPointerException
    at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:107)
    ... 31 more
{noformat}
Instead of a NullPointerException, it should throw a meaningful exception, along the lines of the sketch below.
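
A minimal illustration of the kind of guard this calls for (all names here are hypothetical, not the committed fix): resolve the UDF class eagerly and report the missing jar, instead of letting a null Class reach ReflectionUtils.newInstance().

{noformat}
// Sketch only: illustrative guard, not the actual Hive patch.
public class UdfClassGuard {

    public static Object instantiateUdf(String udfClassName) {
        Class<?> udfClass;
        try {
            udfClass = Class.forName(udfClassName, true,
                    Thread.currentThread().getContextClassLoader());
        } catch (ClassNotFoundException e) {
            // The meaningful error this issue asks for.
            throw new RuntimeException("UDF class '" + udfClassName
                    + "' not found on the classpath; did you forget 'add jar'?", e);
        }
        try {
            return udfClass.newInstance();
        } catch (Exception e) {
            throw new RuntimeException("Could not instantiate UDF class '"
                    + udfClassName + "'", e);
        }
    }

    public static void main(String[] args) {
        // Without udf.jar on the classpath this fails with a clear message
        // rather than a NullPointerException.
        System.out.println(instantiateUdf("udf.Grade"));
    }
}
{noformat}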

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Review Request: HIVE-2147 : Add api to send / receive message to metastore

2011-05-25 Thread Ashutosh Chauhan


 On 2011-05-25 03:43:30, Carl Steinbach wrote:
  trunk/metastore/if/hive_metastore.thrift, line 347
  https://reviews.apache.org/r/738/diff/1/?file=18685#file18685line347
 
  Having separate calls for sending request and response messages looks 
  unnecessary. A sendMessage() function with separate request and response 
  message types should work just as well, and will help to avoid confusion -- 
  otherwise I think people will assume that receiveMessage is a polling call.
  
  This is starting to look like a general purpose messaging/rpc 
  framework. Is that the intent?
 

 A sendMessage() function with separate request and response message types 
 should work just as well.
That is correct, but semantically they are different. In sendMessage() the user is 
just notifying the Metastore of an event and does not care about the return value; 
in recvMessage() the user is asking for a response to his message. This distinction 
is further enforced by the return types. We could have just one API, sendMessage(), 
for both as you suggested, but having distinct APIs for sending and receiving 
makes it easier for clients to understand the semantics.

 This is starting to look like a general purpose messaging/rpc framework. 
Well, a general-purpose RPC framework would be much more sophisticated. I am not 
aiming for that.
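
For concreteness, the client-side shape of the two calls might look like this (hypothetical Java signatures for illustration, not the committed API):

{noformat}
/**
 * Hypothetical shape of the two metastore messaging calls under discussion.
 */
public interface MetaStoreMessaging {

    /** Fire-and-forget: notify the Metastore of an event; the reply is ignored. */
    void sendMessage(int messageType, String message);

    /** Request/response: ask the Metastore and wait for its answer. */
    String recvMessage(int messageType, String message);
}
{noformat}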


 On 2011-05-25 03:43:30, Carl Steinbach wrote:
  trunk/metastore/if/hive_metastore.thrift, line 348
  https://reviews.apache.org/r/738/diff/1/?file=18685#file18685line348
 
  Identifying the message type using an integer seems brittle. This won't 
  work if you have more than one application that is firing events at the 
  metastore.

There are two other alternatives that I considered before settling on this one.
1) Add specific APIs for different message types. This would have made this 
generic API redundant, but it would also result in application-specific APIs in 
the metastore. E.g., in HCatalog we want to send a message for a set of 
partitions telling the Metastore to mark them as done. What would 
finalizePartition() mean in the metastore API when the Metastore itself is not 
aware of this concept, which is application specific? This would be confusing.
2) Use enums instead of an integer. This results in a similar problem as above, 
though on a smaller scale. Enums give compile-time safety, so we would have to 
define them in the Metastore code. Defining application-specific enums does not 
look like a good idea for similar reasons.


 On 2011-05-25 03:43:30, Carl Steinbach wrote:
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java,
   line 3126
  https://reviews.apache.org/r/738/diff/1/?file=18694#file18694line3126
 
  So the event model is that each event may be handled by at most one 
  event handler?

Yes.


 On 2011-05-25 03:43:30, Carl Steinbach wrote:
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java,
   line 3134
  https://reviews.apache.org/r/738/diff/1/?file=18694#file18694line3134
 
  Please add some DEBUG or TRACE level logging here that indicates which 
  handler consumed a particular event, or if an event was unserviceable.

Will add logging.


 On 2011-05-25 03:43:30, Carl Steinbach wrote:
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java,
   line 3149
  https://reviews.apache.org/r/738/diff/1/?file=18694#file18694line3149
 
  Semantically this function looks more like sendRequest than 
  receiveMessage (and sendMessage looks more like fireEvent).

Same as my very first comment.


 On 2011-05-25 03:43:30, Carl Steinbach wrote:
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java,
   line 3151
  https://reviews.apache.org/r/738/diff/1/?file=18694#file18694line3151
 
  Checkstyle: you need a space between control flow tokens and open 
  parens.
 

Will roll this in.


 On 2011-05-25 03:43:30, Carl Steinbach wrote:
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java,
   line 640
  https://reviews.apache.org/r/738/diff/1/?file=18696#file18696line640
 
  Nice to have: javadoc.

Will add.


 On 2011-05-25 03:43:30, Carl Steinbach wrote:
  trunk/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreEventListener.java,
   line 86
  https://reviews.apache.org/r/738/diff/1/?file=18697#file18697line86
 
  canProcessSendMessage() looks like a redundant call. Is there any 
  reason that this can't be rolled into processSendMessage()?
 

The event model is that every event is handled by at most one handler. If we roll 
this into processSendMessage(), then we have to make that method return a boolean 
telling whether the event got serviced by this handler or not; but then how would 
it communicate back the actual return value? In the case of sendMessage() this is 
fine, but recvMessage() returns a valid value which then needs to be returned to 
the client. So we first ask the handler whether it can handle the message, and 
only then expect a valid return value from processRecvMessage(). A sketch of this 
dispatch follows.
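
A minimal sketch of that dispatch (illustrative names, not the actual HiveMetaStore code):

{noformat}
import java.util.List;

// Each message is offered to the listeners in order and consumed by at most
// one of them; the canProcess probe is kept separate from process so that
// recvMessage() can still hand the handler's real return value to the client.
public class MessageDispatcher {

    public interface Listener {
        boolean canProcessMessage(int messageType);
        String processMessage(int messageType, String message);
    }

    private final List<Listener> listeners;

    public MessageDispatcher(List<Listener> listeners) {
        this.listeners = listeners;
    }

    public String dispatch(int messageType, String message) {
        for (Listener l : listeners) {
            if (l.canProcessMessage(messageType)) {
                // First matching handler wins; at most one services the event.
                return l.processMessage(messageType, message);
            }
        }
        throw new IllegalStateException(
                "No listener can service message type " + messageType);
    }
}
{noformat}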

Build failed in Jenkins: Hive-branch-0.7.1-h0.21 #4

2011-05-25 Thread Apache Jenkins Server
See https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/4/

--
[...truncated 27403 lines...]
[junit] POSTHOOK: query: create table testhivedrivertable (num int)
[junit] POSTHOOK: type: CREATETABLE
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] PREHOOK: query: load data local inpath 'https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt' into table testhivedrivertable
[junit] PREHOOK: type: LOAD
[junit] Copying data from https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt
[junit] Loading data to table default.testhivedrivertable
[junit] POSTHOOK: query: load data local inpath 'https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt' into table testhivedrivertable
[junit] POSTHOOK: type: LOAD
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] PREHOOK: query: select count(1) as cnt from testhivedrivertable
[junit] PREHOOK: type: QUERY
[junit] PREHOOK: Input: default@testhivedrivertable
[junit] PREHOOK: Output: file:/tmp/hudson/hive_2011-05-25_12-09-41_516_5538876000417618343/-mr-1
[junit] Total MapReduce jobs = 1
[junit] Launching Job 1 out of 1
[junit] Number of reduce tasks determined at compile time: 1
[junit] In order to change the average load for a reducer (in bytes):
[junit]   set hive.exec.reducers.bytes.per.reducer=number
[junit] In order to limit the maximum number of reducers:
[junit]   set hive.exec.reducers.max=number
[junit] In order to set a constant number of reducers:
[junit]   set mapred.reduce.tasks=number
[junit] Job running in-process (local Hadoop)
[junit] 2011-05-25 12:09:44,599 null map = 100%,  reduce = 100%
[junit] Ended Job = job_local_0001
[junit] POSTHOOK: query: select count(1) as cnt from testhivedrivertable
[junit] POSTHOOK: type: QUERY
[junit] POSTHOOK: Input: default@testhivedrivertable
[junit] POSTHOOK: Output: file:/tmp/hudson/hive_2011-05-25_12-09-41_516_5538876000417618343/-mr-1
[junit] OK
[junit] PREHOOK: query: drop table testhivedrivertable
[junit] PREHOOK: type: DROPTABLE
[junit] PREHOOK: Input: default@testhivedrivertable
[junit] PREHOOK: Output: default@testhivedrivertable
[junit] POSTHOOK: query: drop table testhivedrivertable
[junit] POSTHOOK: type: DROPTABLE
[junit] POSTHOOK: Input: default@testhivedrivertable
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] Hive history file=https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/build/service/tmp/hive_job_log_hudson_201105251209_1058215749.txt
[junit] PREHOOK: query: drop table testhivedrivertable
[junit] PREHOOK: type: DROPTABLE
[junit] POSTHOOK: query: drop table testhivedrivertable
[junit] POSTHOOK: type: DROPTABLE
[junit] OK
[junit] PREHOOK: query: create table testhivedrivertable (num int)
[junit] PREHOOK: type: CREATETABLE
[junit] POSTHOOK: query: create table testhivedrivertable (num int)
[junit] POSTHOOK: type: CREATETABLE
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] PREHOOK: query: load data local inpath 'https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt' into table testhivedrivertable
[junit] PREHOOK: type: LOAD
[junit] Copying data from https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt
[junit] Loading data to table default.testhivedrivertable
[junit] POSTHOOK: query: load data local inpath 'https://builds.apache.org/hudson/job/Hive-branch-0.7.1-h0.21/ws/hive/data/files/kv1.txt' into table testhivedrivertable
[junit] POSTHOOK: type: LOAD
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] PREHOOK: query: select * from testhivedrivertable limit 10
[junit] PREHOOK: type: QUERY
[junit] PREHOOK: Input: default@testhivedrivertable
[junit] PREHOOK: Output: file:/tmp/hudson/hive_2011-05-25_12-09-46_073_8776542399968628601/-mr-1
[junit] POSTHOOK: query: select * from testhivedrivertable limit 10
[junit] POSTHOOK: type: QUERY
[junit] POSTHOOK: Input: default@testhivedrivertable
[junit] POSTHOOK: Output: file:/tmp/hudson/hive_2011-05-25_12-09-46_073_8776542399968628601/-mr-1
[junit] OK
[junit] PREHOOK: query: drop table testhivedrivertable
[junit] PREHOOK: type: DROPTABLE
[junit] PREHOOK: Input: default@testhivedrivertable
[junit] PREHOOK: Output: default@testhivedrivertable
[junit] POSTHOOK: query: drop table testhivedrivertable
[junit] POSTHOOK: type: DROPTABLE
[junit] POSTHOOK: Input: default@testhivedrivertable
[junit] POSTHOOK: Output: default@testhivedrivertable
[junit] OK
[junit] Hive history 

Build failed in Jenkins: Hive-trunk-h0.21 #749

2011-05-25 Thread Apache Jenkins Server
See https://builds.apache.org/hudson/job/Hive-trunk-h0.21/749/

--
[...truncated 32138 lines...]
 [echo]  Writing POM to https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build/jdbc/pom.xml
No ivy:settings found for the default reference 'ivy.instance'.  A default instance will be used
no settings file found, using default...
:: loading settings :: url = jar:file:/home/hudson/.ant/lib/ivy-2.0.0-rc2.jar!/org/apache/ivy/core/settings/ivysettings.xml

ivy-init-dirs:

ivy-download:
  [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
  [get] To: https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build/ivy/lib/ivy-2.1.0.jar
  [get] Not modified - so not downloaded

ivy-probe-antlib:

ivy-init-antlib:

ivy-init:

check-ivy:

create-dirs:

compile-ant-tasks:

create-dirs:

init:

compile:
 [echo] Compiling: anttasks
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/ant/build.xml:40: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

deploy-ant-tasks:

create-dirs:

init:

compile:
 [echo] Compiling: anttasks
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/ant/build.xml:40: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

jar:

init:

install-hadoopcore:

install-hadoopcore-default:

ivy-init-dirs:

ivy-download:
  [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
  [get] To: https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build/ivy/lib/ivy-2.1.0.jar
  [get] Not modified - so not downloaded

ivy-probe-antlib:

ivy-init-antlib:

ivy-init:

ivy-retrieve-hadoop-source:
:: loading settings :: file = https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/ivy/ivysettings.xml
[ivy:retrieve] :: resolving dependencies :: org.apache.hive#hive-hwi;0.8.0-SNAPSHOT
[ivy:retrieve]  confs: [default]
[ivy:retrieve]  found hadoop#core;0.20.1 in hadoop-source
[ivy:retrieve] :: resolution report :: resolve 661ms :: artifacts dl 1ms
---------------------------------------------------------------------
|                  |            modules            ||   artifacts   |
|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
|      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
---------------------------------------------------------------------
[ivy:retrieve] :: retrieving :: org.apache.hive#hive-hwi
[ivy:retrieve]  confs: [default]
[ivy:retrieve]  0 artifacts copied, 1 already retrieved (0kB/1ms)

install-hadoopcore-internal:

setup:

war:

compile:
 [echo] Compiling: hwi
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/hwi/build.xml:71: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

jar:
 [echo] Jar: hwi

make-pom:
 [echo]  Writing POM to https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build/hwi/pom.xml
No ivy:settings found for the default reference 'ivy.instance'.  A default instance will be used
no settings file found, using default...
:: loading settings :: url = jar:file:/home/hudson/.ant/lib/ivy-2.0.0-rc2.jar!/org/apache/ivy/core/settings/ivysettings.xml

ivy-init-dirs:

ivy-download:
  [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
  [get] To: https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build/ivy/lib/ivy-2.1.0.jar
  [get] Not modified - so not downloaded

ivy-probe-antlib:

ivy-init-antlib:

ivy-init:

check-ivy:

create-dirs:

compile-ant-tasks:

create-dirs:

init:

compile:
 [echo] Compiling: anttasks
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/ant/build.xml:40: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

deploy-ant-tasks:

create-dirs:

init:

compile:
 [echo] Compiling: anttasks
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/ant/build.xml:40: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds

jar:

init:

setup:

compile:
 [echo] Compiling: hbase-handler
[javac] https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/build-common.xml:299: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
 [copy] Warning: https://builds.apache.org/hudson/job/Hive-trunk-h0.21/ws/hive/hbase-handler/src/java/conf does not exist.

jar:
 [echo] Jar: hbase-handler

make-pom:
 [echo]  Writing POM to 

[jira] [Created] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-25 Thread Tomasz Nykiel (JIRA)
extend table statistics to store the size of uncompressed data (+extend 
interfaces for collecting other types of statistics)


 Key: HIVE-2185
 URL: https://issues.apache.org/jira/browse/HIVE-2185
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers, Statistics
Reporter: Tomasz Nykiel
Assignee: Tomasz Nykiel


Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
collect statistics about the number of rows per partition/table. Other 
statistics (e.g., total table/partition size) are derived from the file system. 

Here, we want to collect information about the sizes of uncompressed data, to 
be able to determine the efficiency of compression.
Currently, a large part of the statistics collection mechanism is hardcoded and 
not easily extensible to other statistics.
On top of adding the new statistic, it would be desirable to extend the 
collection mechanism so that any new statistics can be added easily.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-25 Thread jirapos...@reviews.apache.org (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039493#comment-13039493 ]

jirapos...@reviews.apache.org commented on HIVE-2185:
-


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
---

Review request for hive.


Summary
---

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
collect statistics about the number of rows per partition/table. 
Other statistics (e.g., total table/partition size) are derived from the file 
system.

We introduce a new feature for collecting information about the sizes of 
uncompressed data, to be able to determine the efficiency of compression.
On top of adding the new statistic, this patch extends the stats collection 
mechanism so that any new statistics can be added easily.

1. Serializer/deserializer classes are amended to accommodate collecting sizes 
of uncompressed data when serializing/deserializing objects (see the sketch 
after this list).
We support:

Columnar SerDe
LazySimpleSerDe
LazyBinarySerDe

For other SerDe classes the uncompressed size will be 0.

2. StatsPublisher / StatsAggregator interfaces are extended to support 
multi-stats collection for both JDBC and HBase.

3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and 
TableScanOperator respectively are extended to support multi-stats collection.

(2) and (3) enable easy extension for other types of statistics.

4. Collecting uncompressed size can be disabled by setting:

hive.stats.collect.uncompressedsize = false
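
To make (1) concrete, here is a minimal sketch of the bookkeeping a SerDe could do (hypothetical class for illustration, not part of the patch):

{noformat}
import java.io.UnsupportedEncodingException;

// Accumulates the raw (uncompressed) byte size of everything it serializes;
// comparing this against the on-disk file size shows compression efficiency.
public class UncompressedSizeTracker {

    private long uncompressedBytes = 0;

    // Serializes one field and records its uncompressed size as a side effect.
    public byte[] serializeField(String field) throws UnsupportedEncodingException {
        byte[] raw = field.getBytes("UTF-8");
        uncompressedBytes += raw.length; // stat collected alongside serialization
        return raw;                      // the caller may compress these bytes later
    }

    public long getUncompressedBytes() {
        return uncompressedBytes;
    }
}
{noformat}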


This addresses bug HIVE-2185.
https://issues.apache.org/jira/browse/HIVE-2185


Diffs
-

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756 
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1127756 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1127756 
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1127756 
  

[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-25 Thread Tomasz Nykiel (JIRA)

 [ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomasz Nykiel updated HIVE-2185:


Attachment: HIVE-2185.patch

 extend table statistics to store the size of uncompressed data (+extend 
 interfaces for collecting other types of statistics)
 

 Key: HIVE-2185
 URL: https://issues.apache.org/jira/browse/HIVE-2185
 Project: Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers, Statistics
Reporter: Tomasz Nykiel
Assignee: Tomasz Nykiel
 Attachments: HIVE-2185.patch


 Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
 collect statistics about the number of rows per partition/table. Other 
 statistics (e.g., total table/partition size) are derived from the file 
 system. 
 Here, we want to collect information about the sizes of uncompressed data, to 
 be able to determine the efficiency of compression.
 Currently, a large part of the statistics collection mechanism is hardcoded and 
 not easily extensible to other statistics.
 On top of adding the new statistic, it would be desirable to extend the 
 collection mechanism so that any new statistics can be added easily.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)

2011-05-25 Thread Tomasz Nykiel

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/
---

Review request for hive.


Summary
---

Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we 
collect statistics about the number of rows per partition/table. 
Other statistics (e.g., total table/partition size) are derived from the file 
system.

We introduce a new feature for collecting information about the sizes of 
uncompressed data, to be able to determine the efficiency of compression.
On top of adding the new statistic, this patch extends the stats collection 
mechanism so that any new statistics can be added easily.

1. Serializer/deserializer classes are amended to accommodate collecting sizes 
of uncompressed data when serializing/deserializing objects.
We support:

Columnar SerDe
LazySimpleSerDe
LazyBinarySerDe

For other SerDe classes the uncompressed size will be 0.

2. StatsPublisher / StatsAggregator interfaces are extended to support 
multi-stats collection for both JDBC and HBase (a sketch follows this list).

3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and 
TableScanOperator respectively are extended to support multi-stats collection.

(2) and (3) enable easy extension for other types of statistics.

4. Collecting uncompressed size can be disabled by setting:

hive.stats.collect.uncompressedsize = false
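
A minimal sketch of the multi-stat idea in (2) (hypothetical signatures, not the committed StatsPublisher interface): each task publishes a map of named statistics, so a new statistic plugs in without another interface change.

{noformat}
import java.util.Map;

// Hypothetical multi-stat publisher: one call carries any number of named
// statistics for a given task/partition key.
public interface MultiStatsPublisher {

    // key identifies the task/partition; stats maps stat name to value,
    // e.g. "numRows" -> "500", "rawDataSize" -> "5312".
    boolean publishStat(String key, Map<String, String> stats);

    // Releases any connection held by the publisher (JDBC, HBase, ...).
    boolean closeConnection();
}
{noformat}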


This addresses bug HIVE-2185.
https://issues.apache.org/jira/browse/HIVE-2185


Diffs
-

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1127756 
  trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1127756 
  trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION 
  trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756 
  trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1127756 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1127756 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1127756 
  trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/combine2.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1127756 
  trunk/ql/src/test/results/clientpositive/merge3.q.out 1127756