[ANNOUNCE] New Hive Committer - Navis Ryu
The Apache Hive PMC has passed a vote to make Navis Ryu a new committer on the project. JIRA is currently down, so I can't send out a link with his contribution list at the moment, but if you have an account at reviews.facebook.net, you can see his activity here: https://reviews.facebook.net/p/navis/ Navis, please submit your CLA to the Apache Software Foundation as described here: http://www.apache.org/licenses/#clas Congratulations! JVS
parallel execution for Hive unit tests
Hey all, Marek Sapota has put together a doc on the new scripts for spreading Hive unit test execution across a cluster: https://cwiki.apache.org/confluence/display/Hive/Unit+Test+Parallel+Execution Whether you are a committer or someone contributing patches, if you are currently frustrated by waiting for Hive tests to complete for each patch, please set this up and give it a try. JVS
Re: Hive UDFs/ FunctionRegistry etc
On Dec 8, 2011, at 12:20 PM, Sam William wrote: I have a bunch of custom UDFs and I'd like others in the company to make use of them in an easy way. I'm not very happy with the 'CREATE TEMPORARY FUNCTION' arrangement for each session. It'd be great if our site-specific functions worked the same way as the built-in functions. What options do I have other than modifying FunctionRegistry and recompiling? Hey Sam, At Facebook, we use a standard Hive init script to preload the temporary functions, using the CLI's -i option: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli You could also take a look at the work I recently did on loading builtins from a separate jar using Hive's PDK. If you want to work on an enhancement, it shouldn't be too hard to add a configuration option to preload additional jars and their functions using the same technique: https://issues.apache.org/jira/browse/HIVE-2523 A real SQL/J implementation of persistent functions in the metastore would be preferable, but that's a lot of work. JVS
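For anyone who wants to try the init-script approach, a minimal sketch is below. The jar path, function names, and class names are illustrative placeholders, not anything that ships with Hive:

-- shared init file, e.g. /etc/hive/site-functions.hql (hypothetical path)
ADD JAR /usr/local/lib/site-udfs.jar;
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.udf.NormalizeUrl';
CREATE TEMPORARY FUNCTION order_bucket AS 'com.example.hive.udf.OrderBucket';

Each user then launches the CLI with hive -i /etc/hive/site-functions.hql (typically via a wrapper script), so the functions are registered at the start of every session without touching FunctionRegistry.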
meeting minutes for 5-Dec-2011 contributor meeting
https://cwiki.apache.org/confluence/display/Hive/ContributorMinutes20111205 I created an INFRA ticket to take Hive out of Review Board: https://issues.apache.org/jira/browse/INFRA-4200 Please use Phabricator for all new review requests: https://cwiki.apache.org/confluence/display/Hive/PhabricatorCodeReview I'm updating the wiki accordingly. JVS
Re: Hive-Hbase integration require Hbase in Pseudo distributed??
As you can guess from the 0.89 dependency, there has been a lot of water under the bridge since this integration was developed. If someone would like to take on bringing it up to date, that would be great. Note that auxpath is to make the jars available in map/reduce task VM's (we don't put everything from lib there automatically). JVS On Dec 2, 2011, at 10:39 AM, jcfol...@pureperfect.com wrote: I am having the same issue. Hive won't connect to HBase and throws org.apache.hadoop.hbase.MasterNotRunningException despite the fact that the master is up and running. It may only work if HBase is in distributed mode or psuedo-distributed mode. I know HBase doesn't put files into HDFS otherwise. It certainly doesn't work for me running in standalone mode. I've tried about thirty different combinations of hive/hbase and can't get it going on any of them, so I switched to trying to get pseudo-distributed mode working in HBase, but haven't been able to find the magic combination of versions that will allow HBase to do anything in HDFS other than throw EOFExceptions. In any case, according to the Hive documentation (see below) it doesn't work with any version of HBase other than 0.89, but there are three 0.89 versions of HBase at archive.apache.org and the lib directories for Hive contain 0.89-SNAPSHOT. FYI: There's an official Hive/HBase integration page at the Confluence wiki, but that doesn't work either: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration It contains the instruction: The handler requires Hadoop 0.20 or higher, and has only been tested with dependency versions hadoop-0.20.x, hbase-0.89.0 and zookeeper-3.3.1. If you are not using hbase-0.89.0, you will need to rebuild the handler with the HBase jar matching your version, and change the --auxpath above accordingly. Failure to use matching versions will lead to misleading connection failures such as MasterNotRunningException since the HBase RPC protocol changes often. But that really doesn't make sense. Hive uses Ivy and you can't just simply replace the jar files. Updating the version Ivy fetches from the Apache repository in the second sentence contradicts the version exception in the previous sentence since the only official releases of HBase are 0.90 and forward: https://repository.apache.org/content/repositories/releases/org/apache/hbase/hbase/ I'm trying to get Hive to build against HBase 0.90 but Ivy wants to pull the 0.90 out of snapshots so trying to grab the jar file throws 404s. As a side note: the --auxpath seems unnecessary. The jars are already in the lib directory so it seems like they ought to be on the classpath already. Original Message Subject: Re: Hive-Hbase integration require Hbase in Pseudo distributed?? From: Mohammad Tariq donta...@gmail.com Date: Fri, December 02, 2011 7:28 am To: user@hive.apache.org Anyone there, cud you please confirm if I can use hive-hbase in standalone mode??? will it work? or should i go for Pseudo distributed mode ? Regards, Mohammad Tariq On Fri, Dec 2, 2011 at 5:54 PM, Alok Kumar alok...@gmail.com wrote: hi, yeah i've used $HIVE_HOME/bin/hive --auxpath $HIVE_HOME/lib/hive-hbase-handler-*.jar,$HIVE_HOME/lib/hbase-*.jar,$HIVE_HOME/lib/zookeeper-*.jar -hiveconf hbase.master=localhost:6 Hadoop version : hadoop-0.20.203.0 Hbase version : hbase-0.90.4 Hive version : hive-0.9.0 (built from trunk) on Ubuntu 11.10 --- Regards, Alok On Fri, Dec 2, 2011 at 5:49 PM, Ankit Jain ankitjainc...@gmail.com wrote: Hi, have you used following command to start the hive shell. 
$HIVE_HOME/bin/hive --auxpath $HIVE_HOME/lib/hive-hbase-handler-*.jar,$HIVE_HOME/lib/hbase-*.jar,$HIVE_HOME/lib/zookeeper-*.jar -hiveconf hbase.master=127.0.0.1:6 If not, then use the above command. Regards, Ankit On Fri, Dec 2, 2011 at 5:34 PM, Alok Kumar alok...@gmail.com wrote: Hi,

// Hadoop core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/alokkumar/hadoop/tmp</value>
  </property>
</configuration>

// hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- <value>hdfs://localhost:9000/hbase</value> -->
    <value>file:///home/alokkumar/hbase/</value>
  </property>
</configuration>

With these conf files HBase and Hive are running independently fine..

hbase(main):003:0> status
1 servers, 0 dead, 4. average load

but I'm still getting:

$ hive
hive> CREATE TABLE hbase_table_1(key int, value string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val");
FAILED: Error in metadata:
Re: Understanding Hive and Hbase
The queries go through the region servers, not directly to HDFS. JVS On Dec 2, 2011, at 10:53 AM, Gabriel Eisbruch wrote: Hi everybody, I have a question about HBase/Hive integration that I have not been able to find an answer to anywhere: if I run a Hive query that reads the HFiles from HDFS and runs on my Hadoop cluster, will the data in my region servers (not yet flushed) be processed by the query? Thank you very much Gabriel
Re: Understanding Hive and Hbase
Yes, everything goes through the HBase API. JVS On Dec 2, 2011, at 2:09 PM, Gabriel Eisbruch wrote: Ok, so the map/reduce jobs are connected to the region servers? Gabriel. On Dec 2, 2011, at 6:52 PM, John Sichi jsi...@fb.com wrote: The queries go through the region servers, not directly to HDFS. JVS On Dec 2, 2011, at 10:53 AM, Gabriel Eisbruch wrote: Hi everybody, I have a question about HBase/Hive integration that I have not been able to find an answer to anywhere: if I run a Hive query that reads the HFiles from HDFS and runs on my Hadoop cluster, will the data in my region servers (not yet flushed) be processed by the query? Thank you very much Gabriel
Re: Hive HBase wiki
It has been quite a while since those instructions were written, so maybe something has broken. There is a unit test for it (hbase-handler/src/test/queries/hbase_bulk.m) which is still passing. If you're running via CLI, logs by default go in /tmp/username. Long-term, energy best expended on this would go here: https://issues.apache.org/jira/browse/HIVE-2365 JVS On Nov 17, 2011, at 10:59 AM, Ben West wrote: Hey all, I'm having some trouble with the HBase bulk load, following the instructions from https://cwiki.apache.org/confluence/display/Hive/HBaseBulkLoad. In the last step (Sort Data) I get:

java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: No files found in hdfs://localhost/tmp/hive-cloudera/hive_2011-11-17_10-30-11_023_3494196694520237582/_tmp.-ext-1/_tmp.01_2
 at org.apache.hadoop.hive.ql.exec.ExecReducer.close(ExecReducer.java:311)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:479)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
 at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: No files found in hdfs://localhost/tmp/hive-cloudera/hive_2011-11-17_10-30-11_023_3494196694520237582/_tmp.-ext-1/_tmp.01_2
 at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:171)
 at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:642)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:557)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
 at org.apache.hadoop.hive.ql.exec.ExecReducer.close(ExecReducer.java:303)
 ... 7 more
Caused by: java.io.IOException: No files found in hdfs://localhost/tmp/hive-cloudera/hive_2011-11-17_10-30-11_023_3494196694520237582/_tmp.-ext-1/_tmp.01_2
 at org.apache.hadoop.hive.hbase.HiveHFileOutputFormat$2.close(HiveHFileOutputFormat.java:144)
 at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:168)
 ... 11 more

When I look at the source of HiveHFileOutputFormat.java it has:

// Move the region file(s) from the task output directory
// to the location specified by the user. There should
// actually only be one (each reducer produces one HFile),
// but we don't know what its name is.
FileSystem fs = outputdir.getFileSystem(jc);
fs.mkdirs(columnFamilyPath);
Path srcDir = outputdir;
for (;;) {
  FileStatus [] files = fs.listStatus(srcDir);
  if ((files == null) || (files.length == 0)) {
    throw new IOException("No files found in " + srcDir);
  }

So I am getting the issue where the task output directory is empty. I assume this is because the earlier task failed, but I'm not sure how to check this. Does anyone know what is going on or how I can find the error log of whatever was supposed to populate this directory? Thanks! -Ben
Re: testing out Phabricator for code review
Marek added support for svn, so that is working now too...give it a try! Instructions updated at https://cwiki.apache.org/confluence/display/Hive/PhabricatorCodeReview JVS On Oct 26, 2011, at 10:49 PM, wrote: I've put up instructions for how anyone can start using Phabricator for code review: https://cwiki.apache.org/confluence/display/Hive/PhabricatorCodeReview We've tested out the git workflows; still working on svn. Feedback on how it works for you, anything you noticed missing, etc is appreciated. JVS On Oct 20, 2011, at 2:00 PM, wrote: Hey all, Earlier this year, Facebook released a bunch of its code browsing/review tools as a new (and independent) open source project called Phabricator: http://phabricator.org/ We're currently experimenting with using it for improving the developer experience when contributing and reviewing Hive and HBase patches. (Also for eliminating committer confusion from different patch versions submitted to Review Board and JIRA, something which has bitten us a few times already.) You may notice some of this activity showing up in JIRA, e.g. https://issues.apache.org/jira/browse/HIVE-2515 I'll be sending out a lot more info once we've finished some of the setup and validation, but I just wanted to send out a heads-up for those who are already familiar with using the existing Review Board setup. Once validation is done, we'll publish instructions so that everyone can test it out for themselves as a potential alternative to Review Board. JVS
Re: testing out Phabricator for code review
I've put up instructions for how anyone can start using Phabricator for code review: https://cwiki.apache.org/confluence/display/Hive/PhabricatorCodeReview We've tested out the git workflows; still working on svn. Feedback on how it works for you, anything you noticed missing, etc is appreciated. JVS On Oct 20, 2011, at 2:00 PM, wrote: Hey all, Earlier this year, Facebook released a bunch of its code browsing/review tools as a new (and independent) open source project called Phabricator: http://phabricator.org/ We're currently experimenting with using it for improving the developer experience when contributing and reviewing Hive and HBase patches. (Also for eliminating committer confusion from different patch versions submitted to Review Board and JIRA, something which has bitten us a few times already.) You may notice some of this activity showing up in JIRA, e.g. https://issues.apache.org/jira/browse/HIVE-2515 I'll be sending out a lot more info once we've finished some of the setup and validation, but I just wanted to send out a heads-up for those who are already familiar with using the existing Review Board setup. Once validation is done, we'll publish instructions so that everyone can test it out for themselves as a potential alternative to Review Board. JVS
Re: Unit tests failing on hive-0.7.1
mirror.facebook.net is currently down and won't be back up for at least a few days. There's a fallback at http://archive.cloudera.com/hive-deps If it's not kicking in for you automatically, you'll need to edit ivy/ivysettings.xml. JVS On Sep 28, 2011, at 11:22 PM, Ramya Sunil wrote: Hi, I downloaded hive-0.7.1 and tried to run ant test However the build fails due to the following error: [ivy:retrieve] [FAILED ] hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source): The HTTP response code for http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz did not indicate a success. See log for more detail. (201ms) [ivy:retrieve] [FAILED ] hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source): The HTTP response code for http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz did not indicate a success. See log for more detail. (175ms) [ivy:retrieve] [FAILED ] hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source): The HTTP response code for http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz did not indicate a success. See log for more detail. (135ms) [ivy:retrieve] [FAILED ] hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source): The HTTP response code for http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz did not indicate a success. See log for more detail. (188ms) [ivy:retrieve] [FAILED ] hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source): (0ms) [ivy:retrieve] hadoop-source: tried [ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] apache-snapshot: tried [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] maven2: tried [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.20.3-CDH3-SNAPSHOT/core-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] datanucleus-repo: tried [ivy:retrieve] http://www.datanucleus.org/downloads/maven2/hadoop/core/0.20.3-CDH3-SNAPSHOT/core-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] :: [ivy:retrieve] :: UNRESOLVED DEPENDENCIES :: [ivy:retrieve] :: [ivy:retrieve] :: hadoop#core;0.20.1: not found [ivy:retrieve] :: [ivy:retrieve] :: [ivy:retrieve] :: FAILED DOWNLOADS:: [ivy:retrieve] :: ^ see resolution messages for details ^ :: [ivy:retrieve] :: [ivy:retrieve] :: hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source) [ivy:retrieve] :: Can anyone please let me know how to get around this problem? Thanks Ramya
Apache Software Foundation Branding Requirements
Hey, the Apache Hive project is responsible for coming into compliance with these: http://www.apache.org/foundation/marks/pmcs.html I've created a JIRA issue for tracking this, with sub-tasks for the various work items: https://issues.apache.org/jira/browse/HIVE-2432 Our quarterly reports from the PMC to the ASF board will continue to include status updates on these until they are all resolved. If you are interested in helping out with any of that, please assign the corresponding sub-tasks to yourself. JVS
Re: failed when create an index with partitioned by clause
The wiki docs are incorrect here. CREATE INDEX does not yet support a PARTITIONED BY clause; that was added in the spec to support HIVE-1499, which hasn't been implemented yet. For now, the index partitioning always follows the table partitioning exactly. JVS On Aug 14, 2011, at 3:22 AM, Daniel,Wu wrote:

create table part (a int, b int) PARTITIONED BY (c int);
create index part_idx on table part(b,c) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD partitioned by (a);

hive> create index part_idx on table part(b,c) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD partitioned by (a);
FAILED: Parse Error: line 2:0 mismatched input 'partitioned' expecting EOF near 'REBUILD'
hive>

If I remove the partitioned by (a), then the index can be created. But I need to partition it on a column. Is that not supported yet, or did I make some mistake?
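To make the current behavior concrete, here is a rough sketch of what does work today with the table from the question; the DDL is an illustrative assumption based on the index syntax in this release, not a tested transcript:

create table part (a int, b int) partitioned by (c int);
create index part_idx on table part(b)
  as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
  with deferred rebuild;
-- the index inherits the table's partitioning on c, so it is rebuilt per partition
alter index part_idx on part partition (c = 1) rebuild;

Until HIVE-1499 is implemented, you get per-partition index data this way, but you cannot give the index a partitioning scheme that differs from the table's.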
Re: Hive Wiki: Editing Permissions
I've granted you write access...thanks for helping to fix the wiki! JVS On Aug 8, 2011, at 11:08 AM, Travis Powell wrote: Hello, The Wiki has been one of the most important resources to me for learning Hive. There are a lot of broken links that make it hard to flip between topics. If I could gain editing permissions, I’d be happy to fix links and hopefully make this wiki a bit easier to learn from. Thanks, Travis Powell Travis Powell / tpow...@tealeaf.com
becoming a Hive committer
I often get asked questions about this topic, so I've put together a wiki page which expresses some of my thoughts on it: https://cwiki.apache.org/confluence/display/Hive/BecomingACommitter Let me know if there are points you'd like to add, or where you see it differently. JVS
Re: Hello!
As the comments in HIVE-1228 mention, we decided not to address the :timestamp requirement. So if you need that, you can work on enhancing the HBase storage handler by opening a JIRA issue, proposing an approach, and submitting a patch. JVS On Jul 24, 2011, at 12:42 PM, 张建轶 wrote: Hello! I'm a web developer in China, and I'm studying Hive and HBase for my job. I have encountered a problem with the HBase handler in Hive. I want the :timestamp column to be mapped for read or write, so that I can import timestamp data into an HBase table from Hive. I found the following URL for this issue: https://issues.apache.org/jira/browse/HIVE-1228 It seems that this problem has been resolved. But the page https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration says that there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp. How can I map the latest timestamp into Hive, or INSERT OVERWRITE TABLE hbase_table_1 with a timestamp? I have googled it many times, but found nothing. If you could help me, it would be a great kindness. I will appreciate your reply. With thanks and best wishes to you. Yours, Jianyi Zhang
Re: wiki has moved!
On Jun 27, 2011, at 4:37 PM, Time Less wrote: Might as well add me as an editor. I've found tons of errors and problems. Not the least of which is that the regexserde example is now completely borked and nonsensical. Compare ([^]*) ([^]*) ... against ([^ ]*) ([^ ]*) ... (the space inside the character class was dropped in the migration) - I thought I was going insane. Email me your Confluence account name. Also, Google still points to the old documentation, which doesn't exist. You need to add in some 301 redirects so Google will get the message, too: http://en.wikipedia.org/wiki/HTTP_301. I believe Google isn't the only HTTP client that will benefit from a 301 status. I don't have control over the MoinMoin server; if someone has something specific they can create an INFRA request, but the page name translation is not 1-to-1, so it's probably not worth the effort; the old stuff should age out, and the new stuff will get crawled soon enough. JVS
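For reference, the kind of RegexSerDe table definition under discussion looks roughly like this; it is a hedged sketch in the style of the wiki's space-delimited-log example, with placeholder table and column names, not a quote from the page:

CREATE TABLE weblog (host STRING, identity STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- note the space inside each character class: [^ ]*, not [^]*
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*)"
)
STORED AS TEXTFILE;

With the space dropped during migration, the pattern no longer matches space-delimited fields as intended, which is presumably the borked behavior being reported.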
Re: wiki has moved!
On Jun 27, 2011, at 5:16 PM, wrote: I don't have control over the MoinMoin server; if someone has something specific they can create an INFRA request, but the page name translation is not 1-to-1, so it's probably not worth the effort; the old stuff should age out, and the new stuff will get crawled soon enough. Hmm, but looks like there's a robots.txt which blocks crawlers in https://cwiki.apache.org/confluence Instead, from looking at other projects such as Avro, I guess the crawlers are supposed to hit the generated HTML under https://cwiki.apache.org/Hive But the HTML pages only seem to get regenerated on edit, so most of them aren't there post-import; let's see if a cron job kicks in. Also, the CSS is missing border padding, so we'll need to fix that. JVS
wiki has moved!
Hey there, With some wiki migration magic from Brock Noland (assisted by Gavin from INFRA), we've moved all of the content from MoinMoin to Confluence. The new location is here: https://cwiki.apache.org/confluence/display/Hive All of the MoinMoin pages have been deleted; this is to make sure people don't accidentally keep editing there. I left behind some forwarding info. We need your help (or at least tolerance) to deal with some of the imperfections in the migration process: https://cwiki.apache.org/confluence/display/Hive/AboutThisWiki If you were already an editor on the old wiki, or if you would like to help with fixing/editing now, contact me for write access to the new one. If you turn out to be a spammer, I will hunt you down, disembowel you, and feed your entrails to my dog. JVS
Re: Getting a weird error when using the ngrams function
Hmmm, I think this might be a bug which is only exposed when one of the mappers gets zero rows of input. If you have a Hive build, can you try adding this before line 238 of GenericUDAFnGrams.java?

if (n == 0) { return; }

Just before this line:

if (myagg.n > 0 && n > 0 && myagg.n != n) {

If that fixes it, create a new JIRA issue so we can get a fix committed. JVS On Jun 20, 2011, at 8:06 AM, Matthew Rathbone wrote: Hoping someone with more expertise could help on this: I have no idea what's causing this to happen, but here is the exception:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {key:{},value:{_col0:[0,0,0,0]},alias:0}
 at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:268)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:415)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {key:{},value:{_col0:[0,0,0,0]},alias:0}
 at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:256)
 ... 3 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: GenericUDAFnGramEvaluator: mismatch in value for 'n', which usually is caused by a non-constant expression. Found '0' and '1'.
 at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFnGrams$GenericUDAFnGramEvaluator.merge(GenericUDAFnGrams.java:239)
 at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:142)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:592)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:816)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:716)
 at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:470)
 at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:247)

What does 'mismatch in value for n' mean? My query is super simple: select ngrams(sentences(text), 1, 50) from messages -- Matthew Rathbone Foursquare | Software Engineer | Server Engineering Team matt...@foursquare.com | @rathboma | 4sq
Travel Assistance applications now open for ApacheCon NA 2011
The Apache Software Foundation (ASF)'s Travel Assistance Committee (TAC) is now accepting applications for ApacheCon North America 2011, 7-11 November in Vancouver BC, Canada. The TAC is seeking individuals from the Apache community at-large --users, developers, educators, students, Committers, and Members-- who would like to attend ApacheCon, but need some financial support in order to be able to get there. There are limited places available, and all applicants will be scored on their individual merit. Financial assistance is available to cover flights/trains, accommodation and entrance fees either in part or in full, depending on circumstances. However, the support available for those attending only the BarCamp (7-8 November) is less than that for those attending the entire event (Conference + BarCamp 7-11 November). The Travel Assistance Committee aims to support all official ASF events, including cross-project activities; as such, it may be prudent for those in Asia and Europe to wait for an event geographically closer to them. More information can be found at http://www.apache.org/travel/index.html including a link to the online application and detailed instructions for submitting. Applications will close on 8 July 2011 at 22:00 BST (UTC/GMT +1). We wish good luck to all those who will apply, and thank you in advance for tweeting, blogging, and otherwise spreading the word. Regards, The Travel Assistance Committee
wiki spam
Apparently our roadmap includes hard drive recovery, wedding reception flowers, and developing muscles. http://wiki.apache.org/hadoop/Hive/Roadmap Anyone want to take a crack at migrating the Hive wiki content from Hadoop's MoinMoin over to the Hive-specific Confluence space we have set up? In Confluence, it's possible to restrict editing to specific groups. And the markup is a lot better, and attachments can be used, and ... This would be a great contribution by anyone interested in helping Hive out, so if you're interested, let me know. JVS
Re: hadoop, hive and hbase problem
Try one of these suggestions: (1) run HBase and Hive in separate clusters (the downside is that map/reduce tasks will have to issue remote requests to region servers, whereas normally they could run on the same nodes) (2) debug the shim exception and see if you can contribute a patch that makes Hive compatible with that Hadoop version JVS On May 9, 2011, at 11:17 AM, labtrax wrote: Hello, it seems that hive 0.6 and 0.7 are incompatible with the hadoop-append jar from hbase 0.90.2. But without the append jar you cannot use hbase in production... Any advice for the hadoop/hbase/hive version jungle? I already asked this last month but I didn't get a reasonable answer. Cheers labtrax Hello, I have a hadoop cluster running with the hadoop-append jar (hadoop-core-0.20-append-r1056497-core.jar) for hbase reasons. I tried hive 0.6.0 and 0.7.0, and for both, when I start it I get

Exception in thread "main" java.lang.RuntimeException: Could not load shims in class null
 at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:90)
 at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:66)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:249)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.NullPointerException
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:169)
 at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:87)

My hive-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://name-node:54310/hive</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>

My hadoop cluster is working properly, and the hive dir is already created. Any clues? labtrax
Re: index is not working for me.
Automatic usage of indexes is still under development (HIVE-1644). JVS On Apr 15, 2011, at 1:31 AM, Erix Yao wrote: hi, I installed the hive-0.7 release for the index feature. Here's my test table schema: create table testforindex (id bigint, type int) row format delimited fields terminated by ',' lines terminated by '\n';; create index type_idx on table testforindex (type) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD; I loaded 10 data for index test: hive select count(*) from testforindex; OK 10 Time taken: 22.247 seconds Here's the test select for index: hive select count(*) from testforindex where type=1; OK 1000 Time taken: 20.279 seconds But in the jobtracker I see : Counter Map Reduce Total Map input records 100,000 0 100,000 The hive still use the full table scan for the result. Is there anybody that can tell me what's wrong in my test? -- haitao.yao@Beijing
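Until that lands, the index has to be exercised by hand. A rough sketch of what that looks like with the compact index handler is below; the generated index table name follows the default__<table>_<index>__ convention, but treat the exact name and columns as assumptions to verify with SHOW TABLES and DESCRIBE:

-- materialize the index data after loading the base table
ALTER INDEX type_idx ON testforindex REBUILD;
-- the handler stores the index in an ordinary table that can be inspected directly
DESCRIBE default__testforindex_type_idx__;
-- each row maps an indexed value to the base table's files and offsets
SELECT * FROM default__testforindex_type_idx__ WHERE type = 1 LIMIT 10;

Until HIVE-1644 is in, the optimizer will not rewrite a query like select count(*) from testforindex where type=1 to consult that table automatically, which is why the job still scans all 100,000 input records.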
Re: UDF constructor params?
https://issues.apache.org/jira/browse/HIVE-1016 https://issues.apache.org/jira/browse/HIVE-1360 JVS On Apr 5, 2011, at 11:20 AM, Larry Ogrodnek wrote: For some UDFs I'm working on now it feels like it would be handy to be able to pass in parameters during construction. It's an integration with an external reporting API... e.g. -- include last 30 days from april 4th create temporary function orders_last_month as 'com.example.OrderSearch(20110404, 30)' -- get orders for customer 11 select order_last_month(11), ... Obviously I can perform the same logic passing everything into the UDF: select orders_last_month(20110404, 30, 11), ... but this doesn't feel as nice.. additionally, having the information available in the constructor might give the UDF more information on how to perform caching, allow it to do more complex initialization, etc. Just wondering if this has ever been thought about, discussed, or needed by anyone else thanks, larry
Re: Performance between Hive queries vs. Hive over HBase queries
There's one here specifically for the Hive portion, but really a full-stack system profile is needed for deciding where to attack it: https://issues.apache.org/jira/browse/HIVE-1231 I don't know of anyone currently working in this area. JVS On Mar 8, 2011, at 9:51 PM, Otis Gospodnetic wrote: Hi, John, are there plans or specific JIRA issues related to this particular performance hit that you or somebody else is working on and that those of us interested in performance improvements when Hive points to external tables in HBase should watch? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: John Sichi jsi...@fb.com To: user@hive.apache.org user@hive.apache.org Sent: Tue, March 8, 2011 1:17:51 AM Subject: Re: Performance between Hive queries vs. Hive over HBase queries For native tables, Hive reads rows directly from HDFS. For HBase tables, it has to go through the HBase region servers, which reconstruct rows from column families (combining cache + HDFS). HBase makes it possible to keep your table up to date in real time, but you have to pay an overhead cost at query time. On the other hand, with native Hive tables, there's latency in loading new batches of data. JVS On Mar 7, 2011, at 10:13 PM, Biju Kaimal wrote: Hi, Could you please explain the reason for the behavior? Regards, Biju On Tue, Mar 8, 2011 at 11:35 AM, John Sichi jsi...@fb.com wrote: Yes. JVS On Mar 7, 2011, at 9:59 PM, Biju Kaimal wrote: Hi, I loaded a data set which has 1 million rows into both Hive and HBase tables. For the HBase table, I created a corresponding Hive table so that the data in HBase can be queried from Hive QL. Both tables have a key column and a value column For the same query (select value, count(*) from table group by value), the Hive only query runs much faster (~ 30 seconds) as compared to Hive over HBase (~ 150 seconds). Is this expected? Regards, Biju
Re: Performance between Hive queries vs. Hive over HBase queries
Factor of 5 closely matches the results I got when I was testing. JVS On Mar 9, 2011, at 1:23 PM, Otis Gospodnetic wrote: Hi, Biju's example shows a factor of 5 decrease in performance when Hive points to HBase tables. Does anyone know how much this factor varies? Is if often closer to 1 or is is more often close to 10? Just trying to get a better feel for this... Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: John Sichi jsi...@fb.com To: user@hive.apache.org user@hive.apache.org Sent: Tue, March 8, 2011 1:05:34 AM Subject: Re: Performance between Hive queries vs. Hive over HBase queries Yes. JVS On Mar 7, 2011, at 9:59 PM, Biju Kaimal wrote: Hi, I loaded a data set which has 1 million rows into both Hive and HBase tables. For the HBase table, I created a corresponding Hive table so that the data in HBase can be queried from Hive QL. Both tables have a key column and a value column For the same query (select value, count(*) from table group by value), the Hive only query runs much faster (~ 30 seconds) as compared to Hive over HBase (~ 150 seconds). Is this expected? Regards, Biju
Re: Performance between Hive queries vs. Hive over HBase queries
Yes. JVS On Mar 7, 2011, at 9:59 PM, Biju Kaimal wrote: Hi, I loaded a data set which has 1 million rows into both Hive and HBase tables. For the HBase table, I created a corresponding Hive table so that the data in HBase can be queried from Hive QL. Both tables have a key column and a value column For the same query (select value, count(*) from table group by value), the Hive only query runs much faster (~ 30 seconds) as compared to Hive over HBase (~ 150 seconds). Is this expected? Regards, Biju
Re: Performance between Hive queries vs. Hive over HBase queries
For native tables, Hive reads rows directly from HDFS. For HBase tables, it has to go through the HBase region servers, which reconstruct rows from column families (combining cache + HDFS). HBase makes it possible to keep your table up to date in real time, but you have to pay an overhead cost at query time. On the other hand, with native Hive tables, there's latency in loading new batches of data. JVS On Mar 7, 2011, at 10:13 PM, Biju Kaimal wrote: Hi, Could you please explain the reason for the behavior? Regards, Biju On Tue, Mar 8, 2011 at 11:35 AM, John Sichi jsi...@fb.com wrote: Yes. JVS On Mar 7, 2011, at 9:59 PM, Biju Kaimal wrote: Hi, I loaded a data set which has 1 million rows into both Hive and HBase tables. For the HBase table, I created a corresponding Hive table so that the data in HBase can be queried from Hive QL. Both tables have a key column and a value column For the same query (select value, count(*) from table group by value), the Hive only query runs much faster (~ 30 seconds) as compared to Hive over HBase (~ 150 seconds). Is this expected? Regards, Biju
partitioned views
One of the impediments for uptake of the CREATE VIEW feature in Hive has been the lack of partition awareness. This made it non-transparent to replace a table with a view, e.g. for renaming purposes. To address this as well as some other use cases, I'm proposing the first steps towards view partition support: http://wiki.apache.org/hadoop/Hive/PartitionedViews This solution is still primitive, but should make using views at least possible in a number of cases, with a bit of extra DDL/ETL effort. Feedback welcome. JVS
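As a reminder of the renaming use case mentioned above, the non-partitioned form already works; a minimal sketch with illustrative table names:

-- expose an existing table under a new name without copying data
CREATE VIEW page_views AS
SELECT * FROM page_views_legacy;

What is missing, and what the proposal starts to address, is that such a view carries no partition metadata, so anything that expects partitions on the original table (SHOW PARTITIONS, partition-oriented ETL) cannot treat the view as a drop-in replacement.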
Re: [VOTE] Sponsoring Howl as an Apache Incubator project
But Howl does layer on some additional code, right? https://github.com/yahoo/howl/tree/howl/howl JVS On Feb 3, 2011, at 1:49 PM, Ashutosh Chauhan wrote: There are none as of today. In the past, whenever we had to have changes, we do it in a separate branch in Howl and once those get committed to hive repo, we pull it over in our trunk and drop the branch. Ashutosh On Thu, Feb 3, 2011 at 13:41, yongqiang he heyongqiang...@gmail.com wrote: I am interested in some numbers around the lines of code changes (or files of changes) which are in Howl but not in Hive? Can anyone give some information here? Thanks Yongqiang On Thu, Feb 3, 2011 at 1:15 PM, Jeff Hammerbacher ham...@cloudera.com wrote: Hey, If we do go ahead with pulling the metastore out of Hive, it might make most sense for Howl to become its own TLP rather than a subproject. Yes, I did not read the proposal closely enough. I think an end state as a TLP makes more sense for Howl than as a Pig subproject. I'd really love to see Howl replace the metastore in Hive and it would be more natural to do so as a TLP than as a Pig subproject--especially since the current Howl repository is literally a fork of Hive. In the incubator proposal, we have mentioned these issues, but we've attempted to avoid prejudicing any decision. Instead, we'd like to assess the pros and cons (including effort required and impact expected) for both approaches as part of the incubation process. Glad the issues are being considered. Later, Jeff
Re: [VOTE] Sponsoring Howl as an Apache Incubator project
I forgot about the serde dependencies...can you add those to the Initial Source note in [[HowlProposal]] just for completeness? JVS On Feb 3, 2011, at 3:11 PM, Alan Gates wrote: Yes, it adds Input and Output formats for MapReduce and load and store functions for Pig. In the future it we expect it will continue to add more additional layers. Alan. On Feb 3, 2011, at 2:49 PM, John Sichi wrote: But Howl does layer on some additional code, right? https://github.com/yahoo/howl/tree/howl/howl JVS On Feb 3, 2011, at 1:49 PM, Ashutosh Chauhan wrote: There are none as of today. In the past, whenever we had to have changes, we do it in a separate branch in Howl and once those get committed to hive repo, we pull it over in our trunk and drop the branch. Ashutosh On Thu, Feb 3, 2011 at 13:41, yongqiang he heyongqiang...@gmail.com wrote: I am interested in some numbers around the lines of code changes (or files of changes) which are in Howl but not in Hive? Can anyone give some information here? Thanks Yongqiang On Thu, Feb 3, 2011 at 1:15 PM, Jeff Hammerbacher ham...@cloudera.com wrote: Hey, If we do go ahead with pulling the metastore out of Hive, it might make most sense for Howl to become its own TLP rather than a subproject. Yes, I did not read the proposal closely enough. I think an end state as a TLP makes more sense for Howl than as a Pig subproject. I'd really love to see Howl replace the metastore in Hive and it would be more natural to do so as a TLP than as a Pig subproject--especially since the current Howl repository is literally a fork of Hive. In the incubator proposal, we have mentioned these issues, but we've attempted to avoid prejudicing any decision. Instead, we'd like to assess the pros and cons (including effort required and impact expected) for both approaches as part of the incubation process. Glad the issues are being considered. Later, Jeff
Re: [VOTE] Sponsoring Howl as an Apache Incubator project
On Feb 3, 2011, at 5:09 PM, Alan Gates wrote: Are you referring to the serde jar or any particular serde's we are making use of? Both (see below). JVS [jsichi@dev1066 ~/open/howl/howl/howl/src/java/org/apache/hadoop/hive/howl] ls cli/ common/ data/ mapreduce/ pig/ rcfile/ [jsichi@dev1066 ~/open/howl/howl/howl/src/java/org/apache/hadoop/hive/howl] grep serde */* common/HowlUtil.java:import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo; common/HowlUtil.java:import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde.Constants; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.ColumnProjectionUtils; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.SerDe; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.SerDeException; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.columnar.ColumnarStruct; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.StructField; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector; rcfile/RCFileInputDriver.java: private SerDe serde; rcfile/RCFileInputDriver.java: struct = (ColumnarStruct)serde.deserialize(bytesRefArray); rcfile/RCFileInputDriver.java: serde = new ColumnarSerDe(); rcfile/RCFileInputDriver.java: serde.initialize(context.getConfiguration(), howlProperties); rcfile/RCFileInputDriver.java: oi = (StructObjectInspector) serde.getObjectInspector(); rcfile/RCFileMapReduceInputFormat.java:import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable; rcfile/RCFileMapReduceOutputFormat.java:import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable; rcfile/RCFileMapReduceRecordReader.java:import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde.Constants; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.SerDe; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.SerDeException; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo; 
rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.MapTypeInfo; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils; rcfile/RCFileOutputDriver.java: /** The serde for serializing the HowlRecord to bytes writable */ rcfile/RCFileOutputDriver.java: private SerDe serde; rcfile/RCFileOutputDriver.java: return serde.serialize(value.getAll(), objectInspector); rcfile/RCFileOutputDriver.java: serde = new ColumnarSerDe(); rcfile/RCFileOutputDriver.java: serde.initialize(context.getConfiguration(), howlProperties); Howl, howl, howl, howl! O! you are men of stones: Had I your tongues and eyes, I'd use them so That heaven's vaults should crack
Re: [VOTE] Sponsoring Howl as an Apache Incubator project
Got it, thanks for the correction. JVS On Feb 3, 2011, at 4:56 PM, Alex Boisvert wrote: Hi John, Just to clarify where I was going with my line of questioning. There's no Apache policy that prevents dependencies on an incubator project, whether it's releases, snapshots, or even home-made hacked-together packaging of an incubator project. It's been done before, and as long as the incubator code's IP has been cleared and the packaging isn't represented as an official release when it isn't one, there's nothing wrong with doing that. Now, whether the project chooses to use and release with an incubator dependency is a matter of judgment (and ultimately a vote by committers if there is no consensus). I just wanted to make sure there were no incorrect assumptions made. alex On Thu, Feb 3, 2011 at 4:07 PM, John Sichi jsi...@fb.com wrote: I was going off of what I read in HADOOP-3676 (which lacks a reference as well). But I guess if a release can be made from the incubator, then it's not a blocker. JVS On Feb 3, 2011, at 3:29 PM, Alex Boisvert wrote: On Thu, Feb 3, 2011 at 11:38 AM, John Sichi jsi...@fb.com wrote: Besides the fact that the refactoring required is significant, I don't think this is possible to do quickly since: 1) Hive (unlike Pig) requires a metastore 2) Hive releases can't depend on an incubator project I'm not sure what you mean by can't depend on an incubator project here. AFAIK, there is no policy at Apache that projects should not depend on incubator projects. Can you clarify what you mean and why you think such a restriction exists? alex
Re: Hive/Hbase Integration Error
Here is what you need to do: 1) Use svn to check out the source for Hive 0.6 2) In your checkout, replace the HBase 0.20.3 jars with the ones from 0.20.6 3) Build Hive 0.6 from source 4) Use your new Hive build JVS On Jan 6, 2011, at 2:34 AM, Adarsh Sharma wrote: Dear all, I am sorry I am posting this message again but I can't able to locate the root cause after googled a lot. I am trying Hive/Hbase Integration from the past 2 days. I am facing the below issue while creating external table in Hive. I am using hadoop-0.20.2, hbase-0.20.6, hive-0.6.0 ( Mysql as metstore ) and java-1.6.0_20. Hbase-0.20.3 is also checked. Problem arises when I issue the below command : hive CREATE TABLE hive_hbasetable_k(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES (hbase.columns.mapping = :key,cf1:val) TBLPROPERTIES (hbase.table.name = hivehbasek); FAILED: Error in metadata: MetaException(message:org.apache.hadoop.hbase.MasterNotRunningException at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getMaster(HConnectionManager.java:374) at org.apache.hadoop.hbase.client.HBaseAdmin.init(HBaseAdmin.java:72) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.getHBaseAdmin(HBaseStorageHandler.java:64) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.preCreateTable(HBaseStorageHandler.java:159) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:275) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:394) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:2126) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:166) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:107) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:55) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:633) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:506) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:384) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:138) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:302) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask It seems my HMaster is not Running but I checked from IP:60010 that it is running and I am able to create,insert tables in Hbase Properly. Below is the contents of my hive.log : 2011-01-05 15:19:36,783 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it cannot be resolved. 2011-01-05 15:19:36,783 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it cannot be resolved. 2011-01-05 15:19:36,785 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it cannot be resolved. 2011-01-05 15:19:36,785 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it cannot be resolved. 
2011-01-05 15:19:36,786 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be resolved. 2011-01-05 15:19:36,786 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be resolved. 2011-01-05 15:20:12,185 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(967)) - Exception closing session 0x0 to sun.nio.ch.selectionkeyi...@561279c8 java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:933) 2011-01-05 15:20:12,188 WARN zookeeper.ClientCnxn (ClientCnxn.java:cleanup(1001)) - Ignoring exception during shutdown input java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638) at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999) at
Re: Error in metadata: javax.jdo.JDOFatalDataStoreException
Since the exception below is from JDO, it has to do with the configuration of Hive's metastore (not HBase/Zookeeper). JVS On Jan 5, 2011, at 2:14 AM, Adarsh Sharma wrote: Dear all, I am trying Hive/Hbase Integration from the past 2 days. I am facing the below issue while creating external table in Hive. *Command-Line Error :- *had...@s2-ratw-1:~/project/hive-0.6.0/build/dist$ bin/hive --auxpath /home/hadoop/project/hive-0.6.0/build/dist/lib/hive_hbase-handler.jar,/home/hadoop/project/hive-0.6.0/build/dist/lib/hbase-0.20.3.jar,/home/hadoop/project/hive-0.6.0/build/dist/lib/zookeeper-3.2.2.jar -hiveconf hbase.zookeeper.quorum=192.168.1.103,192.168.1.114,192.168.1.115,192.168.1.104,192.168.1.107 Hive history file=/tmp/hadoop/hive_job_log_hadoop_201101051527_1728376885.txt hive show tables; FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server. NestedThrowables: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server. FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask hive exit; had...@s2-ratw-1:~/project/hive-0.6.0/build/dist$ *My hive.log file says :* 2011-01-05 15:19:36,783 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it cannot be resolved. 2011-01-05 15:19:36,783 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it cannot be resolved. 2011-01-05 15:19:36,785 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it cannot be resolved. 2011-01-05 15:19:36,785 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it cannot be resolved. 2011-01-05 15:19:36,786 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be resolved. 2011-01-05 15:19:36,786 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be resolved. 
2011-01-05 15:20:12,185 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(967)) - Exception closing session 0x0 to sun.nio.ch.selectionkeyi...@561279c8 java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:933) 2011-01-05 15:20:12,188 WARN zookeeper.ClientCnxn (ClientCnxn.java:cleanup(1001)) - Ignoring exception during shutdown input java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638) at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970) 2011-01-05 15:20:12,188 WARN zookeeper.ClientCnxn (ClientCnxn.java:cleanup(1006)) - Ignoring exception during shutdown output java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:649) at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970) 2011-01-05 15:20:12,621 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(967)) - Exception closing session 0x0 to sun.nio.ch.selectionkeyi...@799dbc3b I overcomed from the previous issue of MasterNotRunning Exception which occured due to incompatibilities in hive_hbase jars. Now I'm using Hadoop-0.20.2, Hive-0.6.0 ( Bydefault Derby metastore ) and Hbase-0.20.3. Please tell how this could be resolved. Also I want to add one more thing that my hadoop Cluster is of 9 nodes and 8 nodes act as Datanodes,Tasktrackers and Regionservers. Among these nodes is set zookeeper.quorum.property to have 5 Datanodes. Would this is the issue. I don't know the number of servers needed for Zookeeper in fully distributed mode. Best Regards Adarsh Sharma
Re: Question about View in Hive 0.6
It runs the same as a nested select. Currently, since Hive doesn't do any relational common subexpression elimination, it will be executed twice. In the example below, this can be a good thing, since cond1 and cond2 can be pushed down separately. JVS On Dec 28, 2010, at 12:18 AM, Neil Xu wrote: in hive 0.6, we can create view for tables, views are actually run as a subquery when querying data, is it optimized in hive that a view is executed only once in a single query? Thanks in advance! such as: select * from ( select x,y from view1 where cond1 union all select x,y from view1 where cond2 ) Neil,
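As a concrete illustration of what that means for the query in the question (the table and partition names in the view body are placeholders), each reference to the view expands into its own nested select:

CREATE VIEW view1 AS
SELECT x, y FROM base_table WHERE dt = '2010-12-01';

SELECT * FROM (
  SELECT x, y FROM view1 WHERE cond1
  UNION ALL
  SELECT x, y FROM view1 WHERE cond2
) t;

-- planned roughly as if the view text had been inlined twice:
-- SELECT x, y FROM (SELECT x, y FROM base_table WHERE dt = '2010-12-01') v WHERE cond1
-- UNION ALL
-- SELECT x, y FROM (SELECT x, y FROM base_table WHERE dt = '2010-12-01') v WHERE cond2

The upside, as noted above, is that cond1 and cond2 can each be pushed down into their own branch; the downside is that the view's work is done once per reference, since there is no common subexpression elimination.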
Re: Altering / querying default column names (_c1, _c2, etc)
Enclose them in backticks. alter table fb_images1 change `_c5` ref_array array<string>; JVS On Dec 20, 2010, at 3:23 PM, Leo Alekseyev wrote: Often I forget to name a column that results from running an aggregation. Then, I'm stuck: describe table lists those columns by their default names, i.e. something like _c1, but I can't seem to query or rename those columns: alter table fb_images1 change _c5 ref_array array<string>; FAILED: Parse Error: line 1:30 mismatched input '_c5' expecting Identifier in rename column name Is there a resolution for this? One workaround would be to create a new table and load the data into it, but it seems inelegant to say the least. --Leo
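The full round trip looks roughly like this (a sketch only; the table, column names, and types below are hypothetical stand-ins rather than the actual schema from the question):

    -- An aggregation column without an alias gets a default name such as _c1.
    CREATE TABLE image_counts AS
    SELECT owner, count(1) FROM fb_images1 GROUP BY owner;

    DESCRIBE image_counts;   -- shows: owner string, _c1 bigint

    -- The default name starts with an underscore, so it is not a plain
    -- identifier and must be quoted with backticks when renaming it.
    ALTER TABLE image_counts CHANGE `_c1` image_count BIGINT;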
Re: Hive HBase integration scan failing
It's supposed to happen automatically. The JIRA issue below mentions one case where it wasn't, and explains how I detected it and worked around it. To make sure you're getting locality, look at the task tracker and make sure that for your map tasks, the host used for executing the task matches the input split location. JVS On Dec 10, 2010, at 10:10 AM, vlisovsky wrote: Thanks for the info. Moreover, how can we make sure that our regionservers are running on the same Datanodes (locality)? Is there a way we can check? On Thu, Dec 9, 2010 at 11:09 PM, John Sichi jsi...@fb.com wrote: Try set hbase.client.scanner.caching=5000; Also, check to make sure that you are getting the expected locality so that mappers are running on the same nodes as the region servers they are scanning (assuming that you are running HBase and mapreduce on the same cluster). When I was testing this, I encountered this problem (but it may have been specific to our cluster configurations): https://issues.apache.org/jira/browse/HBASE-2535 JVS On Dec 9, 2010, at 10:46 PM, vlisovsky wrote: Hi Guys, Wonder if anybody could shed some light on how to reduce the load on the HBase cluster when running a full scan. The need is to dump everything I have in HBase into a Hive table. The HBase data size is around 500g. The job creates 9000 mappers; after about 1000 maps, things go south every time. If I run the insert below, it runs for about 30 minutes, then starts bringing down the HBase cluster, after which region servers need to be restarted. Wonder if there is a way to throttle it somehow, or otherwise if there is any other method of getting structured data out? Any help is appreciated, Thanks, -Vitaly create external table hbase_linked_table ( mykey string, info map<string, string> ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:") TBLPROPERTIES ("hbase.table.name" = "hbase_table2"); set hive.exec.compress.output=true; set io.seqfile.compression.type=BLOCK; set mapred.output.compression.type=BLOCK; set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; set mapred.reduce.tasks=40; set mapred.map.tasks=25; INSERT overwrite table tmp_hive_destination select * from hbase_linked_table;
Re: Hive HBase integration scan failing
Try set hbase.client.scanner.caching=5000; Also, check to make sure that you are getting the expected locality so that mappers are running on the same nodes as the region servers they are scanning (assuming that you are running HBase and mapreduce on the same cluster). When I was testing this, I encountered this problem (but it may have been specific to our cluster configurations): https://issues.apache.org/jira/browse/HBASE-2535 JVS On Dec 9, 2010, at 10:46 PM, vlisovsky wrote: Hi Guys, Wonder if anybody could shed some light on how to reduce the load on the HBase cluster when running a full scan. The need is to dump everything I have in HBase into a Hive table. The HBase data size is around 500g. The job creates 9000 mappers; after about 1000 maps, things go south every time. If I run the insert below, it runs for about 30 minutes, then starts bringing down the HBase cluster, after which region servers need to be restarted. Wonder if there is a way to throttle it somehow, or otherwise if there is any other method of getting structured data out? Any help is appreciated, Thanks, -Vitaly create external table hbase_linked_table ( mykey string, info map<string, string> ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:") TBLPROPERTIES ("hbase.table.name" = "hbase_table2"); set hive.exec.compress.output=true; set io.seqfile.compression.type=BLOCK; set mapred.output.compression.type=BLOCK; set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; set mapred.reduce.tasks=40; set mapred.map.tasks=25; INSERT overwrite table tmp_hive_destination select * from hbase_linked_table;
Re: Hive/HBase integration issue.
As noted here, when writing to HBase, existing rows are overwritten, but old rows are not deleted. http://wiki.apache.org/hadoop/Hive/HBaseIntegration#Overwrite There is not yet any deletion support. JVS On Nov 18, 2010, at 1:00 AM, afancy wrote: Hi, Does the INSERT clause have to include OVERWRITE, which means that the new data will overwrite the previous data? How can a true (append-style) INSERT be done instead of OVERWRITE? BTW: how can the DELETE operator be implemented? thanks afancy --- hive> insert OVERWRITE table pagedim select 0, url, strToint('2'), 'domain', 'serversion' from downloadlog; Total MapReduce jobs = 2 Launching Job 1 out of 2 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201011121525_0006, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201011121525_0006 Kill Command = /home/xiliu/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201011121525_0006 2010-11-18 09:55:52,155 Stage-1 map = 0%, reduce = 0% 2010-11-18 09:55:55,169 Stage-1 map = 100%, reduce = 0% 2010-11-18 09:55:58,200 Stage-1 map = 100%, reduce = 100% Ended Job = job_201011121525_0006 Ended Job = 487027960, job is filtered out (removed at runtime). Launching Job 2 out of 2 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201011121525_0007, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201011121525_0007 Kill Command = /home/xiliu/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201011121525_0007 2010-11-18 09:56:04,701 Stage-2 map = 0%, reduce = 0% 2010-11-18 09:56:07,723 Stage-2 map = 100%, reduce = 0% 2010-11-18 09:56:10,751 Stage-2 map = 100%, reduce = 100% Ended Job = job_201011121525_0007 Loading data to table pagedim 1000 Rows loaded to pagedim OK Time taken: 23.194 seconds hive> insert table pagedim select 0, url, strToint('2'), 'domain', 'serversion' from downloadlog; FAILED: Parse Error: line 1:7 mismatched input 'table' expecting OVERWRITE in insert clause
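To make the overwrite-without-delete behavior concrete, here is a hedged sketch reusing the table names from the session above; the pageid key column and the URL filter are assumptions, since the real schema is not shown:

    -- First load: one HBase row is written (upserted) per key in the result.
    insert OVERWRITE table pagedim
    select pageid, url, strToint('2'), 'domain', 'serversion' from downloadlog;

    -- A later OVERWRITE with a narrower source only rewrites the keys it
    -- produces; keys written by the first load but absent here are NOT
    -- deleted, because the HBase storage handler has no deletion support.
    insert OVERWRITE table pagedim
    select pageid, url, strToint('2'), 'domain', 'serversion' from downloadlog
    where url like 'http://example.org/%';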
Re: HBase as input AND output?
If your query only accesses HBase tables, then yes, Hive does not access any source data directly from HDFS (although of course it may put intermediate results in HDFS, e.g. for the result of a join). However, if your query does something like joining an HBase table with a native Hive table, then it will read data from both HBase and HDFS. Likewise, on the write side, it depends on whether your INSERT targets an HBase table or a native Hive table. The read and write sides are independent. JVS On Oct 13, 2010, at 2:24 PM, Otis Gospodnetic wrote: Thanks Tim. (and sorry for the duplicate email - need to fix my Hive email filter) Just to clarify one bit, though. When using Hive without HBase, one has data stored in the appropriate directories on HDFS and runs MR jobs against those data. But when using Hive *with* HBase, does Hive require any such data to be present in HDFS? In other words, when using Hive with HBase, one really uses only Hive's ability to translate a Hive QL statement into a set of MR jobs (reading from/writing to HBase) and execute them against only the data stored in HBase. Is this correct? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Message From: Tim Robertson timrobertson...@gmail.com To: user@hive.apache.org Sent: Wed, October 13, 2010 4:45:31 PM Subject: Re: HBase as input AND output? That's right. Hive can use an HBase table as an input to the Hive query regardless of the output format, and can also write the output to an HBase table regardless of the input format. You can also supposedly do a join in Hive that uses one side of the join from an HBase table and the other side from a text file, which is very powerful. I haven't done it myself, but intend to shortly. HTH, Tim On Wed, Oct 13, 2010 at 10:07 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, I was wondering how I can query data stored in HBase and remembered Hive's HBase integration: http://wiki.apache.org/hadoop/Hive/HBaseIntegration After watching John Sichi's video (http://developer.yahoo.com/blogs/hadoop/posts/2010/04/hundreds_of_hadoop_fans_at_the/ ) I have a better idea about what functionality this integration provides, but I still have some questions. Would it be correct to say that Hive-HBase integration makes the following data flows possible: 0) Hive or Files => Custom HQL statement that aggregates data => HBase 1) HBase => Custom HQL statement that aggregates data => HBase 2) HBase => Custom HQL statement that aggregates data => output (console?) Of the above, 1) is what I'm wondering about the most right now. In other words, it seems to me that Hive may be able to look at *just* data stored in HBase, *without* the typical data/files in HDFS that Hive normally runs its MR jobs against. Is this correct? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/
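As a hedged sketch of the mix-and-match case described above (every table name, column, and mapping below is a hypothetical example, not something from this thread): the read side joins an HBase-backed table with a native Hive table, while the write side targets a second HBase-backed table.

    -- HBase-backed input table.
    CREATE TABLE hbase_users (userid STRING, country STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:country");

    -- Ordinary native Hive table stored in HDFS.
    CREATE TABLE native_events (userid STRING, event STRING);

    -- HBase-backed output table.
    CREATE TABLE hbase_summary (userid STRING, event_count BIGINT)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,stats:event_count");

    -- The read side mixes HBase and HDFS data; the write side goes to HBase.
    INSERT OVERWRITE TABLE hbase_summary
    SELECT u.userid, count(1)
    FROM hbase_users u JOIN native_events e ON (u.userid = e.userid)
    GROUP BY u.userid;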