[ANNOUNCE] New Hive Committer - Navis Ryu
The Apache Hive PMC has passed a vote to make Navis Ryu a new committer on the project. JIRA is currently down, so I can't send out a link with his contribution list at the moment, but if you have an account at reviews.facebook.net, you can see his activity here: https://reviews.facebook.net/p/navis/ Navis, please submit your CLA to the Apache Software Foundation as described here: http://www.apache.org/licenses/#clas Congratulations! JVS
parallel execution for Hive unit tests
Hey all, Marek Sapota has put together a doc on the new scripts for spreading Hive unit test execution across a cluster: https://cwiki.apache.org/confluence/display/Hive/Unit+Test+Parallel+Execution Whether you are a committer or someone contributing patches, if you are currently frustrated by waiting for Hive tests to complete for each patch, please set this up and give it a try. JVS
Re: Hive UDFs/ FunctionRegistry etc
On Dec 8, 2011, at 12:20 PM, Sam William wrote: I have a bunch of custom UDFs and I'd like others in the company to make use of them in an easy way. I'm not very happy with the 'CREATE TEMPORARY FUNCTION' arrangement for each session. It'd be great if our site-specific functions worked the same way as the built-in functions. What options do I have other than modifying FunctionRegistry and recompiling? Hey Sam, At Facebook, we use a standard Hive init script to preload the temporary functions, using the CLI's -i option: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli You could also take a look at the work I recently did on loading builtins from a separate jar using Hive's PDK. If you want to work on an enhancement, it shouldn't be too hard to add a configuration option to preload additional jars and their functions using the same technique: https://issues.apache.org/jira/browse/HIVE-2523 A real SQL/J implementation of persistent functions in the metastore would be preferable, but that's a lot of work. JVS
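For anyone who wants to try the init-script approach, a minimal sketch is below. The jar path, function names, and class names are illustrative placeholders, not anything that ships with Hive:

-- shared init file, e.g. /etc/hive/site-functions.hql (hypothetical path)
ADD JAR /usr/local/lib/site-udfs.jar;
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.udf.NormalizeUrl';
CREATE TEMPORARY FUNCTION order_bucket AS 'com.example.hive.udf.OrderBucket';

Each user then launches the CLI with hive -i /etc/hive/site-functions.hql (typically via a wrapper script), so the functions are registered at the start of every session without touching FunctionRegistry.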
meeting minutes for 5-Dec-2011 contributor meeting
https://cwiki.apache.org/confluence/display/Hive/ContributorMinutes20111205 I created an INFRA ticket to take Hive out of Review Board: https://issues.apache.org/jira/browse/INFRA-4200 Please use Phabricator for all new review requests: https://cwiki.apache.org/confluence/display/Hive/PhabricatorCodeReview I'm updating the wiki accordingly. JVS
Re: Hive-Hbase integration require Hbase in Pseudo distributed??
As you can guess from the 0.89 dependency, there has been a lot of water under the bridge since this integration was developed. If someone would like to take on bringing it up to date, that would be great. Note that auxpath is to make the jars available in map/reduce task VM's (we don't put everything from lib there automatically). JVS On Dec 2, 2011, at 10:39 AM, jcfol...@pureperfect.com wrote: I am having the same issue. Hive won't connect to HBase and throws org.apache.hadoop.hbase.MasterNotRunningException despite the fact that the master is up and running. It may only work if HBase is in distributed mode or psuedo-distributed mode. I know HBase doesn't put files into HDFS otherwise. It certainly doesn't work for me running in standalone mode. I've tried about thirty different combinations of hive/hbase and can't get it going on any of them, so I switched to trying to get pseudo-distributed mode working in HBase, but haven't been able to find the magic combination of versions that will allow HBase to do anything in HDFS other than throw EOFExceptions. In any case, according to the Hive documentation (see below) it doesn't work with any version of HBase other than 0.89, but there are three 0.89 versions of HBase at archive.apache.org and the lib directories for Hive contain 0.89-SNAPSHOT. FYI: There's an official Hive/HBase integration page at the Confluence wiki, but that doesn't work either: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration It contains the instruction: The handler requires Hadoop 0.20 or higher, and has only been tested with dependency versions hadoop-0.20.x, hbase-0.89.0 and zookeeper-3.3.1. If you are not using hbase-0.89.0, you will need to rebuild the handler with the HBase jar matching your version, and change the --auxpath above accordingly. Failure to use matching versions will lead to misleading connection failures such as MasterNotRunningException since the HBase RPC protocol changes often. But that really doesn't make sense. Hive uses Ivy and you can't just simply replace the jar files. Updating the version Ivy fetches from the Apache repository in the second sentence contradicts the version exception in the previous sentence since the only official releases of HBase are 0.90 and forward: https://repository.apache.org/content/repositories/releases/org/apache/hbase/hbase/ I'm trying to get Hive to build against HBase 0.90 but Ivy wants to pull the 0.90 out of snapshots so trying to grab the jar file throws 404s. As a side note: the --auxpath seems unnecessary. The jars are already in the lib directory so it seems like they ought to be on the classpath already. Original Message Subject: Re: Hive-Hbase integration require Hbase in Pseudo distributed?? From: Mohammad Tariq donta...@gmail.com Date: Fri, December 02, 2011 7:28 am To: user@hive.apache.org Anyone there, cud you please confirm if I can use hive-hbase in standalone mode??? will it work? or should i go for Pseudo distributed mode ? Regards, Mohammad Tariq On Fri, Dec 2, 2011 at 5:54 PM, Alok Kumar alok...@gmail.com wrote: hi, yeah i've used $HIVE_HOME/bin/hive --auxpath $HIVE_HOME/lib/hive-hbase-handler-*.jar,$HIVE_HOME/lib/hbase-*.jar,$HIVE_HOME/lib/zookeeper-*.jar -hiveconf hbase.master=localhost:6 Hadoop version : hadoop-0.20.203.0 Hbase version : hbase-0.90.4 Hive version : hive-0.9.0 (built from trunk) on Ubuntu 11.10 --- Regards, Alok On Fri, Dec 2, 2011 at 5:49 PM, Ankit Jain ankitjainc...@gmail.com wrote: Hi, have you used following command to start the hive shell. 
$HIVE_HOME/bin/hive --auxpath $HIVE_HOME/lib/hive-hbase-handler-*.jar,$HIVE_HOME/lib/hbase-*.jar,$HIVE_HOME/lib/zookeeper-*.jar -hiveconf hbase.master=127.0.0.1:6 If not, then use the above command. Regards, Ankit On Fri, Dec 2, 2011 at 5:34 PM, Alok Kumar alok...@gmail.com wrote: Hi,

// Hadoop core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/alokkumar/hadoop/tmp</value>
  </property>
</configuration>

// hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- <value>hdfs://localhost:9000/hbase</value> -->
    <value>file:///home/alokkumar/hbase/</value>
  </property>
</configuration>

With these conf files HBase and Hive are running independently fine..

hbase(main):003:0> status
1 servers, 0 dead, 4. average load

but I'm still getting:

$ hive
hive> CREATE TABLE hbase_table_1(key int, value string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val");
FAILED: Error in metadata:
Re: Understanding Hive and Hbase
The queries go through the region servers, not directly to HDFS. JVS On Dec 2, 2011, at 10:53 AM, Gabriel Eisbruch wrote: Hi everybody, I have a question about HBase/Hive integration that I have not been able to find an answer to anywhere: if I run a Hive query that reads the HFiles from HDFS and runs on my Hadoop cluster, will the data in my region servers (not yet flushed) be processed by the query? Thank you very much Gabriel
Re: Understanding Hive and Hbase
Yes, everything goes through the HBase API. JVS On Dec 2, 2011, at 2:09 PM, Gabriel Eisbruch wrote: Ok, so the map/reduce jobs are connected to the region servers? Gabriel. On Dec 2, 2011, at 6:52 PM, John Sichi jsi...@fb.com wrote: The queries go through the region servers, not directly to HDFS. JVS On Dec 2, 2011, at 10:53 AM, Gabriel Eisbruch wrote: Hi everybody, I have a question about HBase/Hive integration that I have not been able to find an answer to anywhere: if I run a Hive query that reads the HFiles from HDFS and runs on my Hadoop cluster, will the data in my region servers (not yet flushed) be processed by the query? Thank you very much Gabriel
Re: Hive HBase wiki
It has been quite a while since those instructions were written, so maybe something has broken. There is a unit test for it (hbase-handler/src/test/queries/hbase_bulk.m) which is still passing. If you're running via CLI, logs by default go in /tmp/username. Long-term, energy best expended on this would go here: https://issues.apache.org/jira/browse/HIVE-2365 JVS On Nov 17, 2011, at 10:59 AM, Ben West wrote: Hey all, I'm having some trouble with the HBase bulk load, following the instructions from https://cwiki.apache.org/confluence/display/Hive/HBaseBulkLoad. In the last step (Sort Data) I get:

java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: No files found in hdfs://localhost/tmp/hive-cloudera/hive_2011-11-17_10-30-11_023_3494196694520237582/_tmp.-ext-1/_tmp.01_2
 at org.apache.hadoop.hive.ql.exec.ExecReducer.close(ExecReducer.java:311)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:479)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
 at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: No files found in hdfs://localhost/tmp/hive-cloudera/hive_2011-11-17_10-30-11_023_3494196694520237582/_tmp.-ext-1/_tmp.01_2
 at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:171)
 at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:642)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:557)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
 at org.apache.hadoop.hive.ql.exec.ExecReducer.close(ExecReducer.java:303)
 ... 7 more
Caused by: java.io.IOException: No files found in hdfs://localhost/tmp/hive-cloudera/hive_2011-11-17_10-30-11_023_3494196694520237582/_tmp.-ext-1/_tmp.01_2
 at org.apache.hadoop.hive.hbase.HiveHFileOutputFormat$2.close(HiveHFileOutputFormat.java:144)
 at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:168)
 ... 11 more

When I look at the source of HiveHFileOutputFormat.java it has:

// Move the region file(s) from the task output directory
// to the location specified by the user. There should
// actually only be one (each reducer produces one HFile),
// but we don't know what its name is.
FileSystem fs = outputdir.getFileSystem(jc);
fs.mkdirs(columnFamilyPath);
Path srcDir = outputdir;
for (;;) {
  FileStatus [] files = fs.listStatus(srcDir);
  if ((files == null) || (files.length == 0)) {
    throw new IOException("No files found in " + srcDir);
  }

So I am getting the issue where the task output directory is empty. I assume this is because the earlier task failed, but I'm not sure how to check this. Does anyone know what is going on or how I can find the error log of whatever was supposed to populate this directory? Thanks! -Ben
Re: testing out Phabricator for code review
Marek added support for svn, so that is working now too...give it a try! Instructions updated at https://cwiki.apache.org/confluence/display/Hive/PhabricatorCodeReview JVS On Oct 26, 2011, at 10:49 PM, wrote: I've put up instructions for how anyone can start using Phabricator for code review: https://cwiki.apache.org/confluence/display/Hive/PhabricatorCodeReview We've tested out the git workflows; still working on svn. Feedback on how it works for you, anything you noticed missing, etc is appreciated. JVS On Oct 20, 2011, at 2:00 PM, wrote: Hey all, Earlier this year, Facebook released a bunch of its code browsing/review tools as a new (and independent) open source project called Phabricator: http://phabricator.org/ We're currently experimenting with using it for improving the developer experience when contributing and reviewing Hive and HBase patches. (Also for eliminating committer confusion from different patch versions submitted to Review Board and JIRA, something which has bitten us a few times already.) You may notice some of this activity showing up in JIRA, e.g. https://issues.apache.org/jira/browse/HIVE-2515 I'll be sending out a lot more info once we've finished some of the setup and validation, but I just wanted to send out a heads-up for those who are already familiar with using the existing Review Board setup. Once validation is done, we'll publish instructions so that everyone can test it out for themselves as a potential alternative to Review Board. JVS
Re: testing out Phabricator for code review
I've put up instructions for how anyone can start using Phabricator for code review: https://cwiki.apache.org/confluence/display/Hive/PhabricatorCodeReview We've tested out the git workflows; still working on svn. Feedback on how it works for you, anything you noticed missing, etc is appreciated. JVS On Oct 20, 2011, at 2:00 PM, wrote: Hey all, Earlier this year, Facebook released a bunch of its code browsing/review tools as a new (and independent) open source project called Phabricator: http://phabricator.org/ We're currently experimenting with using it for improving the developer experience when contributing and reviewing Hive and HBase patches. (Also for eliminating committer confusion from different patch versions submitted to Review Board and JIRA, something which has bitten us a few times already.) You may notice some of this activity showing up in JIRA, e.g. https://issues.apache.org/jira/browse/HIVE-2515 I'll be sending out a lot more info once we've finished some of the setup and validation, but I just wanted to send out a heads-up for those who are already familiar with using the existing Review Board setup. Once validation is done, we'll publish instructions so that everyone can test it out for themselves as a potential alternative to Review Board. JVS
Re: Unit tests failing on hive-0.7.1
mirror.facebook.net is currently down and won't be back up for at least a few days. There's a fallback at http://archive.cloudera.com/hive-deps If it's not kicking in for you automatically, you'll need to edit ivy/ivysettings.xml. JVS On Sep 28, 2011, at 11:22 PM, Ramya Sunil wrote: Hi, I downloaded hive-0.7.1 and tried to run ant test However the build fails due to the following error: [ivy:retrieve] [FAILED ] hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source): The HTTP response code for http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz did not indicate a success. See log for more detail. (201ms) [ivy:retrieve] [FAILED ] hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source): The HTTP response code for http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz did not indicate a success. See log for more detail. (175ms) [ivy:retrieve] [FAILED ] hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source): The HTTP response code for http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz did not indicate a success. See log for more detail. (135ms) [ivy:retrieve] [FAILED ] hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source): The HTTP response code for http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz did not indicate a success. See log for more detail. (188ms) [ivy:retrieve] [FAILED ] hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source): (0ms) [ivy:retrieve] hadoop-source: tried [ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] apache-snapshot: tried [ivy:retrieve] https://repository.apache.org/content/repositories/snapshots/hadoop/core/0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] maven2: tried [ivy:retrieve] http://repo1.maven.org/maven2/hadoop/core/0.20.3-CDH3-SNAPSHOT/core-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] datanucleus-repo: tried [ivy:retrieve] http://www.datanucleus.org/downloads/maven2/hadoop/core/0.20.3-CDH3-SNAPSHOT/core-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] http://mirror.facebook.net/facebook/hive-deps/hadoop/core/hadoop-0.20.3-CDH3-SNAPSHOT/hadoop-0.20.3-CDH3-SNAPSHOT.tar.gz [ivy:retrieve] :: [ivy:retrieve] :: UNRESOLVED DEPENDENCIES :: [ivy:retrieve] :: [ivy:retrieve] :: hadoop#core;0.20.1: not found [ivy:retrieve] :: [ivy:retrieve] :: [ivy:retrieve] :: FAILED DOWNLOADS:: [ivy:retrieve] :: ^ see resolution messages for details ^ :: [ivy:retrieve] :: [ivy:retrieve] :: hadoop#core;0.20.3-CDH3-SNAPSHOT!hadoop.tar.gz(source) [ivy:retrieve] :: Can anyone please let me know how to get around this problem? Thanks Ramya
Apache Software Foundation Branding Requirements
Hey, the Apache Hive project is responsible for coming into compliance with these: http://www.apache.org/foundation/marks/pmcs.html I've created a JIRA issue for tracking this, with sub-tasks for the various work items: https://issues.apache.org/jira/browse/HIVE-2432 Our quarterly reports from the PMC to the ASF board will continue to include status updates on these until they are all resolved. If you are interested in helping out with any of that, please assign the corresponding sub-tasks to yourself. JVS
Re: failed when create an index with partitioned by clause
The wiki docs are incorrect here. CREATE INDEX does not yet support a PARTITIONED BY clause; that was added in the spec to support HIVE-1499, which hasn't been implemented yet. For now, the index partitioning always follows the table partitioning exactly. JVS On Aug 14, 2011, at 3:22 AM, Daniel,Wu wrote:

create table part (a int, b int) PARTITIONED BY (c int);
create index part_idx on table part(b,c) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD partitioned by (a);

hive> create index part_idx on table part(b,c) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD partitioned by (a);
FAILED: Parse Error: line 2:0 mismatched input 'partitioned' expecting EOF near 'REBUILD'
hive>

If I remove the partitioned by (a), then the index can be created. But I need to partition it on a column. Is that not supported yet, or did I make some mistake?
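To make the current behavior concrete, here is a rough sketch of what does work today with the table from the question; the DDL is an illustrative assumption based on the index syntax in this release, not a tested transcript:

create table part (a int, b int) partitioned by (c int);
create index part_idx on table part(b)
  as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
  with deferred rebuild;
-- the index inherits the table's partitioning on c, so it is rebuilt per partition
alter index part_idx on part partition (c = 1) rebuild;

Until HIVE-1499 is implemented, you get per-partition index data this way, but you cannot give the index a partitioning scheme that differs from the table's.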
Re: Hive Wiki: Editing Permissions
I've granted you write access...thanks for helping to fix the wiki! JVS On Aug 8, 2011, at 11:08 AM, Travis Powell wrote: Hello, The Wiki has been one of the most important resources to me for learning Hive. There are a lot of broken links that make it hard to flip between topics. If I could gain editing permissions, I’d be happy to fix links and hopefully make this wiki a bit easier to learn from. Thanks, Travis Powell Travis Powell / tpow...@tealeaf.com
becoming a Hive committer
I often get asked questions about this topic, so I've put together a wiki page which expresses some of my thoughts on it: https://cwiki.apache.org/confluence/display/Hive/BecomingACommitter Let me know if there are points you'd like to add, or where you see it differently. JVS
Re: Hello!
As the comments in HIVE-1228 mention, we decided not to address the :timestamp requirement. So if you need that, you can work on enhancing the HBase storage handler by opening a JIRA issue, proposing an approach, and submitting a patch. JVS On Jul 24, 2011, at 12:42 PM, 张建轶 wrote: Hello! I'm a web developer in China, and I'm studying Hive and HBase for my job. I have encountered a problem with the HBase handler in Hive. I want the :timestamp column to be mapped for read or write, so that I can import timestamp data into an HBase table from Hive. I found the following URL for this issue: https://issues.apache.org/jira/browse/HIVE-1228 It seems that this problem has been resolved. But the page https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration says that there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp. How can I map the latest timestamp into Hive, or INSERT OVERWRITE TABLE hbase_table_1 with a timestamp? I have googled it many times, but found nothing. If you could help me, it would be a great kindness. I will appreciate your reply. With thanks and best wishes to you. Yours, Jianyi Zhang
Re: wiki has moved!
On Jun 27, 2011, at 4:37 PM, Time Less wrote: Might as well add me as an editor. I've found tons of errors and problems. Not the least of which is that the regexserde example is now completely borked and nonsensical. Compare ([^]*) ([^]*) ... against ([^ ]*) ([^ ]*) ... (the space inside the character class was dropped in the migration) - I thought I was going insane. Email me your Confluence account name. Also, Google still points to the old documentation, which doesn't exist. You need to add in some 301 redirects so Google will get the message, too: http://en.wikipedia.org/wiki/HTTP_301. I believe Google isn't the only HTTP client that will benefit from a 301 status. I don't have control over the MoinMoin server; if someone has something specific they can create an INFRA request, but the page name translation is not 1-to-1, so it's probably not worth the effort; the old stuff should age out, and the new stuff will get crawled soon enough. JVS
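For reference, the kind of RegexSerDe table definition under discussion looks roughly like this; it is a hedged sketch in the style of the wiki's space-delimited-log example, with placeholder table and column names, not a quote from the page:

CREATE TABLE weblog (host STRING, identity STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- note the space inside each character class: [^ ]*, not [^]*
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*)"
)
STORED AS TEXTFILE;

With the space dropped during migration, the pattern no longer matches space-delimited fields as intended, which is presumably the borked behavior being reported.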
Re: wiki has moved!
On Jun 27, 2011, at 5:16 PM, wrote: I don't have control over the MoinMoin server; if someone has something specific they can create an INFRA request, but the page name translation is not 1-to-1, so it's probably not worth the effort; the old stuff should age out, and the new stuff will get crawled soon enough. Hmm, but looks like there's a robots.txt which blocks crawlers in https://cwiki.apache.org/confluence Instead, from looking at other projects such as Avro, I guess the crawlers are supposed to hit the generated HTML under https://cwiki.apache.org/Hive But the HTML pages only seem to get regenerated on edit, so most of them aren't there post-import; let's see if a cron job kicks in. Also, the CSS is missing border padding, so we'll need to fix that. JVS
wiki has moved!
Hey there, With some wiki migration magic from Brock Noland (assisted by Gavin from INFRA), we've moved all of the content from MoinMoin to Confluence. The new location is here: https://cwiki.apache.org/confluence/display/Hive All of the MoinMoin pages have been deleted; this is to make sure people don't accidentally keep editing there. I left behind some forwarding info. We need your help (or at least tolerance) to deal with some of the imperfections in the migration process: https://cwiki.apache.org/confluence/display/Hive/AboutThisWiki If you were already an editor on the old wiki, or if you would like to help with fixing/editing now, contact me for write access to the new one. If you turn out to be a spammer, I will hunt you down, disembowel you, and feed your entrails to my dog. JVS
Re: Getting a weird error when using the ngrams function
Hmmm, I think this might be a bug which is only exposed when one of the mappers gets zero rows of input. If you have a Hive build, can you try adding this before line 238 of GenericUDAFnGrams.java?

if (n == 0) { return; }

Just before this line:

if (myagg.n > 0 && n > 0 && myagg.n != n) {

If that fixes it, create a new JIRA issue so we can get a fix committed. JVS On Jun 20, 2011, at 8:06 AM, Matthew Rathbone wrote: Hoping someone with more expertise could help on this: I have no idea what's causing this to happen, but here is the exception:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {key:{},value:{_col0:[0,0,0,0]},alias:0}
 at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:268)
 at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:415)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {key:{},value:{_col0:[0,0,0,0]},alias:0}
 at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:256)
 ... 3 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: GenericUDAFnGramEvaluator: mismatch in value for 'n', which usually is caused by a non-constant expression. Found '0' and '1'.
 at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFnGrams$GenericUDAFnGramEvaluator.merge(GenericUDAFnGrams.java:239)
 at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:142)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:592)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:816)
 at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:716)
 at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:470)
 at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:247)

What does 'mismatch in value for n' mean? My query is super simple: select ngrams(sentences(text), 1, 50) from messages -- Matthew Rathbone Foursquare | Software Engineer | Server Engineering Team matt...@foursquare.com | @rathboma | 4sq
Travel Assistance applications now open for ApacheCon NA 2011
The Apache Software Foundation (ASF)'s Travel Assistance Committee (TAC) is now accepting applications for ApacheCon North America 2011, 7-11 November in Vancouver BC, Canada. The TAC is seeking individuals from the Apache community at-large --users, developers, educators, students, Committers, and Members-- who would like to attend ApacheCon, but need some financial support in order to be able to get there. There are limited places available, and all applicants will be scored on their individual merit. Financial assistance is available to cover flights/trains, accommodation and entrance fees either in part or in full, depending on circumstances. However, the support available for those attending only the BarCamp (7-8 November) is less than that for those attending the entire event (Conference + BarCamp 7-11 November). The Travel Assistance Committee aims to support all official ASF events, including cross-project activities; as such, it may be prudent for those in Asia and Europe to wait for an event geographically closer to them. More information can be found at http://www.apache.org/travel/index.html including a link to the online application and detailed instructions for submitting. Applications will close on 8 July 2011 at 22:00 BST (UTC/GMT +1). We wish good luck to all those who will apply, and thank you in advance for tweeting, blogging, and otherwise spreading the word. Regards, The Travel Assistance Committee
wiki spam
Apparently our roadmap includes hard drive recovery, wedding reception flowers, and developing muscles. http://wiki.apache.org/hadoop/Hive/Roadmap Anyone want to take a crack at migrating the Hive wiki content from Hadoop's MoinMoin over to the Hive-specific Confluence space we have set up? In Confluence, it's possible to restrict editing to specific groups. And the markup is a lot better, and attachments can be used, and ... This would be a great contribution by anyone interested in helping Hive out, so if you're interested, let me know. JVS
Re: hadoop, hive and hbase problem
Try one of these suggestions: (1) run HBase and Hive in separate clusters (the downside is that map/reduce tasks will have to issue remote requests to region servers, whereas normally they could run on the same nodes) (2) debug the shim exception and see if you can contribute a patch that makes Hive compatible with that Hadoop version JVS On May 9, 2011, at 11:17 AM, labtrax wrote: Hello, it seems that hive 0.6 and 0.7 are incompatible with the hadoop-append jar from hbase 0.90.2. But without the append jar you cannot use hbase in production... Any advice for the hadoop/hbase/hive version jungle? I already asked this last month but I didn't get a reasonable answer. Cheers labtrax Hello, I have a hadoop cluster running with the hadoop-append jar (hadoop-core-0.20-append-r1056497-core.jar) for hbase reasons. I tried hive 0.6.0 and 0.7.0, and for both, when I start it I get

Exception in thread "main" java.lang.RuntimeException: Could not load shims in class null
 at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:90)
 at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:66)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:249)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.NullPointerException
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:169)
 at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:87)

My hive-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://name-node:54310/hive</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>

My hadoop cluster is working properly, and the hive dir is already created. Any clues? labtrax
Re: index is not working for me.
Automatic usage of indexes is still under development (HIVE-1644). JVS On Apr 15, 2011, at 1:31 AM, Erix Yao wrote: hi, I installed the hive-0.7 release for the index feature. Here's my test table schema: create table testforindex (id bigint, type int) row format delimited fields terminated by ',' lines terminated by '\n';; create index type_idx on table testforindex (type) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD; I loaded 10 data for index test: hive select count(*) from testforindex; OK 10 Time taken: 22.247 seconds Here's the test select for index: hive select count(*) from testforindex where type=1; OK 1000 Time taken: 20.279 seconds But in the jobtracker I see : Counter Map Reduce Total Map input records 100,000 0 100,000 The hive still use the full table scan for the result. Is there anybody that can tell me what's wrong in my test? -- haitao.yao@Beijing
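Until that lands, the index has to be exercised by hand. A rough sketch of what that looks like with the compact index handler is below; the generated index table name follows the default__<table>_<index>__ convention, but treat the exact name and columns as assumptions to verify with SHOW TABLES and DESCRIBE:

-- materialize the index data after loading the base table
ALTER INDEX type_idx ON testforindex REBUILD;
-- the handler stores the index in an ordinary table that can be inspected directly
DESCRIBE default__testforindex_type_idx__;
-- each row maps an indexed value to the base table's files and offsets
SELECT * FROM default__testforindex_type_idx__ WHERE type = 1 LIMIT 10;

Until HIVE-1644 is in, the optimizer will not rewrite a query like select count(*) from testforindex where type=1 to consult that table automatically, which is why the job still scans all 100,000 input records.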
Re: UDF constructor params?
https://issues.apache.org/jira/browse/HIVE-1016 https://issues.apache.org/jira/browse/HIVE-1360 JVS On Apr 5, 2011, at 11:20 AM, Larry Ogrodnek wrote: For some UDFs I'm working on now it feels like it would be handy to be able to pass in parameters during construction. It's an integration with an external reporting API... e.g. -- include last 30 days from april 4th create temporary function orders_last_month as 'com.example.OrderSearch(20110404, 30)' -- get orders for customer 11 select order_last_month(11), ... Obviously I can perform the same logic passing everything into the UDF: select orders_last_month(20110404, 30, 11), ... but this doesn't feel as nice.. additionally, having the information available in the constructor might give the UDF more information on how to perform caching, allow it to do more complex initialization, etc. Just wondering if this has ever been thought about, discussed, or needed by anyone else thanks, larry
Re: Performance between Hive queries vs. Hive over HBase queries
There's one here specifically for the Hive portion, but really a full-stack system profile is needed for deciding where to attack it: https://issues.apache.org/jira/browse/HIVE-1231 I don't know of anyone currently working in this area. JVS On Mar 8, 2011, at 9:51 PM, Otis Gospodnetic wrote: Hi, John, are there plans or specific JIRA issues related to this particular performance hit that you or somebody else is working on and that those of us interested in performance improvements when Hive points to external tables in HBase should watch? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: John Sichi jsi...@fb.com To: user@hive.apache.org user@hive.apache.org Sent: Tue, March 8, 2011 1:17:51 AM Subject: Re: Performance between Hive queries vs. Hive over HBase queries For native tables, Hive reads rows directly from HDFS. For HBase tables, it has to go through the HBase region servers, which reconstruct rows from column families (combining cache + HDFS). HBase makes it possible to keep your table up to date in real time, but you have to pay an overhead cost at query time. On the other hand, with native Hive tables, there's latency in loading new batches of data. JVS On Mar 7, 2011, at 10:13 PM, Biju Kaimal wrote: Hi, Could you please explain the reason for the behavior? Regards, Biju On Tue, Mar 8, 2011 at 11:35 AM, John Sichi jsi...@fb.com wrote: Yes. JVS On Mar 7, 2011, at 9:59 PM, Biju Kaimal wrote: Hi, I loaded a data set which has 1 million rows into both Hive and HBase tables. For the HBase table, I created a corresponding Hive table so that the data in HBase can be queried from Hive QL. Both tables have a key column and a value column For the same query (select value, count(*) from table group by value), the Hive only query runs much faster (~ 30 seconds) as compared to Hive over HBase (~ 150 seconds). Is this expected? Regards, Biju
Re: Performance between Hive queries vs. Hive over HBase queries
Factor of 5 closely matches the results I got when I was testing. JVS On Mar 9, 2011, at 1:23 PM, Otis Gospodnetic wrote: Hi, Biju's example shows a factor of 5 decrease in performance when Hive points to HBase tables. Does anyone know how much this factor varies? Is if often closer to 1 or is is more often close to 10? Just trying to get a better feel for this... Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: John Sichi jsi...@fb.com To: user@hive.apache.org user@hive.apache.org Sent: Tue, March 8, 2011 1:05:34 AM Subject: Re: Performance between Hive queries vs. Hive over HBase queries Yes. JVS On Mar 7, 2011, at 9:59 PM, Biju Kaimal wrote: Hi, I loaded a data set which has 1 million rows into both Hive and HBase tables. For the HBase table, I created a corresponding Hive table so that the data in HBase can be queried from Hive QL. Both tables have a key column and a value column For the same query (select value, count(*) from table group by value), the Hive only query runs much faster (~ 30 seconds) as compared to Hive over HBase (~ 150 seconds). Is this expected? Regards, Biju
Re: Performance between Hive queries vs. Hive over HBase queries
Yes. JVS On Mar 7, 2011, at 9:59 PM, Biju Kaimal wrote: Hi, I loaded a data set which has 1 million rows into both Hive and HBase tables. For the HBase table, I created a corresponding Hive table so that the data in HBase can be queried from Hive QL. Both tables have a key column and a value column For the same query (select value, count(*) from table group by value), the Hive only query runs much faster (~ 30 seconds) as compared to Hive over HBase (~ 150 seconds). Is this expected? Regards, Biju
Re: Performance between Hive queries vs. Hive over HBase queries
For native tables, Hive reads rows directly from HDFS. For HBase tables, it has to go through the HBase region servers, which reconstruct rows from column families (combining cache + HDFS). HBase makes it possible to keep your table up to date in real time, but you have to pay an overhead cost at query time. On the other hand, with native Hive tables, there's latency in loading new batches of data. JVS On Mar 7, 2011, at 10:13 PM, Biju Kaimal wrote: Hi, Could you please explain the reason for the behavior? Regards, Biju On Tue, Mar 8, 2011 at 11:35 AM, John Sichi jsi...@fb.com wrote: Yes. JVS On Mar 7, 2011, at 9:59 PM, Biju Kaimal wrote: Hi, I loaded a data set which has 1 million rows into both Hive and HBase tables. For the HBase table, I created a corresponding Hive table so that the data in HBase can be queried from Hive QL. Both tables have a key column and a value column For the same query (select value, count(*) from table group by value), the Hive only query runs much faster (~ 30 seconds) as compared to Hive over HBase (~ 150 seconds). Is this expected? Regards, Biju
partitioned views
One of the impediments for uptake of the CREATE VIEW feature in Hive has been the lack of partition awareness. This made it non-transparent to replace a table with a view, e.g. for renaming purposes. To address this as well as some other use cases, I'm proposing the first steps towards view partition support: http://wiki.apache.org/hadoop/Hive/PartitionedViews This solution is still primitive, but should make using views at least possible in a number of cases, with a bit of extra DDL/ETL effort. Feedback welcome. JVS
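As a reminder of the renaming use case mentioned above, the non-partitioned form already works; a minimal sketch with illustrative table names:

-- expose an existing table under a new name without copying data
CREATE VIEW page_views AS
SELECT * FROM page_views_legacy;

What is missing, and what the proposal starts to address, is that such a view carries no partition metadata, so anything that expects partitions on the original table (SHOW PARTITIONS, partition-oriented ETL) cannot treat the view as a drop-in replacement.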
Re: [VOTE] Sponsoring Howl as an Apache Incubator project
But Howl does layer on some additional code, right? https://github.com/yahoo/howl/tree/howl/howl JVS On Feb 3, 2011, at 1:49 PM, Ashutosh Chauhan wrote: There are none as of today. In the past, whenever we had to have changes, we do it in a separate branch in Howl and once those get committed to hive repo, we pull it over in our trunk and drop the branch. Ashutosh On Thu, Feb 3, 2011 at 13:41, yongqiang he heyongqiang...@gmail.com wrote: I am interested in some numbers around the lines of code changes (or files of changes) which are in Howl but not in Hive? Can anyone give some information here? Thanks Yongqiang On Thu, Feb 3, 2011 at 1:15 PM, Jeff Hammerbacher ham...@cloudera.com wrote: Hey, If we do go ahead with pulling the metastore out of Hive, it might make most sense for Howl to become its own TLP rather than a subproject. Yes, I did not read the proposal closely enough. I think an end state as a TLP makes more sense for Howl than as a Pig subproject. I'd really love to see Howl replace the metastore in Hive and it would be more natural to do so as a TLP than as a Pig subproject--especially since the current Howl repository is literally a fork of Hive. In the incubator proposal, we have mentioned these issues, but we've attempted to avoid prejudicing any decision. Instead, we'd like to assess the pros and cons (including effort required and impact expected) for both approaches as part of the incubation process. Glad the issues are being considered. Later, Jeff
Re: [VOTE] Sponsoring Howl as an Apache Incubator project
I forgot about the serde dependencies...can you add those to the Initial Source note in [[HowlProposal]] just for completeness? JVS On Feb 3, 2011, at 3:11 PM, Alan Gates wrote: Yes, it adds Input and Output formats for MapReduce and load and store functions for Pig. In the future it we expect it will continue to add more additional layers. Alan. On Feb 3, 2011, at 2:49 PM, John Sichi wrote: But Howl does layer on some additional code, right? https://github.com/yahoo/howl/tree/howl/howl JVS On Feb 3, 2011, at 1:49 PM, Ashutosh Chauhan wrote: There are none as of today. In the past, whenever we had to have changes, we do it in a separate branch in Howl and once those get committed to hive repo, we pull it over in our trunk and drop the branch. Ashutosh On Thu, Feb 3, 2011 at 13:41, yongqiang he heyongqiang...@gmail.com wrote: I am interested in some numbers around the lines of code changes (or files of changes) which are in Howl but not in Hive? Can anyone give some information here? Thanks Yongqiang On Thu, Feb 3, 2011 at 1:15 PM, Jeff Hammerbacher ham...@cloudera.com wrote: Hey, If we do go ahead with pulling the metastore out of Hive, it might make most sense for Howl to become its own TLP rather than a subproject. Yes, I did not read the proposal closely enough. I think an end state as a TLP makes more sense for Howl than as a Pig subproject. I'd really love to see Howl replace the metastore in Hive and it would be more natural to do so as a TLP than as a Pig subproject--especially since the current Howl repository is literally a fork of Hive. In the incubator proposal, we have mentioned these issues, but we've attempted to avoid prejudicing any decision. Instead, we'd like to assess the pros and cons (including effort required and impact expected) for both approaches as part of the incubation process. Glad the issues are being considered. Later, Jeff
Re: [VOTE] Sponsoring Howl as an Apache Incubator project
On Feb 3, 2011, at 5:09 PM, Alan Gates wrote: Are you referring to the serde jar or any particular serde's we are making use of? Both (see below). JVS [jsichi@dev1066 ~/open/howl/howl/howl/src/java/org/apache/hadoop/hive/howl] ls cli/ common/ data/ mapreduce/ pig/ rcfile/ [jsichi@dev1066 ~/open/howl/howl/howl/src/java/org/apache/hadoop/hive/howl] grep serde */* common/HowlUtil.java:import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo; common/HowlUtil.java:import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde.Constants; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.ColumnProjectionUtils; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.SerDe; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.SerDeException; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.columnar.ColumnarStruct; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.StructField; rcfile/RCFileInputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector; rcfile/RCFileInputDriver.java: private SerDe serde; rcfile/RCFileInputDriver.java: struct = (ColumnarStruct)serde.deserialize(bytesRefArray); rcfile/RCFileInputDriver.java: serde = new ColumnarSerDe(); rcfile/RCFileInputDriver.java: serde.initialize(context.getConfiguration(), howlProperties); rcfile/RCFileInputDriver.java: oi = (StructObjectInspector) serde.getObjectInspector(); rcfile/RCFileMapReduceInputFormat.java:import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable; rcfile/RCFileMapReduceOutputFormat.java:import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable; rcfile/RCFileMapReduceRecordReader.java:import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde.Constants; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.SerDe; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.SerDeException; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo; 
rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.MapTypeInfo; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo; rcfile/RCFileOutputDriver.java:import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils; rcfile/RCFileOutputDriver.java: /** The serde for serializing the HowlRecord to bytes writable */ rcfile/RCFileOutputDriver.java: private SerDe serde; rcfile/RCFileOutputDriver.java: return serde.serialize(value.getAll(), objectInspector); rcfile/RCFileOutputDriver.java: serde = new ColumnarSerDe(); rcfile/RCFileOutputDriver.java: serde.initialize(context.getConfiguration(), howlProperties); Howl, howl, howl, howl! O! you are men of stones: Had I your tongues and eyes, I'd use them so That heaven's vaults should crack
Re: [VOTE] Sponsoring Howl as an Apache Incubator project
Got it, thanks for the correction. JVS On Feb 3, 2011, at 4:56 PM, Alex Boisvert wrote: Hi John, Just to clarify where I was going with my line of questioning. There's no Apache policy that prevents dependencies on an incubator project, whether it's releases, snapshots, or even home-made hacked-together packaging of an incubator project. It's been done before, and as long as the incubator code's IP has been cleared and the packaging isn't represented as an official release when it isn't one, there's nothing wrong with doing that. Now, whether the project chooses to use and release with an incubator dependency is a matter of judgment (and ultimately a vote by committers if there is no consensus). I just wanted to make sure there were no incorrect assumptions made. alex On Thu, Feb 3, 2011 at 4:07 PM, John Sichi jsi...@fb.com wrote: I was going off of what I read in HADOOP-3676 (which lacks a reference as well). But I guess if a release can be made from the incubator, then it's not a blocker. JVS On Feb 3, 2011, at 3:29 PM, Alex Boisvert wrote: On Thu, Feb 3, 2011 at 11:38 AM, John Sichi jsi...@fb.com wrote: Besides the fact that the refactoring required is significant, I don't think this is possible to do quickly since: 1) Hive (unlike Pig) requires a metastore 2) Hive releases can't depend on an incubator project I'm not sure what you mean by can't depend on an incubator project here. AFAIK, there is no policy at Apache that projects should not depend on incubator projects. Can you clarify what you mean and why you think such a restriction exists? alex
Re: Hive/Hbase Integration Error
Here is what you need to do: 1) Use svn to check out the source for Hive 0.6 2) In your checkout, replace the HBase 0.20.3 jars with the ones from 0.20.6 3) Build Hive 0.6 from source 4) Use your new Hive build JVS On Jan 6, 2011, at 2:34 AM, Adarsh Sharma wrote: Dear all, I am sorry I am posting this message again but I can't able to locate the root cause after googled a lot. I am trying Hive/Hbase Integration from the past 2 days. I am facing the below issue while creating external table in Hive. I am using hadoop-0.20.2, hbase-0.20.6, hive-0.6.0 ( Mysql as metstore ) and java-1.6.0_20. Hbase-0.20.3 is also checked. Problem arises when I issue the below command : hive CREATE TABLE hive_hbasetable_k(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES (hbase.columns.mapping = :key,cf1:val) TBLPROPERTIES (hbase.table.name = hivehbasek); FAILED: Error in metadata: MetaException(message:org.apache.hadoop.hbase.MasterNotRunningException at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getMaster(HConnectionManager.java:374) at org.apache.hadoop.hbase.client.HBaseAdmin.init(HBaseAdmin.java:72) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.getHBaseAdmin(HBaseStorageHandler.java:64) at org.apache.hadoop.hive.hbase.HBaseStorageHandler.preCreateTable(HBaseStorageHandler.java:159) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:275) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:394) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:2126) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:166) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:107) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:55) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:633) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:506) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:384) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:138) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:302) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask It seems my HMaster is not Running but I checked from IP:60010 that it is running and I am able to create,insert tables in Hbase Properly. Below is the contents of my hive.log : 2011-01-05 15:19:36,783 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it cannot be resolved. 2011-01-05 15:19:36,783 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it cannot be resolved. 2011-01-05 15:19:36,785 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it cannot be resolved. 2011-01-05 15:19:36,785 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it cannot be resolved. 
2011-01-05 15:19:36,786 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be resolved. 2011-01-05 15:19:36,786 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be resolved. 2011-01-05 15:20:12,185 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(967)) - Exception closing session 0x0 to sun.nio.ch.selectionkeyi...@561279c8 java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:933) 2011-01-05 15:20:12,188 WARN zookeeper.ClientCnxn (ClientCnxn.java:cleanup(1001)) - Ignoring exception during shutdown input java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638) at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999) at
Re: Error in metadata: javax.jdo.JDOFatalDataStoreException
Since the exception below is from JDO, it has to do with the configuration of Hive's metastore (not HBase/Zookeeper). JVS On Jan 5, 2011, at 2:14 AM, Adarsh Sharma wrote: Dear all, I am trying Hive/Hbase Integration from the past 2 days. I am facing the below issue while creating external table in Hive. *Command-Line Error :- *had...@s2-ratw-1:~/project/hive-0.6.0/build/dist$ bin/hive --auxpath /home/hadoop/project/hive-0.6.0/build/dist/lib/hive_hbase-handler.jar,/home/hadoop/project/hive-0.6.0/build/dist/lib/hbase-0.20.3.jar,/home/hadoop/project/hive-0.6.0/build/dist/lib/zookeeper-3.2.2.jar -hiveconf hbase.zookeeper.quorum=192.168.1.103,192.168.1.114,192.168.1.115,192.168.1.104,192.168.1.107 Hive history file=/tmp/hadoop/hive_job_log_hadoop_201101051527_1728376885.txt hive show tables; FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server. NestedThrowables: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server. FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask hive exit; had...@s2-ratw-1:~/project/hive-0.6.0/build/dist$ *My hive.log file says :* 2011-01-05 15:19:36,783 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it cannot be resolved. 2011-01-05 15:19:36,783 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it cannot be resolved. 2011-01-05 15:19:36,785 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it cannot be resolved. 2011-01-05 15:19:36,785 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it cannot be resolved. 2011-01-05 15:19:36,786 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be resolved. 2011-01-05 15:19:36,786 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be resolved. 
2011-01-05 15:20:12,185 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(967)) - Exception closing session 0x0 to sun.nio.ch.selectionkeyi...@561279c8 java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:933) 2011-01-05 15:20:12,188 WARN zookeeper.ClientCnxn (ClientCnxn.java:cleanup(1001)) - Ignoring exception during shutdown input java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:638) at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970) 2011-01-05 15:20:12,188 WARN zookeeper.ClientCnxn (ClientCnxn.java:cleanup(1006)) - Ignoring exception during shutdown output java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:649) at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970) 2011-01-05 15:20:12,621 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(967)) - Exception closing session 0x0 to sun.nio.ch.selectionkeyi...@799dbc3b I overcomed from the previous issue of MasterNotRunning Exception which occured due to incompatibilities in hive_hbase jars. Now I'm using Hadoop-0.20.2, Hive-0.6.0 ( Bydefault Derby metastore ) and Hbase-0.20.3. Please tell how this could be resolved. Also I want to add one more thing that my hadoop Cluster is of 9 nodes and 8 nodes act as Datanodes,Tasktrackers and Regionservers. Among these nodes is set zookeeper.quorum.property to have 5 Datanodes. Would this is the issue. I don't know the number of servers needed for Zookeeper in fully distributed mode. Best Regards Adarsh Sharma
Re: Question about View in Hive 0.6
It runs the same as a nested select. Currently, since Hive doesn't do any relational common subexpression elimination, it will be executed twice. In the example below, this can be a good thing, since cond1 and cond2 can be pushed down separately. JVS On Dec 28, 2010, at 12:18 AM, Neil Xu wrote: in hive 0.6, we can create view for tables, views are actually run as a subquery when querying data, is it optimized in hive that a view is executed only once in a single query? Thanks in advance! such as: select * from ( select x,y from view1 where cond1 union all select x,y from view1 where cond2 ) Neil,
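As a concrete illustration of what that means for the query in the question (the table and partition names in the view body are placeholders), each reference to the view expands into its own nested select:

CREATE VIEW view1 AS
SELECT x, y FROM base_table WHERE dt = '2010-12-01';

SELECT * FROM (
  SELECT x, y FROM view1 WHERE cond1
  UNION ALL
  SELECT x, y FROM view1 WHERE cond2
) t;

-- planned roughly as if the view text had been inlined twice:
-- SELECT x, y FROM (SELECT x, y FROM base_table WHERE dt = '2010-12-01') v WHERE cond1
-- UNION ALL
-- SELECT x, y FROM (SELECT x, y FROM base_table WHERE dt = '2010-12-01') v WHERE cond2

The upside, as noted above, is that cond1 and cond2 can each be pushed down into their own branch; the downside is that the view's work is done once per reference, since there is no common subexpression elimination.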
Re: Altering / querying default column names (_c1, _c2, etc)
Enclose them in backticks. alter table fb_images1 change `_c5` ref_array array<string>; JVS On Dec 20, 2010, at 3:23 PM, Leo Alekseyev wrote: Often I forget to name a column that results from running an aggregation. Then, I'm stuck: describe table lists those columns by their default names, i.e. something like _c1, but I can't seem to query or rename those columns: alter table fb_images1 change _c5 ref_array array<string>; FAILED: Parse Error: line 1:30 mismatched input '_c5' expecting Identifier in rename column name Is there a resolution for this? One workaround would be to create a new table and load the data into it, but it seems inelegant to say the least. --Leo
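The full round trip looks roughly like this (a sketch only; the table, column names, and types below are hypothetical stand-ins rather than the actual schema from the question):

    -- An aggregation column without an alias gets a default name such as _c1.
    CREATE TABLE image_counts AS
    SELECT owner, count(1) FROM fb_images1 GROUP BY owner;

    DESCRIBE image_counts;   -- shows: owner string, _c1 bigint

    -- The default name starts with an underscore, so it is not a plain
    -- identifier and must be quoted with backticks when renaming it.
    ALTER TABLE image_counts CHANGE `_c1` image_count BIGINT;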
Re: Hive HBase integration scan failing
It's supposed to happen automatically. The JIRA issue below mentions one case where it wasn't, and explains how I detected it and worked around it. To make sure you're getting locality, look at the task tracker and make sure that for your map tasks, the host used for executing the task matches the input split location. JVS On Dec 10, 2010, at 10:10 AM, vlisovsky wrote: Thanks for the info. Moreover, how can we make sure that our regionservers are running on the same Datanodes (locality)? Is there a way we can check? On Thu, Dec 9, 2010 at 11:09 PM, John Sichi jsi...@fb.com wrote: Try set hbase.client.scanner.caching=5000; Also, check to make sure that you are getting the expected locality so that mappers are running on the same nodes as the region servers they are scanning (assuming that you are running HBase and mapreduce on the same cluster). When I was testing this, I encountered this problem (but it may have been specific to our cluster configurations): https://issues.apache.org/jira/browse/HBASE-2535 JVS On Dec 9, 2010, at 10:46 PM, vlisovsky wrote: Hi Guys, Wonder if anybody could shed some light on how to reduce the load on the HBase cluster when running a full scan. The need is to dump everything I have in HBase into a Hive table. The HBase data size is around 500g. The job creates 9000 mappers; after about 1000 maps, things go south every time. If I run the insert below, it runs for about 30 minutes, then starts bringing down the HBase cluster, after which region servers need to be restarted. Wonder if there is a way to throttle it somehow, or otherwise if there is any other method of getting structured data out? Any help is appreciated, Thanks, -Vitaly create external table hbase_linked_table ( mykey string, info map<string, string> ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:") TBLPROPERTIES ("hbase.table.name" = "hbase_table2"); set hive.exec.compress.output=true; set io.seqfile.compression.type=BLOCK; set mapred.output.compression.type=BLOCK; set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; set mapred.reduce.tasks=40; set mapred.map.tasks=25; INSERT overwrite table tmp_hive_destination select * from hbase_linked_table;
Re: Hive HBase integration scan failing
Try set hbase.client.scanner.caching=5000; Also, check to make sure that you are getting the expected locality so that mappers are running on the same nodes as the region servers they are scanning (assuming that you are running HBase and mapreduce on the same cluster). When I was testing this, I encountered this problem (but it may have been specific to our cluster configurations): https://issues.apache.org/jira/browse/HBASE-2535 JVS On Dec 9, 2010, at 10:46 PM, vlisovsky wrote: Hi Guys, Wonder if anybody could shed some light on how to reduce the load on the HBase cluster when running a full scan. The need is to dump everything I have in HBase into a Hive table. The HBase data size is around 500g. The job creates 9000 mappers; after about 1000 maps, things go south every time. If I run the insert below, it runs for about 30 minutes, then starts bringing down the HBase cluster, after which region servers need to be restarted. Wonder if there is a way to throttle it somehow, or otherwise if there is any other method of getting structured data out? Any help is appreciated, Thanks, -Vitaly create external table hbase_linked_table ( mykey string, info map<string, string> ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:") TBLPROPERTIES ("hbase.table.name" = "hbase_table2"); set hive.exec.compress.output=true; set io.seqfile.compression.type=BLOCK; set mapred.output.compression.type=BLOCK; set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; set mapred.reduce.tasks=40; set mapred.map.tasks=25; INSERT overwrite table tmp_hive_destination select * from hbase_linked_table;
Re: Hive/HBase integration issue.
As noted here, when writing to HBase, existing rows are overwritten, but old rows are not deleted. http://wiki.apache.org/hadoop/Hive/HBaseIntegration#Overwrite There is not yet any deletion support. JVS On Nov 18, 2010, at 1:00 AM, afancy wrote: Hi, Does the INSERT clause have to include OVERWRITE, which means that the new data will overwrite the previous data? How can a true (append-style) INSERT be done instead of OVERWRITE? BTW: how can the DELETE operator be implemented? thanks afancy --- hive> insert OVERWRITE table pagedim select 0, url, strToint('2'), 'domain', 'serversion' from downloadlog; Total MapReduce jobs = 2 Launching Job 1 out of 2 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201011121525_0006, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201011121525_0006 Kill Command = /home/xiliu/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201011121525_0006 2010-11-18 09:55:52,155 Stage-1 map = 0%, reduce = 0% 2010-11-18 09:55:55,169 Stage-1 map = 100%, reduce = 0% 2010-11-18 09:55:58,200 Stage-1 map = 100%, reduce = 100% Ended Job = job_201011121525_0006 Ended Job = 487027960, job is filtered out (removed at runtime). Launching Job 2 out of 2 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201011121525_0007, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201011121525_0007 Kill Command = /home/xiliu/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201011121525_0007 2010-11-18 09:56:04,701 Stage-2 map = 0%, reduce = 0% 2010-11-18 09:56:07,723 Stage-2 map = 100%, reduce = 0% 2010-11-18 09:56:10,751 Stage-2 map = 100%, reduce = 100% Ended Job = job_201011121525_0007 Loading data to table pagedim 1000 Rows loaded to pagedim OK Time taken: 23.194 seconds hive> insert table pagedim select 0, url, strToint('2'), 'domain', 'serversion' from downloadlog; FAILED: Parse Error: line 1:7 mismatched input 'table' expecting OVERWRITE in insert clause
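To make the overwrite-without-delete behavior concrete, here is a hedged sketch reusing the table names from the session above; the pageid key column and the URL filter are assumptions, since the real schema is not shown:

    -- First load: one HBase row is written (upserted) per key in the result.
    insert OVERWRITE table pagedim
    select pageid, url, strToint('2'), 'domain', 'serversion' from downloadlog;

    -- A later OVERWRITE with a narrower source only rewrites the keys it
    -- produces; keys written by the first load but absent here are NOT
    -- deleted, because the HBase storage handler has no deletion support.
    insert OVERWRITE table pagedim
    select pageid, url, strToint('2'), 'domain', 'serversion' from downloadlog
    where url like 'http://example.org/%';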
Re: HBase as input AND output?
If your query only accesses HBase tables, then yes, Hive does not access any source data directly from HDFS (although of course it may put intermediate results in HDFS, e.g. for the result of a join). However, if your query does something like joining an HBase table with a native Hive table, then it will read data from both HBase and HDFS. Likewise, on the write side, it depends on whether your INSERT targets an HBase table or a native Hive table. The read and write sides are independent. JVS On Oct 13, 2010, at 2:24 PM, Otis Gospodnetic wrote: Thanks Tim. (and sorry for the duplicate email - need to fix my Hive email filter) Just to clarify one bit, though. When using Hive without HBase, one has data stored in the appropriate directories on HDFS and runs MR jobs against those data. But when using Hive *with* HBase, does Hive require any such data to be present in HDFS? In other words, when using Hive with HBase, one really uses only Hive's ability to translate a Hive QL statement into a set of MR jobs (reading from/writing to HBase) and execute them against only the data stored in HBase. Is this correct? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Message From: Tim Robertson timrobertson...@gmail.com To: user@hive.apache.org Sent: Wed, October 13, 2010 4:45:31 PM Subject: Re: HBase as input AND output? That's right. Hive can use an HBase table as an input to the Hive query regardless of the output format, and can also write the output to an HBase table regardless of the input format. You can also supposedly do a join in Hive that uses one side of the join from an HBase table and the other side from a text file, which is very powerful. I haven't done it myself, but intend to shortly. HTH, Tim On Wed, Oct 13, 2010 at 10:07 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, I was wondering how I can query data stored in HBase and remembered Hive's HBase integration: http://wiki.apache.org/hadoop/Hive/HBaseIntegration After watching John Sichi's video (http://developer.yahoo.com/blogs/hadoop/posts/2010/04/hundreds_of_hadoop_fans_at_the/ ) I have a better idea about what functionality this integration provides, but I still have some questions. Would it be correct to say that Hive-HBase integration makes the following data flows possible: 0) Hive or Files => Custom HQL statement that aggregates data => HBase 1) HBase => Custom HQL statement that aggregates data => HBase 2) HBase => Custom HQL statement that aggregates data => output (console?) Of the above, 1) is what I'm wondering about the most right now. In other words, it seems to me that Hive may be able to look at *just* data stored in HBase, *without* the typical data/files in HDFS that Hive normally runs its MR jobs against. Is this correct? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/
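As a hedged sketch of the mix-and-match case described above (every table name, column, and mapping below is a hypothetical example, not something from this thread): the read side joins an HBase-backed table with a native Hive table, while the write side targets a second HBase-backed table.

    -- HBase-backed input table.
    CREATE TABLE hbase_users (userid STRING, country STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:country");

    -- Ordinary native Hive table stored in HDFS.
    CREATE TABLE native_events (userid STRING, event STRING);

    -- HBase-backed output table.
    CREATE TABLE hbase_summary (userid STRING, event_count BIGINT)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,stats:event_count");

    -- The read side mixes HBase and HDFS data; the write side goes to HBase.
    INSERT OVERWRITE TABLE hbase_summary
    SELECT u.userid, count(1)
    FROM hbase_users u JOIN native_events e ON (u.userid = e.userid)
    GROUP BY u.userid;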