Re: Supported character set of hbase.columns.mapping?

2014-08-14 Thread Andrew Mains

Hi Felix,

Good question. Looking at the parsing code for column mappings in Hive 0.13.1 
(https://github.com/apache/hive/blob/release-0.13.1/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java#L177), 
there doesn't currently appear to be any support for escaping. Trunk looks to 
have the same issue.


According to the documentation 
(https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-ColumnMapping) 
the mapping entries are comma separated, and of the form:


:key

or of the form

column-family-name:[column-name][#(binary|string)]

(The type specification delimited by '#' was added in Hive 0.9.0 via 
https://issues.apache.org/jira/browse/HIVE-1634; earlier versions interpreted 
everything as strings.)


So it seems that, at the moment, there isn't really a good workaround (that I 
can think of) for column families/qualifiers containing any of the reserved 
characters (':', '#', ',').
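
For reference, a mapping that sticks to the supported characters looks 
something like this (a minimal sketch; the table, column family, and qualifier 
names are made up):

CREATE EXTERNAL TABLE hbase_example (rowkey STRING, val STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val#string")
TBLPROPERTIES ("hbase.table.name" = "example_table");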


Might be time for a patch :).

Andrew

On 8/13/14, 9:21 PM, Felix Wang wrote:


Hi,

We want to create Hive EXTERNAL TABLE to map to HBase tables.

Our question is: what character set can hbase.columns.mapping support?


For example, for the clause below:

CREATE EXTERNAL TABLE SomeTable (Default_Key STRING, `Hive Column Name` STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "<HBase Column Name>")
TBLPROPERTIES ("hbase.table.name" = "SomeHBaseTable");


What kind of character set can the HBase column name support?

In particular, it seems that when there is a COLON (:) or COMMA (,) it does 
not work. Is there any workaround (e.g., a way to escape them) in these cases?


Thanks,

-Felix





request write access to Hive Wiki

2014-08-14 Thread Li, Chengxiang
Hi,
I'm working on HIVE-7142 (https://issues.apache.org/jira/browse/HIVE-7142), which 
requires documenting some information on the Hive wiki, but I don't have write access 
to the Hive wiki yet. Could someone help me with this?
My Confluence username: chengxiang.li

Thanks
chengxiang


Data in Hive

2014-08-14 Thread CHEBARO Abdallah
My target is to perform a SELECT query using Hive

When I have small data on a single machine (namenode), I start by:
1- Creating a table that contains this data: create table table1 (col1 int, 
col2 string);
2- Loading the data from a file path: load data local inpath 'path' into table 
table1;
3- Performing my SELECT query: select * from table1 where col1 > 0;

I have huge data, 10 million rows, that doesn't fit on a single machine. 
Let's assume Hadoop divided my data across, for example, 10 datanodes and each 
datanode contains 1 million rows.

Retrieving the data to a single computer is impossible due to its huge size, or 
would take a lot of time if it were possible.

Will Hive create a table at each datanode and perform the SELECT query there,
or will Hive move all the data to one location (datanode) and create one table? 
(which would be inefficient)


Re: request write access to Hive Wiki

2014-08-14 Thread Ashutosh Chauhan
Hi Chengxiang,

Granted you write access to the wiki.

Thanks,
Ashutosh


On Thu, Aug 14, 2014 at 2:38 AM, Li, Chengxiang chengxiang...@intel.com
wrote:

  Hi,

 I'm working on HIVE-7142 (https://issues.apache.org/jira/browse/HIVE-7142),
 which requires documenting some information on the Hive wiki, but I don't have
 write access to the Hive wiki yet. Could someone help me with this?

 My Confluence username: chengxiang.li



 Thanks

 chengxiang



Re: ArrayWritableGroupConverter

2014-08-14 Thread Brock Noland
Hi,

Can you share your parquet schema?

Brock


On Tue, Aug 12, 2014 at 5:06 PM, Raymond Lau raymond.lau...@gmail.com
wrote:

 Hello.  (First off, sorry if I accidentally posted to the wrong mailing
 list before - dev - and you are getting this again)

 Regarding the ArrayWritableGroupConverter class: I was just wondering why
 the field count has to be either 1 or 2?  I'm trying to read a column
 where the number of fields is 3, and I'm getting an "invalid parquet hive
 schema" error (in Hive 0.12) when I try to do so.  It looks like it links
 back to here:

 https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ArrayWritableGroupConverter.java


 Thanks,
 -Raymond



Re: SerDe errors

2014-08-14 Thread Roberto Congiu
Can you provide the CREATE statement used to create the table and a sample
of the JSON that's causing the error? It sounds like you have a field declared
as bigint in the schema, but it's actually an object.
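
For example (an illustrative sketch with made-up names, assuming the HCatalog
JsonSerDe; swap in whichever SerDe class you are actually using), if the JSON
contains a nested object such as {"price": {"amount": 10, "currency": "USD"}},
the column has to be declared as a struct rather than a bigint:

CREATE TABLE example_json (
  price STRUCT<amount:BIGINT, currency:STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

Declaring price as BIGINT while the data holds an object is the kind of
mismatch that produces a cast error like the one you are seeing.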


On Wed, Aug 13, 2014 at 5:05 AM, Charles Robertson 
charles.robert...@gmail.com wrote:

 Hi all,

 I have a Hive table which relies on a JSON SerDe to read the underlying
 files. When I ran the create script I specified the SerDe and it all went
 fine and the data was visible in the views above the table. When I tried to
 query the table directly, though, I received a ClassNotFound error. I
 solved this by putting the SerDe JAR in /usr/lib/hive/lib.

 Now, however, when I try to query the data I get:

 Failed with exception
 java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
 java.lang.ClassCastException: org.json.JSONObject cannot be cast to
 [Ljava.lang.Object;

 (The serde is the json serde provided by Apache)

 Can anyone suggest why it was working before, but no longer is?

 Thanks,
 Charles




-- 
--
Good judgement comes with experience.
Experience comes with bad judgement.
--
Roberto Congiu - Data Engineer - OpenX
tel: +1 626 466 1141


Re: ArrayWritableGroupConverter

2014-08-14 Thread Raymond Lau
Original Thrift schema:

struct teststruct {
  1: optional string field1;
  2: optional string field2;
  3: optional string field3;
}

struct mainstruct {
  1: optional list<teststruct> teststructs;
}

This parquet file schema was generated:

message ParquetSchema {
  optional group teststructs {
repeated group teststruct_tuple {
  optional binary field1;
  optional binary field2;
  optional binary field3;
}
  }
}

When I try to run queries involving this 'teststructs' column, I get this
error:

Failed with exception java.io.IOException:java.lang.RuntimeException:
Invalid parquet hive schema: repeated group teststruct_tuple {
  optional binary field1;
  optional binary field2;
  optional binary field3;
}



On Thu, Aug 14, 2014 at 8:35 AM, Brock Noland br...@cloudera.com wrote:

 Hi,

 Can you share your parquet schema?

 Brock


 On Tue, Aug 12, 2014 at 5:06 PM, Raymond Lau raymond.lau...@gmail.com
 wrote:

 Hello.  (First off, sorry if I accidentally posted to the wrong mailing
 list before - dev - and you are getting this again)

 Regarding the ArrayWritableGroupConverter class: I was just wondering why
 the field count has to be either 1 or 2?  I'm trying to read a column
 where the number of fields is 3, and I'm getting an "invalid parquet hive
 schema" error (in Hive 0.12) when I try to do so.  It looks like it links
 back to here:

 https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ArrayWritableGroupConverter.java


 Thanks,
 -Raymond





-- 
*Raymond Lau*
Software Engineer - Intern |
r...@ooyala.com | (925) 395-3806


Re: Data in Hive

2014-08-14 Thread Nishant Kelkar
Hi there!

So Hive stores its data in HDFS. That means it is distributed by default.
The distribution factor is controlled by parameters, specifically the block
size (dfs.block.size). The file splits are also replicated, meaning that if
a data node were to fail, its replicas on other nodes would be available
to serve the content lost on the failed data node. Here's a good
StackOverflow thread that discusses the file split issue:
http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop

When Hive creates a table, it basically maintains metadata like schema
information, comments about each column, HDFS file location, input formats,
etc. So it is a layer of abstraction over a raw HDFS file, in essence.
Here's a good StackOverflow question that attempts to provide some details:
http://stackoverflow.com/questions/17065672/what-does-the-hive-metastore-and-name-node-do-in-a-cluster
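
As a quick way to see that metadata yourself (a hedged example reusing the
table name from your message):

DESCRIBE FORMATTED table1;

The output includes the HDFS location the table's files live under, plus the
input format and SerDe recorded in the metastore.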

So to answer your question, no, Hive does not move all data to one location
and create a single table. The whole point of using MapReduce as a
framework is to take the compute to the data, not vice versa.
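
You can also ask Hive for the query plan to see where the work happens (again
a sketch using the table and filter from your message):

EXPLAIN SELECT * FROM table1 WHERE col1 > 0;

The plan shows the MapReduce stages Hive compiles the query into; the map
tasks are scheduled close to the HDFS blocks they read, rather than pulling
all the data to one node first.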

Hope that helps!

Thanks and Regards,
Nishant Kelkar


On Thu, Aug 14, 2014 at 7:23 AM, CHEBARO Abdallah 
abdallah.cheb...@murex.com wrote:

  My target is to perform a SELECT query using Hive



 When I have small data on a single machine (namenode), I start by:

 1- Creating a table that contains this data: create table table1 (col1 int,
 col2 string);

 2- Loading the data from a file path: load data local inpath 'path' into
 table table1;

 3- Performing my SELECT query: select * from table1 where col1 > 0;



 I have huge data, 10 million rows, that doesn't fit on a single
 machine. Let's assume Hadoop divided my data across, for example, 10 datanodes
 and each datanode contains 1 million rows.

 Retrieving the data to a single computer is impossible due to its huge
 size, or would take a lot of time if it were possible.



 Will Hive create a table at each datanode and perform the SELECT query there,
 or will Hive move all the data to one location (datanode) and create one
 table? (which would be inefficient)




Re: Altering the Metastore on EC2

2014-08-14 Thread Stephen Sprague
I'll take a stab at this.

- Probably no reason.

- If you can, yes. Is there a Derby client through which you can issue the
command "alter table COLUMNS_V2 modify TYPE_NAME varchar(32672)"? Otherwise,
maybe use the MySQL or Postgres metastore (instead of Derby) and run that
alter command after the install (see the sketch below this list).

- The schema only exists in one place, and that's the metastore (which is
probably on your namenode for Derby). For MySQL or Postgres it can be
anywhere you want, but again, examples will probably show localhost (the
namenode).
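
Untested sketches of that alter, one per metastore database (syntax differs per
engine; check the column and table names against your actual metastore schema
before running anything):

-- Derby
ALTER TABLE COLUMNS_V2 ALTER COLUMN TYPE_NAME SET DATA TYPE VARCHAR(32672);

-- MySQL
ALTER TABLE COLUMNS_V2 MODIFY TYPE_NAME VARCHAR(32672);

-- Postgres (the Hive Postgres metastore schema quotes its identifiers)
ALTER TABLE "COLUMNS_V2" ALTER COLUMN "TYPE_NAME" TYPE VARCHAR(32672);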

That's a mighty big schema! Do you not just want to use a string type and use
get_json_object to pull data out of it dynamically? Not as elegant as using
static syntax like nested structs, but it's better than nothing. Something to
think about, anyway.
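
Something along these lines (a sketch with made-up table and column names):

SELECT get_json_object(json_col, '$.some.deeply.nested.field')
FROM raw_json_table;

where raw_json_table keeps each whole JSON record in a single STRING column
(json_col), and the '$.path' expression pulls out the piece you need at query
time.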

I'm guessing that, given a nested struct that large, you'll get over one hump only
to be faced with another one. Hive needs to do some crazy mapping there for
every record. Hopefully that's optimized. :)

Good luck! I'd be curious how it goes.


On Mon, Aug 11, 2014 at 5:52 PM, David Beveridge dbeveri...@cylance.com
wrote:

 We are creating a Hive schema for reading massive JSON files. Our JSON
 schema is rather large, and we have found that the default metastore schema
 for Hive cannot work for us as-is.

 To be specific, one field in our schema has about 17KB of nested structs
 within it. Unfortunately, it appears that Hive has a limit of varchar(4000)
 for the field that stores the resulting definition:



 CREATE TABLE COLUMNS_V2 (
   CD_ID bigint NOT NULL,
   COMMENT varchar(4000),
   COLUMN_NAME varchar(128) NOT NULL,
   TYPE_NAME varchar(4000),
   INTEGER_IDX INTEGER NOT NULL,
   PRIMARY KEY (CD_ID, COLUMN_NAME)
 );



 We are running this on Amazon MapReduce (v0.11 with default Derby
 metastore)



 So, our initial questions are:

 · Is there a reason that TYPE_NAME is limited to 4000? (IIUC, varchar on
 Derby can grow to 32672, which would be sufficient for a long time.)

 · Can we alter the metastore schema without hacking/reinstalling
 Hive? (if so, how?)

 · If so, is there a proper way to update the schema on all nodes?





 Thanks in advance!

 --DB