Re: Supported character set of hbase.columns.mapping?
Hi Felix,

Good question. Looking at the parsing code for column mappings in Hive 0.13.1 (https://github.com/apache/hive/blob/release-0.13.1/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java#L177), there doesn't currently seem to be any support for escaping. Trunk looks to have the same issue.

According to the documentation (https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-ColumnMapping), the mapping entries are comma-separated, and each entry is either ":key" or of the form "column-family-name:[column-name][#(binary|string)]" (the type specification delimited by '#' was added in Hive 0.9.0, https://issues.apache.org/jira/browse/HIVE-1634; earlier versions interpreted everything as strings).

So, it seems that at the moment there's not necessarily a good workaround (that I can think of) for column families/qualifiers with any of the reserved characters (':', '#', ',') in them. Might be time for a patch :).

Andrew

On 8/13/14, 9:21 PM, Felix Wang wrote:
Hi,

We want to create a Hive EXTERNAL TABLE that maps to an HBase table. Our question is, what character set does hbase.columns.mapping support? For example, in a clause like:

CREATE EXTERNAL TABLE SomeTable (Default_Key STRING, `Hive Column Name` STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "HBase Column Name")
TBLPROPERTIES ("hbase.table.name" = "SomeHBaseTable");

what characters can the HBase column name contain? In particular, it does not seem to work when there is a COLON (:) or COMMA (,). Is there any workaround (like, how to escape them) in these cases? Thanks, -Felix
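For reference, a mapping that stays inside the supported character set looks roughly like the sketch below (the table, column family, and qualifier names here are hypothetical, not from Felix's schema):

CREATE EXTERNAL TABLE hbase_mapped (
  rowkey  STRING,
  city    STRING,
  visits  BIGINT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- one comma-separated entry per Hive column:
  --   ":key"             maps to the HBase row key
  --   "cf:city"          maps to family "cf", qualifier "city", read as a string
  --   "cf:visits#binary" maps to family "cf", qualifier "visits", decoded as binary
  "hbase.columns.mapping" = ":key,cf:city,cf:visits#binary"
)
TBLPROPERTIES ("hbase.table.name" = "mytable");

Because ':' separates family from qualifier, ',' separates entries, and '#' introduces the type, none of those three characters can appear inside a family or qualifier name with the current parser.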
request write access to Hive Wiki
Hi,

I'm working on HIVE-7142 (https://issues.apache.org/jira/browse/HIVE-7142), which requires documenting some information on the Hive wiki, but I don't yet have write access to the wiki. Could someone help me with this?

My Confluence username: chengxiang.li

Thanks
chengxiang
Data in Hive
My goal is to perform a SELECT query using Hive. When I have small data on a single machine (the namenode), I start by:

1- Creating a table that will contain this data: create table table1 (col1 int, col2 string);
2- Loading the data from a file path: load data local inpath 'path' into table table1;
3- Performing my SELECT query: select * from table1 where col1 > 0;

Now I have huge data, 10 million rows, that doesn't fit on a single machine. Let's assume Hadoop divides my data across, for example, 10 datanodes and each datanode holds 1 million rows. Retrieving all the data to a single computer is impossible due to its size, or would take a lot of time even if it were possible.

Will Hive create a table at each datanode and perform the SELECT query there, or will Hive move all the data to one location (datanode) and create one table (which would be inefficient)?
Re: request write access to Hive Wiki
Hi Chengxiang,

You've been granted write access to the wiki.

Thanks,
Ashutosh

On Thu, Aug 14, 2014 at 2:38 AM, Li, Chengxiang chengxiang...@intel.com wrote:
Hi, I'm working on HIVE-7142 (https://issues.apache.org/jira/browse/HIVE-7142), which requires documenting some information on the Hive wiki, but I don't yet have write access to the wiki. Could someone help me with this? My Confluence username: chengxiang.li Thanks chengxiang
Re: ArrayWritableGroupConverter
Hi,

Can you share your Parquet schema?

Brock

On Tue, Aug 12, 2014 at 5:06 PM, Raymond Lau raymond.lau...@gmail.com wrote:
Hello. (First off, sorry if I accidentally posted to the wrong mailing list before - dev - and you are getting this again.) Regarding the ArrayWritableGroupConverter class: I was just wondering why the field count has to be either 1 or 2? I'm trying to read a column where the number of fields is 3, and I'm getting an "invalid parquet hive schema" error (in Hive 0.12) when I try to do so. It looks like it links back to here: https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ArrayWritableGroupConverter.java Thanks, -Raymond
Re: SerDe errors
Can you provide the CREATE statement used to create the table and a sample of the JSON that's causing the error? It sounds like you have a field declared as bigint in the schema, but it's actually an object.

On Wed, Aug 13, 2014 at 5:05 AM, Charles Robertson charles.robert...@gmail.com wrote:
Hi all,

I have a Hive table which relies on a JSON SerDe to read the underlying files. When I ran the create script I specified the SerDe, it all went fine, and the data was visible in the views above the table. When I tried to query the table directly, though, I received a ClassNotFound error. I solved this by putting the SerDe JAR in /usr/lib/hive/lib. Now, however, when I try to query the data I get:

Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.json.JSONObject cannot be cast to [Ljava.lang.Object;

(The SerDe is the JSON SerDe provided by Apache.) Can anyone suggest why it was working before, but no longer is?

Thanks,
Charles

--
Good judgement comes with experience. Experience comes with bad judgement.
--
Roberto Congiu - Data Engineer - OpenX
tel: +1 626 466 1141
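As a hedged sketch of what is being asked for, a CREATE statement plus a sample record might look like the following. The table, columns, jar path, and data location here are hypothetical, and the SerDe class shown is the HCatalog one that ships with Hive; Charles may be using a different "Apache" JSON SerDe:

ADD JAR /usr/lib/hive/lib/json-serde.jar;   -- hypothetical jar name/location

CREATE EXTERNAL TABLE events (
  user_id BIGINT,
  payload STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/events';                    -- hypothetical HDFS path

-- Sample record the table expects to read:
--   {"user_id": 42, "payload": "click"}
-- The kind of mismatch Roberto is suggesting to look for is a column declared as a
-- scalar (e.g. bigint) whose JSON value is actually a nested object.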
Re: ArrayWritableGroupConverter
Original Thrift schema:

struct teststruct {
  1: optional string field1;
  2: optional string field2;
  3: optional string field3;
}

struct mainstruct {
  1: optional list<teststruct> teststructs;
}

This Parquet file schema was generated:

message ParquetSchema {
  optional group teststructs {
    repeated group teststruct_tuple {
      optional binary field1;
      optional binary field2;
      optional binary field3;
    }
  }
}

When I try to run queries involving this 'teststructs' column, I get this error:

Failed with exception java.io.IOException:java.lang.RuntimeException: Invalid parquet hive schema: repeated group teststruct_tuple { optional binary field1; optional binary field2; optional binary field3; }

On Thu, Aug 14, 2014 at 8:35 AM, Brock Noland br...@cloudera.com wrote:
Hi, Can you share your Parquet schema? Brock

On Tue, Aug 12, 2014 at 5:06 PM, Raymond Lau raymond.lau...@gmail.com wrote:
Hello. (First off, sorry if I accidentally posted to the wrong mailing list before - dev - and you are getting this again.) Regarding the ArrayWritableGroupConverter class: I was just wondering why the field count has to be either 1 or 2? I'm trying to read a column where the number of fields is 3, and I'm getting an "invalid parquet hive schema" error (in Hive 0.12) when I try to do so. It looks like it links back to here: https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ArrayWritableGroupConverter.java Thanks, -Raymond

--
Raymond Lau
Software Engineer - Intern | r...@ooyala.com | (925) 395-3806
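For context, the Hive column this maps to would be declared something like the sketch below (the table name is hypothetical; only the column type is inferred from the schemas above). A struct with three fields inside the array is exactly what trips the check Raymond mentions, since the converter only accepts repeated groups of one or two fields:

CREATE EXTERNAL TABLE main_struct_table (
  -- list<teststruct> from the Thrift schema becomes an array of a 3-field struct
  teststructs ARRAY<STRUCT<field1:STRING, field2:STRING, field3:STRING>>
)
STORED AS PARQUET;  -- STORED AS PARQUET is the shorthand in later Hive releases;
                    -- Hive 0.12 typically spells out the Parquet SerDe and input/output formats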
Re: Data in Hive
Hi there!

So Hive stores its data in HDFS. That means it is distributed by default. The distribution factor is controlled by parameters, specifically the block size (dfs.block.size). The file splits are also replicated, meaning that if a datanode were to fail, its replicas on other nodes would be available to serve the content lost on the failed datanode. Here's a good StackOverflow thread that discusses the file split issue: http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop

When Hive creates a table, it basically maintains metadata like schema information, comments about each column, HDFS file location, input formats, etc. So it is, in essence, a layer of abstraction over a raw HDFS file. Here's a good StackOverflow question that attempts to provide some details: http://stackoverflow.com/questions/17065672/what-does-the-hive-metastore-and-name-node-do-in-a-cluster

So to answer your question: no, Hive does not move all data to one location and create a single table. The whole point of using MapReduce as a framework is to take the compute to the data, not vice versa.

Hope that helps!

Thanks and Regards,
Nishant Kelkar

On Thu, Aug 14, 2014 at 7:23 AM, CHEBARO Abdallah abdallah.cheb...@murex.com wrote:
My goal is to perform a SELECT query using Hive. When I have small data on a single machine (the namenode), I start by: 1- Creating a table that will contain this data: create table table1 (col1 int, col2 string); 2- Loading the data from a file path: load data local inpath 'path' into table table1; 3- Performing my SELECT query: select * from table1 where col1 > 0; Now I have huge data, 10 million rows, that doesn't fit on a single machine. Let's assume Hadoop divides my data across, for example, 10 datanodes and each datanode holds 1 million rows. Retrieving all the data to a single computer is impossible due to its size, or would take a lot of time even if it were possible. Will Hive create a table at each datanode and perform the SELECT query there, or will Hive move all the data to one location (datanode) and create one table (which would be inefficient)?
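One quick way to see this in practice (a minimal sketch, reusing the table and query from the question above; exact output varies by Hive version) is to ask Hive for the compiled plan:

EXPLAIN SELECT * FROM table1 WHERE col1 > 0;

-- The plan shows one or more MapReduce stages; the map tasks are scheduled, as far as
-- possible, on the datanodes holding the table's HDFS blocks, so the filter runs where
-- the data lives and only the matching rows flow back to the client.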
Re: Altering the Metastore on EC2
I'll take a stab at this.

- Probably no reason.
- If you can. Is there a Derby client through which you can issue the command: alter table COLUMNS_V2 modify TYPE_NAME varchar(32672)? Otherwise maybe use the MySQL or Postgres metastores (instead of Derby) and run that alter command after the install.
- The schema only exists in one place, and that's the metastore (which is probably on your namenode for Derby). For MySQL or Postgres it can be anywhere you want, but again examples will probably show localhost (the namenode).

That's a mighty big schema! Don't you just want to use a string type and use get_json_object to pull data out of it dynamically? Not as elegant as using static syntax like nested structs, but it's better than nothing. Something to think about anyway. I'm guessing that given a nested struct that large you'll get over one hump only to be faced with another one. Hive needs to do some crazy mapping there for every record; hopefully that's optimized. :)

Good luck! I'd be curious how it goes.

On Mon, Aug 11, 2014 at 5:52 PM, David Beveridge dbeveri...@cylance.com wrote:
We are creating a Hive schema for reading massive JSON files. Our JSON schema is rather large, and we have found that the default metastore schema for Hive cannot work for us as-is. To be specific, one field in our schema has about 17KB of nested structs within it. Unfortunately, it appears that Hive has a limit of varchar(4000) for the field that stores the resulting definition:

CREATE TABLE COLUMNS_V2 (
  CD_ID bigint NOT NULL,
  COMMENT varchar(4000),
  COLUMN_NAME varchar(128) NOT NULL,
  TYPE_NAME varchar(4000),
  INTEGER_IDX INTEGER NOT NULL,
  PRIMARY KEY (CD_ID, COLUMN_NAME)
);

We are running this on Amazon MapReduce (v0.11 with default Derby metastore). So, our initial questions are:

· Is there a reason that TYPE_NAME is being limited to 4000? (IIUC, varchar on Derby can grow to 32672, which would be sufficient for a long time.)
· Can we alter the metastore schema without hacking/reinstalling Hive? (If so, how?)
· If so, is there a proper way to update the schema on all nodes?

Thanks in advance!
--DB
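Spelling that suggestion out (a sketch only; the table and column names come from the DDL quoted above, but check your metastore version before running anything, and note the ALTER syntax differs between databases):

-- Derby (the default EMR metastore mentioned above):
ALTER TABLE COLUMNS_V2 ALTER COLUMN TYPE_NAME SET DATA TYPE VARCHAR(32672);

-- MySQL equivalent, if you switch to a MySQL-backed metastore:
ALTER TABLE COLUMNS_V2 MODIFY TYPE_NAME VARCHAR(32672);

And the get_json_object fallback would look roughly like this in HiveQL (the json_doc column, table name, and path are hypothetical):

SELECT get_json_object(json_doc, '$.some.nested.field') FROM raw_json_table;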