Also, if I return a fully formatted string containing all the flattened
values from my key+value (such as what you suggested), then I'd need to
split the resulting string into its component columns based on the
delimiter ("," or ";" or "\t" etc). How do I define the right table for
that?

In other words, my custom input format will return a value string of this
form:

<Text>;<cd.status>;<cd.fetchTime>;<cd.retries>;<cd.map>;.....

And so, on the Hive side, I'd like to use ";" as the delimiter. Typically
this Hive table would be defined as:

CREATE TABLE crawldb (.....) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' .... ....

Would I now be able to define my table the same way, using my custom input
format:

*CREATE TABLE crawldb (...) INPUTFORMAT 'MyFlatteningInputFormat' FIELDS TERMINATED BY ';' LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';* ?

Thanks,
Safdar

On Sun, May 6, 2012 at 4:34 AM, Ali Safdar Kureishy <[email protected]> wrote:

> Thanks Edward.
>
> What are the Input and Output formats chosen by Hive for the "STORED
> AS SEQUENCEFILE" selection? And if I want to add my own syntactic
> sugar, is there a lookup mechanism where I can register my custom code
> so that it would work with "STORED AS MYCUSTOMSEQUENCEFILE"?
>
> Thanks,
> Safdar
>
>
> On Sun, May 6, 2012 at 1:16 AM, Edward Capriolo <[email protected]>
> wrote:
> > Stored as sequence file is syntax sugar. It sets both the inputformat
> > and outputformat.
> >
> > Create table x (thing int)
> > Inputformat 'class.x'
> > Outputformat 'class.y'
> >
> > For inputformat you can use your custom.
> >
> > For your output format you can stick with hive's
> > ignorekeytextoutputformat or ignorekeysequencefile format.
> >
> > To avoid having to write a serde your inputformat could also change the
> > types and format to something hive could easily recognize.
> >
> >
> > On Saturday, May 5, 2012, Ali Safdar Kureishy <[email protected]>
> > wrote:
> >> Thanks Edward...I feared this was going to be the case.
> >> If I define a new input format, how do I use it in a hive table
> >> definition?
> >> For the SequenceFileInputFormat, the table definition would read as
> >> "...STORED AS SEQUENCEFILE".
> >> With the new one, how do I specify it in the definition?
> >> "STORED AS 'com.xyz.abc.MyInputFormat'"?
> >> Thanks,
> >> Safdar
> >>
> >> On Sat, May 5, 2012 at 2:44 PM, Edward Capriolo <[email protected]>
> >> wrote:
> >>
> >> This is one of the things about Hive: the key is not easily available.
> >> You are going to need an input format that creates a new value which
> >> contains the key and the value.
> >>
> >> Like this:
> >> <url:Text> <data:CrawlDatum> -> <null-writable> new
> >> MyKeyValue<<url:Text> <data:CrawlDatum>>
> >>
> >>
> >> On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy
> >> <[email protected]> wrote:
> >>> Hi,
> >>>
> >>> I have attached a Sequence file with the following format:
> >>> <url:Text> <data:CrawlDatum>
> >>>
> >>> (CrawlDatum is a custom Java type that contains several fields that
> >>> would be flattened into several columns by the SerDe).
> >>>
> >>> In other words, what I would like to do is to expose this
> >>> URL+CrawlDatum data via a Hive External table, with the following
> >>> columns:
> >>> || url || status || fetchtime || fetchinterval || modifiedtime || retries
> >>> || score || metadata ||
> >>>
> >>> So, I was hoping that after defining a custom SerDe, I would just have
> >>> to define the Hive table as follows:
> >>>
> >>> CREATE EXTERNAL TABLE crawldb
> >>> (url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
> >>> modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP)
> >>> ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
> >>> STORED AS SEQUENCEFILE
> >>> LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
> >>>
> >>> For example, a sample record should look like the following through a
> >>> Hive table:
> >>> || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834
> >>> || 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
> >>>
> >>> I would like this to be possible without having to duplicate/flatten
> >>> the data through a separate transformation.
> >>> Initially, I thought my custom SerDe could have the following
> >>> definition for deserialize():
> >>>
> >>> @Override
> >>> public Object deserialize(Writable obj) throws SerDeException {
> >>>     ...
> >>> }
> >>>
> >>> But the problem is that the input argument obj above is only the
> >>> VALUE portion of a Sequence record. There seems to be a limitation with
> >>> the way Hive reads Sequence files. Specifically, for each row in a
> >>> sequence file, the KEY is ignored and only the VALUE is used by Hive.
> >>> This is seen from the
> >>> org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow() method
> >>> below, which ignores the KEY when iterating over a RecordReader (see
> >>> bold text below from the corresponding Hive code for
> >>> FetchOperator::getNextRow()):
> >>>
> >>> /**
> >>>  * Get the next row. The fetch context is modified appropriately.
> >>>  *
> >>>  **/
> >>> public InspectableObject getNextRow() throws IOException {
> >>>   try {
> >>>     while (true) {
> >>>       if (currRecReader == null) {
> >>>         currRecReader = getRecordReader();
> >>>         if (currRecReader == null) {
> >>>           return null;
> >>>         }
> >>>       }
> >>>
> >>>       boolean ret = currRecReader.next(key, value);
> >>>       if (ret) {
> >>>         if (this.currPart == null) {
> >>>           Object obj = serde.deserialize(value);
> >>>           return new InspectableObject(obj,
> >>>               serde.getObjectInspector());
> >>>         } else {
> >>>           rowWithPart[0] = serde.deserialize(value);
> >>>           return new InspectableObject(rowWithPart,
> >>>               rowObjectInspector);
> >>>         }
> >>>       } else {
> >>>         currRecReader.close();
> >>>         currRecReader = null;
> >>>       }
> >>>     }
> >>>   } catch (Exception e) {
> >>>     throw new IOException(e);
> >>>   }
> >>> }
> >>>
> >>> As you can see, the "key" variable is ignored and never returned. The
> >>> problem is that in the Nutch crawldb Sequence File, the KEY is the URL,
> >>> and I need it to be displayed in the Hive table along with the fields
> >>> of CrawlDatum.
> >>> But when writing the custom SerDe, I only see the CrawlDatum that
> >>> comes after the key, on each record...which is not sufficient.
> >>>
> >>> One hack could be to write a CustomSequenceFileRecordReader.java that
> >>> returns the offset in the sequence file as the KEY, and an aggregation
> >>> of the (Key+Value) as th
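
The flattening discussed in this thread (joining the key and the CrawlDatum fields into one delimited value string, then splitting it back into columns) can be sketched in plain Java as below. This is only an illustration under stated assumptions: `FlattenSketch`, `flatten`, and `split` are hypothetical names, the parameters are stand-ins for the URL key and a few CrawlDatum fields, and no Hadoop, Nutch, or Hive classes are used. In a real custom input format, the RecordReader would presumably place a string like this into the value it hands to Hive.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: not part of Hive, Hadoop, or Nutch.
final class FlattenSketch {
    // Field delimiter assumed from the thread; no escaping is performed.
    static final String DELIM = ";";

    // Joins the key (url) and stand-in CrawlDatum fields into one row string.
    static String flatten(String url, String status, long fetchTime,
                          int retries, Map<String, String> metadata) {
        StringBuilder map = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (map.length() > 0) {
                map.append(',');
            }
            map.append(e.getKey()).append('=').append(e.getValue());
        }
        return String.join(DELIM, url, status, Long.toString(fetchTime),
                Integer.toString(retries), map.toString());
    }

    // Splits a row back into columns; limit -1 keeps trailing empty columns.
    static List<String> split(String row) {
        return Arrays.asList(row.split(Pattern.quote(DELIM), -1));
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("x", "1");
        meta.put("y", "2");
        String row = flatten("http://www.cnn.com", "FETCHED", 125355734857L, 1, meta);
        System.out.println(row);
        System.out.println(split(row));
    }
}
```

One caveat with this sketch: URLs can legally contain ";", so a row built this way could split into the wrong number of columns. A delimiter that cannot appear in the data (for example a tab, or the Ctrl-A character that Hive uses as its default field delimiter) would likely be a safer choice.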
