Thanks, Edward. Which InputFormat and OutputFormat classes does Hive choose for the "STORED AS SEQUENCEFILE" selection? And if I want to add my own syntactic sugar, is there a lookup mechanism where I can register my custom code so that it would work with "STORED AS MYCUSTOMSEQUENCEFILE"?

Thanks,
Safdar
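In the Hive releases current at the time of this thread, SEQUENCEFILE expands to Hadoop's SequenceFileInputFormat plus Hive's own HiveSequenceFileOutputFormat, and the list of STORED AS keywords is baked into the grammar, so there is no lookup table to register MYCUSTOMSEQUENCEFILE against. The spelled-out equivalent, which does accept custom classes, looks like this (class names as Hive's sources of that era name them; verify against your build):

    CREATE TABLE x (thing INT)
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat';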
On Sun, May 6, 2012 at 1:16 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> STORED AS SEQUENCEFILE is syntactic sugar. It sets both the input format
> and the output format:
>
>   CREATE TABLE x (thing INT)
>   STORED AS
>     INPUTFORMAT 'class.x'
>     OUTPUTFORMAT 'class.y';
>
> For the input format you can use your custom class.
>
> For your output format you can stick with Hive's
> HiveIgnoreKeyTextOutputFormat or its sequence-file counterpart,
> HiveSequenceFileOutputFormat.
>
> To avoid having to write a SerDe, your input format could also change the
> types and format into something Hive can easily recognize.
>
> On Saturday, May 5, 2012, Ali Safdar Kureishy <safdar.kurei...@gmail.com>
> wrote:
>> Thanks Edward... I feared this was going to be the case.
>>
>> If I define a new input format, how do I use it in a Hive table
>> definition? For SequenceFileInputFormat, the table definition would read
>> "... STORED AS SEQUENCEFILE". With the new one, how do I specify it in
>> the definition? "STORED AS 'com.xyz.abc.MyInputFormat'"?
>>
>> Thanks,
>> Safdar
>>
>> On Sat, May 5, 2012 at 2:44 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>> This is one of the things about Hive: the key is not easily available.
>> You are going to need an input format that creates a new value which
>> contains both the key and the value, like this:
>>
>>   <url:Text> <data:CrawlDatum> -> <NullWritable>
>>   new MyKeyValue<<url:Text> <data:CrawlDatum>>
>>
>> On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy
>> <safdar.kurei...@gmail.com> wrote:
>>> Hi,
>>>
>>> I have attached a sequence file with the following format:
>>>
>>>   <url:Text> <data:CrawlDatum>
>>>
>>> (CrawlDatum is a custom Java type whose several fields would be
>>> flattened into several columns by the SerDe.)
>>>
>>> In other words, I would like to expose this URL + CrawlDatum data via a
>>> Hive external table with the following columns:
>>>
>>>   || url || status || fetchtime || fetchinterval || modifiedtime || retries || score || metadata ||
>>>
>>> So I was hoping that, after defining a custom SerDe, I would just have
>>> to define the Hive table as follows:
>>>
>>>   CREATE EXTERNAL TABLE crawldb
>>>     (url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
>>>      modifiedtime BIGINT, retries INT, score FLOAT,
>>>      metadata MAP<STRING, STRING>)
>>>   ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
>>>   STORED AS SEQUENCEFILE
>>>   LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
>>>
>>> For example, a sample record should look like the following through the
>>> Hive table:
>>>
>>>   || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 || 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
>>>
>>> I would like this to be possible without having to duplicate/flatten
>>> the data through a separate transformation. Initially, I thought my
>>> custom SerDe could simply provide the following definition for
>>> deserialize():
>>>
>>>   @Override
>>>   public Object deserialize(Writable obj) throws SerDeException {
>>>     ...
>>>   }
>>>
>>> But the problem is that the input argument obj above is only the VALUE
>>> portion of a sequence-file record. There is a limitation in the way
>>> Hive reads sequence files: for each row, the KEY is ignored and only
>>> the VALUE is used.
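To make that limitation concrete, here is a minimal sketch of the deserialize() side of the hypothetical NutchCrawlDBSequenceFileSerDe from the DDL above; the rest of the class (initialize(), the ObjectInspector, imports such as java.util.ArrayList) is omitted, and the accessor names are assumptions based on Nutch 1.x's CrawlDatum:

    // Excerpt only: the enclosing SerDe class is omitted. Accessor names
    // (getStatus(), getFetchTime(), ...) are assumptions based on Nutch's
    // CrawlDatum; adjust to the version you build against.
    @Override
    public Object deserialize(Writable blob) throws SerDeException {
      CrawlDatum datum = (CrawlDatum) blob;   // only the VALUE half ever arrives here
      List<Object> row = new ArrayList<Object>(8);
      row.add(null);                          // url column: the KEY never reaches the SerDe
      row.add(CrawlDatum.getStatusName(datum.getStatus()));
      row.add(datum.getFetchTime());
      row.add((long) datum.getFetchInterval());
      row.add(datum.getModifiedTime());
      row.add((int) datum.getRetriesSinceFetch());
      row.add(datum.getScore());
      row.add(datum.getMetaData());           // MapWritable; a real SerDe would convert it
      return row;
    }

Whatever the SerDe does, the url column can only ever come back NULL, because the key is dropped before deserialize() is called.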
>>> This is seen in the
>>> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow() method,
>>> reproduced below from the corresponding Hive code, which ignores the
>>> KEY while iterating over a RecordReader:
>>>
>>>   /**
>>>    * Get the next row. The fetch context is modified appropriately.
>>>    **/
>>>   public InspectableObject getNextRow() throws IOException {
>>>     try {
>>>       while (true) {
>>>         if (currRecReader == null) {
>>>           currRecReader = getRecordReader();
>>>           if (currRecReader == null) {
>>>             return null;
>>>           }
>>>         }
>>>
>>>         boolean ret = currRecReader.next(key, value);
>>>         if (ret) {
>>>           if (this.currPart == null) {
>>>             Object obj = serde.deserialize(value);
>>>             return new InspectableObject(obj, serde.getObjectInspector());
>>>           } else {
>>>             rowWithPart[0] = serde.deserialize(value);
>>>             return new InspectableObject(rowWithPart, rowObjectInspector);
>>>           }
>>>         } else {
>>>           currRecReader.close();
>>>           currRecReader = null;
>>>         }
>>>       }
>>>     } catch (Exception e) {
>>>       throw new IOException(e);
>>>     }
>>>   }
>>>
>>> As you can see, the "key" variable is never returned. The problem is
>>> that in the Nutch crawldb sequence file the KEY is the URL, and I need
>>> it displayed in the Hive table along with the fields of CrawlDatum. But
>>> when writing the custom SerDe, I only see the CrawlDatum that comes
>>> after the key on each record... which is not sufficient.
>>>
>>> One hack could be to write a CustomSequenceFileRecordReader.java that
>>> returns the offset in the sequence file as the KEY, and an aggregation
>>> of the (Key + Value) as the VALUE.
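A sketch of that hack, written against the old org.apache.hadoop.mapred API that Hive drives, might look like the following. KeyValueWrapper is a hypothetical composite Writable holding both halves of the record; it does not exist in Hadoop, Hive, or Nutch, and the reader would still need a matching InputFormat plus a SerDe that unpacks it:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.SequenceFileRecordReader;
    import org.apache.nutch.crawl.CrawlDatum;

    // Sketch only: wraps the stock sequence-file reader so the URL (the KEY
    // that Hive discards) survives inside the value handed to the SerDe.
    public class CustomSequenceFileRecordReader
        implements RecordReader<LongWritable, KeyValueWrapper> {

      private final SequenceFileRecordReader<Text, CrawlDatum> inner;
      private final Text url = new Text();
      private final CrawlDatum datum = new CrawlDatum();

      public CustomSequenceFileRecordReader(Configuration conf, FileSplit split)
          throws IOException {
        inner = new SequenceFileRecordReader<Text, CrawlDatum>(conf, split);
      }

      public boolean next(LongWritable key, KeyValueWrapper value) throws IOException {
        if (!inner.next(url, datum)) {
          return false;
        }
        key.set(inner.getPos()); // the file offset stands in for the discarded key
        value.set(url, datum);   // both halves survive into the value Hive keeps
        return true;
      }

      public LongWritable createKey() { return new LongWritable(); }
      public KeyValueWrapper createValue() { return new KeyValueWrapper(); }
      public long getPos() throws IOException { return inner.getPos(); }
      public float getProgress() throws IOException { return inner.getProgress(); }
      public void close() throws IOException { inner.close(); }
    }

An InputFormat returning this reader would then be named in the table definition through the spelled-out STORED AS INPUTFORMAT clause shown earlier, and the SerDe's deserialize() would receive a KeyValueWrapper instead of a bare CrawlDatum, so the url column no longer has to be NULL.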