Hi, I have attached a SequenceFile with the following record format: <url:Text> <data:CrawlDatum>
(CrawlDatum is a custom Java type containing several fields that would be flattened into separate columns by the SerDe.) In other words, I would like to expose this URL + CrawlDatum data via a Hive external table with the following columns:

|| url || status || fetchtime || fetchinterval || modifiedtime || retries || score || metadata ||

So I was hoping that, after defining a custom SerDe, I would only have to define the Hive table as follows:

CREATE EXTERNAL TABLE crawldb (
  url STRING,
  status STRING,
  fetchtime BIGINT,
  fetchinterval BIGINT,
  modifiedtime BIGINT,
  retries INT,
  score FLOAT,
  metadata MAP<STRING, STRING>)
ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
STORED AS SEQUENCEFILE
LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';

For example, a sample record should look like the following through the Hive table:

|| http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 || 1 || 0.98 || {x=1,y=2,p=3,q=4} ||

I would like this to be possible without having to duplicate/flatten the data through a separate transformation. Initially, I thought my custom SerDe would just need the following definition for deserialize():

@Override
public Object deserialize(Writable obj) throws SerDeException {
  ...
}

The problem is that the input argument obj above is only the VALUE portion of a sequence record. There seems to be a limitation in the way Hive reads sequence files: for each row, the KEY is ignored and only the VALUE is handed to the SerDe. This can be seen in the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow() method, copied below from the Hive code, which ignores the key when iterating over a RecordReader:

/**
 * Get the next row. The fetch context is modified appropriately.
 **/
public InspectableObject getNextRow() throws IOException {
  try {
    while (true) {
      if (currRecReader == null) {
        currRecReader = getRecordReader();
        if (currRecReader == null) {
          return null;
        }
      }
      boolean ret = currRecReader.next(key, value);
      if (ret) {
        if (this.currPart == null) {
          Object obj = serde.deserialize(value);
          return new InspectableObject(obj, serde.getObjectInspector());
        } else {
          rowWithPart[0] = serde.deserialize(value);
          return new InspectableObject(rowWithPart, rowObjectInspector);
        }
      } else {
        currRecReader.close();
        currRecReader = null;
      }
    }
  } catch (Exception e) {
    throw new IOException(e);
  }
}

As you can see, the "key" variable is read but never passed to the SerDe or returned. The problem is that in the Nutch crawldb sequence file, the KEY is the URL, and I need it displayed in the Hive table along with the fields of CrawlDatum. But when writing the custom SerDe, I only see the CrawlDatum portion of each record, which is not sufficient.

One hack could be to write a CustomSequenceFileRecordReader.java that returns the offset in the sequence file as the KEY and an aggregation of (key + value) as the VALUE. For that, I would probably need to hack the code below from SequenceFileRecordReader, which could get quite messy:

protected synchronized boolean next(K key) throws IOException {
  if (!more) {
    return false;
  }
  long pos = in.getPosition();
  boolean remaining = (in.next(key) != null);
  if (pos >= end && in.syncSeen()) {
    more = false;
  } else {
    more = remaining;
  }
  return more;
}

This would require me to write a CustomSequenceFileRecordReader and a CustomSequenceFileInputFormat, plus the custom SerDe, and probably make several other changes as well.
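To make the hack a bit more concrete, here is an untested sketch of what I had in mind (the class names UrlAndDatumWritable and CrawlDbSequenceFileInputFormat are just names I made up). Instead of patching SequenceFileRecordReader itself, it wraps the stock reader and folds the key into a composite value, so that deserialize() would see both the URL and the CrawlDatum:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileRecordReader;
import org.apache.nutch.crawl.CrawlDatum;

/** Composite value: the URL (the original key) plus the CrawlDatum (the original value). */
class UrlAndDatumWritable implements Writable {
  private Text url = new Text();
  private CrawlDatum datum = new CrawlDatum();

  public Text getUrl() { return url; }
  public CrawlDatum getDatum() { return datum; }
  public void set(Text url, CrawlDatum datum) { this.url = url; this.datum = datum; }

  @Override
  public void write(DataOutput out) throws IOException {
    url.write(out);
    datum.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    url.readFields(in);
    datum.readFields(in);
  }
}

/** Hypothetical InputFormat that folds the sequence-file key into the value passed to the SerDe. */
public class CrawlDbSequenceFileInputFormat
    extends FileInputFormat<LongWritable, UrlAndDatumWritable> {

  @Override
  public RecordReader<LongWritable, UrlAndDatumWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {

    // Delegate the actual sequence-file parsing to the stock reader.
    final SequenceFileRecordReader<Text, CrawlDatum> inner =
        new SequenceFileRecordReader<Text, CrawlDatum>(job, (FileSplit) split);

    return new RecordReader<LongWritable, UrlAndDatumWritable>() {
      @Override
      public boolean next(LongWritable key, UrlAndDatumWritable value) throws IOException {
        Text url = inner.createKey();
        CrawlDatum datum = inner.createValue();
        if (!inner.next(url, datum)) {
          return false;
        }
        key.set(inner.getPos());   // Hive ignores the key anyway, so use the file offset.
        value.set(url, datum);     // URL and CrawlDatum travel together in the value.
        return true;
      }

      @Override public LongWritable createKey() { return new LongWritable(); }
      @Override public UrlAndDatumWritable createValue() { return new UrlAndDatumWritable(); }
      @Override public long getPos() throws IOException { return inner.getPos(); }
      @Override public void close() throws IOException { inner.close(); }
      @Override public float getProgress() throws IOException { return inner.getProgress(); }
    };
  }
}

The custom SerDe's deserialize() would then receive the UrlAndDatumWritable and could expose the URL as the first column. But that still leaves the problem of wiring such an InputFormat into the table definition, which brings me to my questions.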
So my questions are:

1. Is it possible to get away with just writing a custom SerDe, using some pre-existing reader that includes the key when invoking SerDe.deserialize()?

2. Unless I'm missing something, why does Hive have this limitation when accessing sequence files? I would imagine the key of a sequence-file record is just as important as the value, so why is it left out by the FetchOperator::getNextRow() method?

3. If this is the unfortunate reality when reading sequence files in Hive, is there another storage format I should use that works around the limitation, something like "CREATE EXTERNAL TABLE ... STORED AS CUSTOM_SEQUENCEFILE"? Or, say I write my own CustomHiveSequenceFileInputFormat: how do I register it with Hive and use it in the "STORED AS" definition?

Any help or pointers would be greatly appreciated. I hope I'm mistaken about the limitation above, and if not, hopefully there is an easy way to resolve this through a custom SerDe alone.

Warm regards,
Safdar
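P.S. To make question 3 more concrete, here is roughly the table definition I was imagining, assuming Hive lets the InputFormat and OutputFormat classes be named explicitly in the DDL. The jar path and the CrawlDbSequenceFileInputFormat / NutchCrawlDBSequenceFileSerDe class names are just the placeholders from my sketches above.

-- Hypothetical jar containing the custom SerDe and InputFormat classes.
ADD JAR /path/to/nutch-crawldb-hive.jar;

CREATE EXTERNAL TABLE crawldb (
  url STRING,
  status STRING,
  fetchtime BIGINT,
  fetchinterval BIGINT,
  modifiedtime BIGINT,
  retries INT,
  score FLOAT,
  metadata MAP<STRING, STRING>)
ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
STORED AS
  INPUTFORMAT 'CrawlDbSequenceFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';

The idea being that deserialize() would then be handed the combined URL + CrawlDatum object instead of just the CrawlDatum. Is this the intended way to plug in a custom InputFormat, or is there a simpler route?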

