Thanks Edward... I feared this was going to be the case. If I define a new input format, how do I use it in a Hive table definition?
For the SequenceFileInputFormat, the table definition would read as "...STORED AS SEQUENCEFILE". With the new one, how do I specify it in the definition? Something like "STORED AS 'com.xyz.abc.MyInputFormat'"?
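For example, is it something along these lines? I'm guessing at the syntax here: the SerDe and input format class names are just placeholders for whatever I end up writing, I copied the output format class from what "STORED AS SEQUENCEFILE" appears to use (for an external, read-only table I assume it hardly matters), and I'm also guessing that LONG should really be BIGINT and that the metadata map needs type parameters.

CREATE EXTERNAL TABLE crawldb
  (url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
   modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING, STRING>)
ROW FORMAT SERDE 'com.xyz.abc.NutchCrawlDBSequenceFileSerDe'
STORED AS
  INPUTFORMAT 'com.xyz.abc.MyInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';

Or is there a different way to register the input format class with Hive so that a plain "STORED AS" alias can refer to it?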
Thanks,
Safdar

On Sat, May 5, 2012 at 2:44 PM, Edward Capriolo <[email protected]> wrote:

> This is one of the things about Hive: the key is not easily available.
> You are going to need an input format that creates a new value which
> contains both the key and the value.
>
> Like this:
> <url:Text> <data:CrawlDatum> -> <null-writable> new MyKeyValue<<url:Text> <data:CrawlDatum>>
>
>
> On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy
> <[email protected]> wrote:
> > Hi,
> >
> > I have attached a Sequence file with the following format:
> > <url:Text> <data:CrawlDatum>
> >
> > (CrawlDatum is a custom Java type that contains several fields that would
> > be flattened into several columns by the SerDe.)
> >
> > In other words, what I would like to do is to expose this URL+CrawlDatum
> > data via a Hive external table, with the following columns:
> > || url || status || fetchtime || fetchinterval || modifiedtime || retries || score || metadata ||
> >
> > So, I was hoping that after defining a custom SerDe, I would just have to
> > define the Hive table as follows:
> >
> > CREATE EXTERNAL TABLE crawldb
> >   (url STRING, status STRING, fetchtime LONG, fetchinterval LONG,
> >    modifiedtime LONG, retries INT, score FLOAT, metadata MAP)
> > ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
> > STORED AS SEQUENCEFILE
> > LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
> >
> > For example, a sample record should look like the following through a Hive
> > table:
> > || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 || 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
> >
> > I would like this to be possible without having to duplicate/flatten the
> > data through a separate transformation. Initially, I thought my custom SerDe
> > could have the following definition for deserialize():
> >
> > @Override
> > public Object deserialize(Writable obj) throws SerDeException {
> >   ...
> > }
> >
> > But the problem is that the input argument obj above is only the
> > VALUE portion of a SequenceFile record. There seems to be a limitation with
> > the way Hive reads Sequence files. Specifically, for each row in a sequence
> > file, the KEY is ignored and only the VALUE is used by Hive. This is seen
> > from the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow() method,
> > which ignores the KEY when iterating over a RecordReader (see the
> > corresponding Hive code for FetchOperator::getNextRow() below):
> >
> > /**
> >  * Get the next row. The fetch context is modified appropriately.
> >  *
> >  **/
> > public InspectableObject getNextRow() throws IOException {
> >   try {
> >     while (true) {
> >       if (currRecReader == null) {
> >         currRecReader = getRecordReader();
> >         if (currRecReader == null) {
> >           return null;
> >         }
> >       }
> >
> >       boolean ret = currRecReader.next(key, value);
> >       if (ret) {
> >         if (this.currPart == null) {
> >           Object obj = serde.deserialize(value);
> >           return new InspectableObject(obj, serde.getObjectInspector());
> >         } else {
> >           rowWithPart[0] = serde.deserialize(value);
> >           return new InspectableObject(rowWithPart, rowObjectInspector);
> >         }
> >       } else {
> >         currRecReader.close();
> >         currRecReader = null;
> >       }
> >     }
> >   } catch (Exception e) {
> >     throw new IOException(e);
> >   }
> > }
> >
> > As you can see, the "key" variable is ignored and never returned. The
> > problem is that in the Nutch crawldb Sequence File, the KEY is the URL, and
> > I need it to be displayed in the Hive table along with the fields of
> > CrawlDatum. But when writing the custom SerDe, I only see the CrawlDatum
> > that comes after the key on each record... which is not sufficient.
> >
> > One hack could be to write a CustomSequenceFileRecordReader.java that
> > returns the offset in the sequence file as the KEY, and an aggregation of
> > the (Key+Value) as the VALUE. For that, perhaps I need to hack the code
> > below from SequenceFileRecordReader, which would get really messy:
> >
> > protected synchronized boolean next(K key) throws IOException {
> >   if (!more) return false;
> >   long pos = in.getPosition();
> >   boolean remaining = (in.next(key) != null);
> >   if (pos >= end && in.syncSeen()) {
> >     more = false;
> >   } else {
> >     more = remaining;
> >   }
> >   return more;
> > }
> >
> > This would require me to write a CustomSequenceFileRecordReader and a
> > CustomSequenceFileInputFormat and then some custom SerDe, and probably make
> > several other changes as well. Is it possible to just get away with writing
> > a custom SerDe and some pre-existing reader that includes the key when
> > invoking SerDe.deserialize()? Unless I'm missing something, why does Hive
> > have this limitation when accessing Sequence files? I would imagine that
> > the key of a sequence file record would be just as important as the
> > value... so why is it left out by the FetchOperator::getNextRow() method?
> >
> > If this is the unfortunate reality with reading Nutch sequence files in
> > Hive, is there another Hive storage format I should use that works around
> > this limitation? Such as "create external table ..... STORED AS
> > CUSTOM_SEQUENCEFILE"? Or, let's say I write my own
> > CustomHiveSequenceFileInputFormat, how do I register it with Hive and use it
> > in the Hive "STORED AS" definition?
> >
> > Any help or pointers would be greatly appreciated. I hope I'm mistaken about
> > the limitation above, and if not, hopefully there is an easy way to resolve
> > this through a custom SerDe alone.
> >
> > Warm regards,
> > Safdar
>
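P.S. Just to check that I have understood the wrapper suggestion above correctly: is the sketch below roughly the shape of what you mean? All the class and method names (other than the Hadoop and Nutch ones) are my own placeholders, I have not tried to compile it, and I used the old org.apache.hadoop.mapred API since that seems to be what Hive expects for input formats.

// MyKeyValue.java
// Value class that carries both the sequence-file key (url) and value (CrawlDatum),
// so that the SerDe can see the key as well.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;

public class MyKeyValue implements Writable {
  private Text url = new Text();
  private CrawlDatum datum = new CrawlDatum();

  public Text getUrl() { return url; }
  public CrawlDatum getDatum() { return datum; }

  public void set(Text url, CrawlDatum datum) {
    this.url = url;
    this.datum = datum;
  }

  public void write(DataOutput out) throws IOException {
    url.write(out);
    datum.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    url.readFields(in);
    datum.readFields(in);
  }
}

// MyInputFormat.java
// Input format whose record reader hands Hive a NullWritable key and a MyKeyValue
// value built from the underlying sequence-file <url, CrawlDatum> pair.
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileRecordReader;
import org.apache.nutch.crawl.CrawlDatum;

public class MyInputFormat extends FileInputFormat<NullWritable, MyKeyValue> {

  public RecordReader<NullWritable, MyKeyValue> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {

    final SequenceFileRecordReader<Text, CrawlDatum> inner =
        new SequenceFileRecordReader<Text, CrawlDatum>(job, (FileSplit) split);

    return new RecordReader<NullWritable, MyKeyValue>() {
      public NullWritable createKey() { return NullWritable.get(); }
      public MyKeyValue createValue() { return new MyKeyValue(); }

      public boolean next(NullWritable key, MyKeyValue value) throws IOException {
        Text url = inner.createKey();
        CrawlDatum datum = inner.createValue();
        if (!inner.next(url, datum)) {
          return false;
        }
        value.set(url, datum);  // the original key ends up inside the single value object
        return true;
      }

      public long getPos() throws IOException { return inner.getPos(); }
      public float getProgress() throws IOException { return inner.getProgress(); }
      public void close() throws IOException { inner.close(); }
    };
  }
}

If that is roughly right, then I assume my SerDe's deserialize() would receive a MyKeyValue and could pull the url out of it along with the flattened CrawlDatum fields.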

