Hi, I have attached a SequenceFile with the following record format: <url:Text> <data:CrawlDatum>
(CrawlDatum is a custom Java type containing several fields that would be flattened into separate columns by the SerDe.) In other words, I would like to expose this URL + CrawlDatum data via a Hive external table with the following columns:

|| url || status || fetchtime || fetchinterval || modifiedtime || retries || score || metadata ||

So I was hoping that, after defining a custom SerDe, I would only have to define the Hive table as follows:

CREATE EXTERNAL TABLE crawldb (
  url STRING,
  status STRING,
  fetchtime BIGINT,
  fetchinterval BIGINT,
  modifiedtime BIGINT,
  retries INT,
  score FLOAT,
  metadata MAP<STRING, STRING>)
ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
STORED AS SEQUENCEFILE
LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';

For example, a sample record should look like the following through the Hive table:

|| http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 || 1 || 0.98 || {x=1,y=2,p=3,q=4} ||

I would like this to be possible without having to duplicate/flatten the data through a separate transformation. Initially, I thought my custom SerDe would just need the following definition for deserialize():

@Override
public Object deserialize(Writable obj) throws SerDeException {
  ...
}

The problem is that the input argument obj above is only the VALUE portion of a sequence record. There seems to be a limitation in the way Hive reads sequence files: for each row, the KEY is ignored and only the VALUE is handed to the SerDe. This can be seen in the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow() method, copied below from the Hive code, which ignores the key when iterating over a RecordReader:

/**
 * Get the next row. The fetch context is modified appropriately.
 **/
public InspectableObject getNextRow() throws IOException {
  try {
    while (true) {
      if (currRecReader == null) {
        currRecReader = getRecordReader();
        if (currRecReader == null) {
          return null;
        }
      }
      boolean ret = currRecReader.next(key, value);
      if (ret) {
        if (this.currPart == null) {
          Object obj = serde.deserialize(value);
          return new InspectableObject(obj, serde.getObjectInspector());
        } else {
          rowWithPart[0] = serde.deserialize(value);
          return new InspectableObject(rowWithPart, rowObjectInspector);
        }
      } else {
        currRecReader.close();
        currRecReader = null;
      }
    }
  } catch (Exception e) {
    throw new IOException(e);
  }
}

As you can see, the "key" variable is read but never passed to the SerDe or returned. The problem is that in the Nutch crawldb sequence file, the KEY is the URL, and I need it displayed in the Hive table along with the fields of CrawlDatum. But when writing the custom SerDe, I only see the CrawlDatum portion of each record, which is not sufficient.

One hack could be to write a CustomSequenceFileRecordReader.java that returns the offset in the sequence file as the KEY and an aggregation of (key + value) as the VALUE. For that, I would probably need to hack the code below from SequenceFileRecordReader, which could get quite messy:

protected synchronized boolean next(K key) throws IOException {
  if (!more) {
    return false;
  }
  long pos = in.getPosition();
  boolean remaining = (in.next(key) != null);
  if (pos >= end && in.syncSeen()) {
    more = false;
  } else {
    more = remaining;
  }
  return more;
}

This would require me to write a CustomSequenceFileRecordReader and a CustomSequenceFileInputFormat, plus the custom SerDe, and probably make several other changes as well.
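To make the hack a bit more concrete, here is an untested sketch of what I had in mind (the class names UrlAndDatumWritable and CrawlDbSequenceFileInputFormat are just names I made up). Instead of patching SequenceFileRecordReader itself, it wraps the stock reader and folds the key into a composite value, so that deserialize() would see both the URL and the CrawlDatum:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileRecordReader;
import org.apache.nutch.crawl.CrawlDatum;

/** Composite value: the URL (the original key) plus the CrawlDatum (the original value). */
class UrlAndDatumWritable implements Writable {
  private Text url = new Text();
  private CrawlDatum datum = new CrawlDatum();

  public Text getUrl() { return url; }
  public CrawlDatum getDatum() { return datum; }
  public void set(Text url, CrawlDatum datum) { this.url = url; this.datum = datum; }

  @Override
  public void write(DataOutput out) throws IOException {
    url.write(out);
    datum.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    url.readFields(in);
    datum.readFields(in);
  }
}

/** Hypothetical InputFormat that folds the sequence-file key into the value passed to the SerDe. */
public class CrawlDbSequenceFileInputFormat
    extends FileInputFormat<LongWritable, UrlAndDatumWritable> {

  @Override
  public RecordReader<LongWritable, UrlAndDatumWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {

    // Delegate the actual sequence-file parsing to the stock reader.
    final SequenceFileRecordReader<Text, CrawlDatum> inner =
        new SequenceFileRecordReader<Text, CrawlDatum>(job, (FileSplit) split);

    return new RecordReader<LongWritable, UrlAndDatumWritable>() {
      @Override
      public boolean next(LongWritable key, UrlAndDatumWritable value) throws IOException {
        Text url = inner.createKey();
        CrawlDatum datum = inner.createValue();
        if (!inner.next(url, datum)) {
          return false;
        }
        key.set(inner.getPos());   // Hive ignores the key anyway, so use the file offset.
        value.set(url, datum);     // URL and CrawlDatum travel together in the value.
        return true;
      }

      @Override public LongWritable createKey() { return new LongWritable(); }
      @Override public UrlAndDatumWritable createValue() { return new UrlAndDatumWritable(); }
      @Override public long getPos() throws IOException { return inner.getPos(); }
      @Override public void close() throws IOException { inner.close(); }
      @Override public float getProgress() throws IOException { return inner.getProgress(); }
    };
  }
}

The custom SerDe's deserialize() would then receive the UrlAndDatumWritable and could expose the URL as the first column. But that still leaves the problem of wiring such an InputFormat into the table definition, which brings me to my questions.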
So my questions are:

1. Is it possible to get away with just writing a custom SerDe, using some pre-existing reader that includes the key when invoking SerDe.deserialize()?

2. Unless I'm missing something, why does Hive have this limitation when accessing sequence files? I would imagine the key of a sequence-file record is just as important as the value, so why is it left out by the FetchOperator::getNextRow() method?

3. If this is the unfortunate reality when reading sequence files in Hive, is there another storage format I should use that works around the limitation, something like "CREATE EXTERNAL TABLE ... STORED AS CUSTOM_SEQUENCEFILE"? Or, say I write my own CustomHiveSequenceFileInputFormat: how do I register it with Hive and use it in the "STORED AS" definition?

Any help or pointers would be greatly appreciated. I hope I'm mistaken about the limitation above, and if not, hopefully there is an easy way to resolve this through a custom SerDe alone.

Warm regards,
Safdar
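P.S. To make question 3 more concrete, here is roughly the table definition I was imagining, assuming Hive lets the InputFormat and OutputFormat classes be named explicitly in the DDL. The jar path and the CrawlDbSequenceFileInputFormat / NutchCrawlDBSequenceFileSerDe class names are just the placeholders from my sketches above.

-- Hypothetical jar containing the custom SerDe and InputFormat classes.
ADD JAR /path/to/nutch-crawldb-hive.jar;

CREATE EXTERNAL TABLE crawldb (
  url STRING,
  status STRING,
  fetchtime BIGINT,
  fetchinterval BIGINT,
  modifiedtime BIGINT,
  retries INT,
  score FLOAT,
  metadata MAP<STRING, STRING>)
ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
STORED AS
  INPUTFORMAT 'CrawlDbSequenceFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';

The idea being that deserialize() would then be handed the combined URL + CrawlDatum object instead of just the CrawlDatum. Is this the intended way to plug in a custom InputFormat, or is there a simpler route?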

