Thanks Edward... I feared this was going to be the case. If I define a new input format, how do I use it in a Hive table definition?
For the SequenceFileInputFormat, the table definition would read as "...STORED AS SEQUENCEFILE". With the new one, how do I specify it in the definition? Something like "STORED AS 'com.xyz.abc.MyInputFormat'"?
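For example, is it something along these lines? I'm guessing at the syntax here: the SerDe and input format class names are just placeholders for whatever I end up writing, I copied the output format class from what "STORED AS SEQUENCEFILE" appears to use (for an external, read-only table I assume it hardly matters), and I'm also guessing that LONG should really be BIGINT and that the metadata map needs type parameters.

CREATE EXTERNAL TABLE crawldb
  (url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
   modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING, STRING>)
ROW FORMAT SERDE 'com.xyz.abc.NutchCrawlDBSequenceFileSerDe'
STORED AS
  INPUTFORMAT 'com.xyz.abc.MyInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';

Or is there a different way to register the input format class with Hive so that a plain "STORED AS" alias can refer to it?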
Thanks,
Safdar

On Sat, May 5, 2012 at 2:44 PM, Edward Capriolo <[email protected]> wrote:

> This is one of the things about Hive: the key is not easily available.
> You are going to need an input format that creates a new value which
> contains both the key and the value.
>
> Like this:
> <url:Text> <data:CrawlDatum> -> <null-writable> new MyKeyValue<<url:Text> <data:CrawlDatum>>
>
>
> On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy
> <[email protected]> wrote:
> > Hi,
> >
> > I have attached a Sequence file with the following format:
> > <url:Text> <data:CrawlDatum>
> >
> > (CrawlDatum is a custom Java type that contains several fields that would
> > be flattened into several columns by the SerDe.)
> >
> > In other words, what I would like to do is to expose this URL+CrawlDatum
> > data via a Hive external table, with the following columns:
> > || url || status || fetchtime || fetchinterval || modifiedtime || retries || score || metadata ||
> >
> > So, I was hoping that after defining a custom SerDe, I would just have to
> > define the Hive table as follows:
> >
> > CREATE EXTERNAL TABLE crawldb
> >   (url STRING, status STRING, fetchtime LONG, fetchinterval LONG,
> >    modifiedtime LONG, retries INT, score FLOAT, metadata MAP)
> > ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
> > STORED AS SEQUENCEFILE
> > LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
> >
> > For example, a sample record should look like the following through a Hive
> > table:
> > || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 || 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
> >
> > I would like this to be possible without having to duplicate/flatten the
> > data through a separate transformation. Initially, I thought my custom SerDe
> > could have the following definition for deserialize():
> >
> > @Override
> > public Object deserialize(Writable obj) throws SerDeException {
> >   ...
> > }
> >
> > But the problem is that the input argument obj above is only the
> > VALUE portion of a SequenceFile record. There seems to be a limitation with
> > the way Hive reads Sequence files. Specifically, for each row in a sequence
> > file, the KEY is ignored and only the VALUE is used by Hive. This is seen
> > from the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow() method,
> > which ignores the KEY when iterating over a RecordReader (see the
> > corresponding Hive code for FetchOperator::getNextRow() below):
> >
> > /**
> >  * Get the next row. The fetch context is modified appropriately.
> >  *
> >  **/
> > public InspectableObject getNextRow() throws IOException {
> >   try {
> >     while (true) {
> >       if (currRecReader == null) {
> >         currRecReader = getRecordReader();
> >         if (currRecReader == null) {
> >           return null;
> >         }
> >       }
> >
> >       boolean ret = currRecReader.next(key, value);
> >       if (ret) {
> >         if (this.currPart == null) {
> >           Object obj = serde.deserialize(value);
> >           return new InspectableObject(obj, serde.getObjectInspector());
> >         } else {
> >           rowWithPart[0] = serde.deserialize(value);
> >           return new InspectableObject(rowWithPart, rowObjectInspector);
> >         }
> >       } else {
> >         currRecReader.close();
> >         currRecReader = null;
> >       }
> >     }
> >   } catch (Exception e) {
> >     throw new IOException(e);
> >   }
> > }
> >
> > As you can see, the "key" variable is ignored and never returned. The
> > problem is that in the Nutch crawldb Sequence File, the KEY is the URL, and
> > I need it to be displayed in the Hive table along with the fields of
> > CrawlDatum. But when writing the custom SerDe, I only see the CrawlDatum
> > that comes after the key on each record... which is not sufficient.
> >
> > One hack could be to write a CustomSequenceFileRecordReader.java that
> > returns the offset in the sequence file as the KEY, and an aggregation of
> > the (Key+Value) as the VALUE. For that, perhaps I need to hack the code
> > below from SequenceFileRecordReader, which would get really messy:
> >
> > protected synchronized boolean next(K key) throws IOException {
> >   if (!more) return false;
> >   long pos = in.getPosition();
> >   boolean remaining = (in.next(key) != null);
> >   if (pos >= end && in.syncSeen()) {
> >     more = false;
> >   } else {
> >     more = remaining;
> >   }
> >   return more;
> > }
> >
> > This would require me to write a CustomSequenceFileRecordReader and a
> > CustomSequenceFileInputFormat and then some custom SerDe, and probably make
> > several other changes as well. Is it possible to just get away with writing
> > a custom SerDe and some pre-existing reader that includes the key when
> > invoking SerDe.deserialize()? Unless I'm missing something, why does Hive
> > have this limitation when accessing Sequence files? I would imagine that
> > the key of a sequence file record would be just as important as the
> > value... so why is it left out by the FetchOperator::getNextRow() method?
> >
> > If this is the unfortunate reality with reading Nutch sequence files in
> > Hive, is there another Hive storage format I should use that works around
> > this limitation? Such as "create external table ..... STORED AS
> > CUSTOM_SEQUENCEFILE"? Or, let's say I write my own
> > CustomHiveSequenceFileInputFormat, how do I register it with Hive and use it
> > in the Hive "STORED AS" definition?
> >
> > Any help or pointers would be greatly appreciated. I hope I'm mistaken about
> > the limitation above, and if not, hopefully there is an easy way to resolve
> > this through a custom SerDe alone.
> >
> > Warm regards,
> > Safdar
>
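P.S. Just to check that I have understood the wrapper suggestion above correctly: is the sketch below roughly the shape of what you mean? All the class and method names (other than the Hadoop and Nutch ones) are my own placeholders, I have not tried to compile it, and I used the old org.apache.hadoop.mapred API since that seems to be what Hive expects for input formats.

// MyKeyValue.java
// Value class that carries both the sequence-file key (url) and value (CrawlDatum),
// so that the SerDe can see the key as well.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;

public class MyKeyValue implements Writable {
  private Text url = new Text();
  private CrawlDatum datum = new CrawlDatum();

  public Text getUrl() { return url; }
  public CrawlDatum getDatum() { return datum; }

  public void set(Text url, CrawlDatum datum) {
    this.url = url;
    this.datum = datum;
  }

  public void write(DataOutput out) throws IOException {
    url.write(out);
    datum.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    url.readFields(in);
    datum.readFields(in);
  }
}

// MyInputFormat.java
// Input format whose record reader hands Hive a NullWritable key and a MyKeyValue
// value built from the underlying sequence-file <url, CrawlDatum> pair.
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileRecordReader;
import org.apache.nutch.crawl.CrawlDatum;

public class MyInputFormat extends FileInputFormat<NullWritable, MyKeyValue> {

  public RecordReader<NullWritable, MyKeyValue> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {

    final SequenceFileRecordReader<Text, CrawlDatum> inner =
        new SequenceFileRecordReader<Text, CrawlDatum>(job, (FileSplit) split);

    return new RecordReader<NullWritable, MyKeyValue>() {
      public NullWritable createKey() { return NullWritable.get(); }
      public MyKeyValue createValue() { return new MyKeyValue(); }

      public boolean next(NullWritable key, MyKeyValue value) throws IOException {
        Text url = inner.createKey();
        CrawlDatum datum = inner.createValue();
        if (!inner.next(url, datum)) {
          return false;
        }
        value.set(url, datum);  // the original key ends up inside the single value object
        return true;
      }

      public long getPos() throws IOException { return inner.getPos(); }
      public float getProgress() throws IOException { return inner.getProgress(); }
      public void close() throws IOException { inner.close(); }
    };
  }
}

If that is roughly right, then I assume my SerDe's deserialize() would receive a MyKeyValue and could pull the url out of it along with the flattened CrawlDatum fields.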

