STORED AS SEQUENCEFILE is syntactic sugar: it sets both the input format and
the output format.

CREATE TABLE x (thing INT)
STORED AS
  INPUTFORMAT 'class.x'
  OUTPUTFORMAT 'class.y'

For the input format you can use your own custom class.

For the output format you can stick with Hive's HiveIgnoreKeyTextOutputFormat
or HiveSequenceFileOutputFormat.
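Concretely, the table definition spells out the fully qualified class names. A sketch, assuming a hypothetical custom class 'com.example.MyInputFormat'; the Hive output-format class lives under org.apache.hadoop.hive.ql.io, but verify the exact name against your Hive version:

```sql
CREATE TABLE x (thing INT)
STORED AS
  INPUTFORMAT 'com.example.MyInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```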

To avoid having to write a SerDe, your input format could also change the
types and format to something Hive can easily recognize.
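The input-format trick above can be illustrated with a simplified, Hadoop-free sketch. KeyFoldingReader, KeyValue, and the String-based records are hypothetical stand-ins for a real RecordReader wrapper over SequenceFile key/value pairs; the point is only that the wrapper folds the key into the value, so what Hive hands the SerDe contains both halves:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a reader wrapper that folds the underlying key
// into the value, since Hive's FetchOperator only hands the VALUE to
// the SerDe. In a real InputFormat these would be Writable types.
public class KeyFoldingReader {

    // Composite value carrying both halves of the original record.
    public static class KeyValue {
        public final String url;    // was the SequenceFile key
        public final String datum;  // was the SequenceFile value
        public KeyValue(String url, String datum) {
            this.url = url;
            this.datum = datum;
        }
    }

    private final Iterator<Map.Entry<String, String>> underlying;

    public KeyFoldingReader(Iterator<Map.Entry<String, String>> underlying) {
        this.underlying = underlying;
    }

    // Mirrors RecordReader.next(): the key is dropped (Hive ignores it
    // anyway) and the returned value contains key and value together.
    public KeyValue next() {
        if (!underlying.hasNext()) {
            return null;
        }
        Map.Entry<String, String> e = underlying.next();
        return new KeyValue(e.getKey(), e.getValue());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        records.add(new SimpleEntry<>("http://www.cnn.com", "FETCHED"));
        KeyFoldingReader reader = new KeyFoldingReader(records.iterator());
        KeyValue kv = reader.next();
        // The SerDe would now see both the url (key) and the datum (value).
        System.out.println(kv.url + " " + kv.datum);
    }
}
```

A real implementation would wrap the RecordReader obtained from SequenceFileInputFormat the same way, returning NullWritable as the new key.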

On Saturday, May 5, 2012, Ali Safdar Kureishy <safdar.kurei...@gmail.com> wrote:
> Thanks Edward...I feared this was going to be the case.
> If I define a new input format, how do I use it in a Hive table definition?
> For the SequenceFileInputFormat, the table definition would read as
> "...STORED AS SEQUENCEFILE".
> With the new one, how do I specify it in the definition? "STORED AS
> 'com.xyz.abc.MyInputFormat'"?
> Thanks,
> Safdar
>
> On Sat, May 5, 2012 at 2:44 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
> This is one of those things about Hive: the key is not easily available.
> You are going to need an input format that creates a new value which
> contains both the key and the value.
>
> Like this:
> <url:Text> <data:CrawlDatum>  ->  <NullWritable> new MyKeyValue<<url:Text> <data:CrawlDatum>>
>
>
> On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy <safdar.kurei...@gmail.com> wrote:
>> Hi,
>>
>> I have attached a Sequence file with the following format:
>> <url:Text> <data:CrawlDatum>
>>
>> (CrawlDatum is a custom Java type that contains several fields that would
>> be flattened into several columns by the SerDe).
>>
>> In other words, what I would like to do, is to expose this URL+CrawlDatum
>> data via a Hive External table, with the following columns:
>> || url || status || fetchtime || fetchinterval || modifiedtime || retries || score || metadata ||
>>
>> So, I was hoping that after defining a custom SerDe, I would just have to
>> define the Hive table as follows:
>>
>> CREATE EXTERNAL TABLE crawldb
>> (url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
>> modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING, STRING>)
>> ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
>> STORED AS SEQUENCEFILE
>> LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
>>
>> For example, a sample record should look like the following through a Hive
>> table:
>> || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 || 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
>>
>> I would like this to be possible without having to duplicate/flatten the
>> data through a separate transformation. Initially, I thought my custom
>> SerDe could have the following definition for deserialize():
>>
>>         @Override
>>         public Object deserialize(Writable obj) throws SerDeException {
>>             ...
>>         }
>>
>> But the problem is that the input argument obj above is only the
>> VALUE portion of a Sequence record. There seems to be a limitation with the
>> way Hive reads Sequence files. Specifically, for each row in a sequence
>> file, the KEY is ignored and only the VALUE is used by Hive. This can be
>> seen in the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow()
>> method, which ignores the KEY when iterating over a RecordReader (see the
>> corresponding Hive code for FetchOperator::getNextRow() below):
>>
>>   /**
>>    * Get the next row. The fetch context is modified appropriately.
>>    *
>>    **/
>>   public InspectableObject getNextRow() throws IOException {
>>     try {
>>       while (true) {
>>         if (currRecReader == null) {
>>           currRecReader = getRecordReader();
>>           if (currRecReader == null) {
>>             return null;
>>           }
>>         }
>>
>>         boolean ret = currRecReader.next(key, value);
>>         if (ret) {
>>           if (this.currPart == null) {
>>             Object obj = serde.deserialize(value);
>>             return new InspectableObject(obj, serde.getObjectInspector());
>>           } else {
>>             rowWithPart[0] = serde.deserialize(value);
>>             return new InspectableObject(rowWithPart, rowObjectInspector);
>>           }
>>         } else {
>>           currRecReader.close();
>>           currRecReader = null;
>>         }
>>       }
>>     } catch (Exception e) {
>>       throw new IOException(e);
>>     }
>>   }
>>
>> As you can see, the "key" variable is ignored and never returned. The
>> problem is that in the Nutch crawldb Sequence File, the KEY is the URL, and
>> I need it to be displayed in the Hive table along with the fields of
>> CrawlDatum. But when writing the custom SerDe, I only see the CrawlDatum
>> that comes after the key, on each record...which is not sufficient.
>>
>> One hack could be to write a CustomSequenceFileRecordReader.java that
>> returns the offset in the sequence file as the KEY, and an aggregation of
>> the (Key+Value) as th
