Thanks Edward.

Which InputFormat and OutputFormat classes does Hive choose for the "STORED
AS SEQUENCEFILE" option? And if I want to add my own syntactic sugar, is
there a lookup mechanism where I can register my custom classes so that
"STORED AS MYCUSTOMSEQUENCEFILE" would work?
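
My guess is that it is roughly equivalent to spelling out the classes
explicitly, something like the following, but please correct me if I have
them wrong:

CREATE TABLE x (thing INT)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat';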

Thanks,
Safdar


On Sun, May 6, 2012 at 1:16 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> STORED AS SEQUENCEFILE is syntactic sugar. It sets both the input format and
> the output format.
>
> CREATE TABLE x (thing INT)
> STORED AS INPUTFORMAT 'class.x'
> OUTPUTFORMAT 'class.y';
>
> For the input format you can use your custom class.
>
> For your output format you can stick with Hive's HiveIgnoreKeyTextOutputFormat
> or the sequence-file equivalent, HiveSequenceFileOutputFormat.
>
> To avoid having to write a SerDe, your input format could also change the types
> and format into something Hive can easily recognize.
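>
> Roughly like this (just a sketch; 'com.example.MyInputFormat' stands in for
> your custom class):
>
> CREATE TABLE x (thing INT)
> STORED AS INPUTFORMAT 'com.example.MyInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';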
>
>
> On Saturday, May 5, 2012, Ali Safdar Kureishy <safdar.kurei...@gmail.com>
> wrote:
>> Thanks Edward...I feared this was going to be the case.
>> If I define a new input format, how do I use it in a Hive table
>> definition?
>> For SequenceFileInputFormat, the table definition would read
>> "...STORED AS SEQUENCEFILE".
>> With the new one, how do I specify it in the definition? "STORED AS
>> 'com.xyz.abc.MyInputFormat'"?
>> Thanks,
>> Safdar
>>
>> On Sat, May 5, 2012 at 2:44 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>> This is one of those things about Hive: the key is not easily available.
>> You are going to need an input format that creates a new value which
>> contains both the key and the value.
>>
>> Like this:
>>   <url:Text> <data:CrawlDatum>  ->  <null-writable> new MyKeyValue<<url:Text> <data:CrawlDatum>>
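>>
>> A rough sketch of such a wrapper (this MyKeyValue class is made up; it just
>> bundles the key and the value into one Writable so your SerDe can get at
>> both):
>>
>> import java.io.DataInput;
>> import java.io.DataOutput;
>> import java.io.IOException;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.io.Writable;
>> import org.apache.nutch.crawl.CrawlDatum;
>>
>> // Carries one whole sequence-file record (url key + CrawlDatum value) as a
>> // single value object, since Hive only hands the value to the SerDe.
>> public class MyKeyValue implements Writable {
>>   private final Text url = new Text();
>>   private CrawlDatum datum = new CrawlDatum();
>>
>>   public Text getUrl() { return url; }
>>   public CrawlDatum getDatum() { return datum; }
>>
>>   public void set(Text url, CrawlDatum datum) {
>>     this.url.set(url);
>>     this.datum = datum;
>>   }
>>
>>   @Override
>>   public void write(DataOutput out) throws IOException {
>>     url.write(out);
>>     datum.write(out);
>>   }
>>
>>   @Override
>>   public void readFields(DataInput in) throws IOException {
>>     url.readFields(in);
>>     datum.readFields(in);
>>   }
>> }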
>>
>>
>> On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy
>> <safdar.kurei...@gmail.com> wrote:
>>> Hi,
>>>
>>> I have attached a Sequence file with the following format:
>>> <url:Text> <data:CrawlDatum>
>>>
>>> (CrawlDatum is a custom Java type that contains several fields that would
>>> be flattened into several columns by the SerDe).
>>>
>>> In other words, what I would like to do is expose this URL+CrawlDatum
>>> data via a Hive external table with the following columns:
>>> || url || status || fetchtime || fetchinterval || modifiedtime || retries || score || metadata ||
>>>
>>> So, I was hoping that after defining a custom SerDe, I would just have to
>>> define the Hive table as follows:
>>>
>>> CREATE EXTERNAL TABLE crawldb
>>> (url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
>>> modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP<STRING, STRING>)
>>> ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
>>> STORED AS SEQUENCEFILE
>>> LOCATION '/user/training/deepcrawl/crawldb/current';
>>>
>>> For example, a sample record should look like the following through a Hive
>>> table:
>>> || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 || 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
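>>>
>>> (Just to illustrate the kind of query I'd then like to run against it;
>>> only a sketch, using the column names above:)
>>>
>>> SELECT url, status, score, metadata['x']
>>> FROM crawldb
>>> WHERE status = 'FETCHED' AND score > 0.5
>>> LIMIT 10;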
>>>
>>> I would like this to be possible without having to duplicate/flatten the
>>> data through a separate transformation. Initially, I thought my custom SerDe
>>> could have the following definition for deserialize():
>>>
>>> @Override
>>> public Object deserialize(Writable obj) throws SerDeException {
>>>     ...
>>> }
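>>>
>>> (Something along these lines; this is only a sketch, and the CrawlDatum
>>> getter names are assumptions that may not match the real class:)
>>>
>>> @Override
>>> public Object deserialize(Writable obj) throws SerDeException {
>>>     // Sketch: flatten one CrawlDatum into the row's columns.
>>>     CrawlDatum datum = (CrawlDatum) obj;
>>>     List<Object> row = new ArrayList<Object>();
>>>     row.add(null);                                          // url: not available here!
>>>     row.add(CrawlDatum.getStatusName(datum.getStatus()));   // assumed getter
>>>     row.add(datum.getFetchTime());                          // assumed getter
>>>     row.add((long) datum.getFetchInterval());               // assumed getter
>>>     row.add(datum.getModifiedTime());                       // assumed getter
>>>     row.add((int) datum.getRetriesSinceFetch());            // assumed getter
>>>     row.add(datum.getScore());                              // assumed getter
>>>     row.add(datum.getMetaData());                           // assumed getter
>>>     return row;
>>> }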
>>>
>>> But the problem is that the input argument obj above is only the VALUE
>>> portion of a sequence-file record. There seems to be a limitation in the
>>> way Hive reads sequence files. Specifically, for each row in a sequence
>>> file, the KEY is ignored and only the VALUE is passed to the SerDe. This
>>> can be seen in the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow()
>>> method below, which ignores the KEY when iterating over the RecordReader
>>> (see the NOTE marking the relevant lines in the code below):
>>>
>>>   /**
>>>    * Get the next row. The fetch context is modified appropriately.
>>>    *
>>>    **/
>>>   public InspectableObject getNextRow() throws IOException {
>>>     try {
>>>       while (true) {
>>>         if (currRecReader == null) {
>>>           currRecReader = getRecordReader();
>>>           if (currRecReader == null) {
>>>             return null;
>>>           }
>>>         }
>>>
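>>>         // NOTE: 'key' is populated by the reader here, but it is never
>>>         // passed on; only 'value' reaches the SerDe, so the sequence-file
>>>         // key (the URL, in my case) is dropped.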
>>>         boolean ret = currRecReader.next(key, value);
>>>         if (ret) {
>>>           if (this.currPart == null) {
>>>             Object obj = serde.deserialize(value);
>>>             return new InspectableObject(obj, serde.getObjectInspector());
>>>           } else {
>>>             rowWithPart[0] = serde.deserialize(value);
>>>             return new InspectableObject(rowWithPart, rowObjectInspector);
>>>           }
>>>         } else {
>>>           currRecReader.close();
>>>           currRecReader = null;
>>>         }
>>>       }
>>>     } catch (Exception e) {
>>>       throw new IOException(e);
>>>     }
>>>   }
>>>
>>> As you can see, the "key" variable is ignored and never returned. The
>>> problem is that in the Nutch crawldb sequence file, the KEY is the URL, and
>>> I need it to be displayed in the Hive table along with the fields of
>>> CrawlDatum. But when writing the custom SerDe, I only see the CrawlDatum
>>> that comes after the key on each record... which is not sufficient.
>>>
>>> One hack could be to write a CustomSequenceFileRecordReader.java that
>>> returns the offset in the sequence file as the KEY, and an aggregation of
>>> the (Key+Value) as the VALUE.
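>>>
>>> (Very roughly, something like the sketch below. The wrapper type
>>> UrlDatumWritable is hypothetical: a Writable holding both the Text url and
>>> the CrawlDatum, with a set(url, datum) method.)
>>>
>>> import java.io.IOException;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapred.FileSplit;
>>> import org.apache.hadoop.mapred.JobConf;
>>> import org.apache.hadoop.mapred.RecordReader;
>>> import org.apache.hadoop.mapred.SequenceFileRecordReader;
>>> import org.apache.nutch.crawl.CrawlDatum;
>>>
>>> // Sketch of the hack: expose the file position as the KEY and hide the real
>>> // (url, CrawlDatum) pair inside a single wrapper VALUE, so that Hive's
>>> // value-only handling still carries the url through to the SerDe.
>>> public class CustomSequenceFileRecordReader
>>>     implements RecordReader<LongWritable, UrlDatumWritable> {
>>>
>>>   private final SequenceFileRecordReader<Text, CrawlDatum> reader;
>>>   private final Text url = new Text();
>>>   private final CrawlDatum datum = new CrawlDatum();
>>>
>>>   public CustomSequenceFileRecordReader(JobConf conf, FileSplit split)
>>>       throws IOException {
>>>     reader = new SequenceFileRecordReader<Text, CrawlDatum>(conf, split);
>>>   }
>>>
>>>   public boolean next(LongWritable key, UrlDatumWritable value) throws IOException {
>>>     key.set(reader.getPos());          // file offset stands in for the key
>>>     if (!reader.next(url, datum)) {
>>>       return false;
>>>     }
>>>     value.set(url, datum);             // the real key (url) travels inside the value
>>>     return true;
>>>   }
>>>
>>>   public LongWritable createKey() { return new LongWritable(); }
>>>   public UrlDatumWritable createValue() { return new UrlDatumWritable(); }
>>>   public long getPos() throws IOException { return reader.getPos(); }
>>>   public float getProgress() throws IOException { return reader.getProgress(); }
>>>   public void close() throws IOException { reader.close(); }
>>> }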
