Re: Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles.

Ali Safdar Kureishy Sun, 06 May 2012 07:30:43 -0700

Also, if I return a fully formatted string containing all the flattened
values from my key+value (such as what you suggested), then I'd need to
split the resulting string into its component columns based on the
delimiter ("," or ";" or "\t" etc). How do I define the right table for
that?


In other words, my custom input format will return a value string of this
form:
<Text>;<cd.status>;<cd.fetchTime>;<cd.retries>;<cd.map>;.....
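
To make that format concrete, here is a minimal sketch of the flattening step in plain Java. Everything in it is an assumption for illustration: the class name `CrawlRowFlattener`, the choice of fields, and the "," and "=" separators inside the map column are hypothetical, and nothing here is taken from Nutch or Hive.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: none of these names come from Nutch or Hive.
class CrawlRowFlattener {
    static final String FIELD_DELIM = ";"; // column separator Hive would split on
    static final String ENTRY_DELIM = ","; // assumed separator between map entries
    static final String KV_DELIM = "=";    // assumed separator between a map key and its value

    /** Joins the key (url) and the value's fields into one delimited row. */
    static String flatten(String url, String status, long fetchTime,
                          int retries, Map<String, String> metadata) {
        // Render the map column first, e.g. "x=1,y=2".
        StringJoiner map = new StringJoiner(ENTRY_DELIM);
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            map.add(e.getKey() + KV_DELIM + e.getValue());
        }
        // Then join all columns with the field delimiter.
        return new StringJoiner(FIELD_DELIM)
                .add(url)
                .add(status)
                .add(Long.toString(fetchTime))
                .add(Integer.toString(retries))
                .add(map.toString())
                .toString();
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("x", "1");
        meta.put("y", "2");
        // prints: http://www.cnn.com;FETCHED;125355734857;1;x=1,y=2
        System.out.println(flatten("http://www.cnn.com", "FETCHED", 125355734857L, 1, meta));
    }
}
```

One caveat with this approach: a URL can itself contain ";", which would shift the columns, so a delimiter that cannot occur in the data (or some escaping) may be needed. If the map column is kept, the table would presumably also need matching COLLECTION ITEMS TERMINATED BY and MAP KEYS TERMINATED BY clauses.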

And so, on the Hive side, I'd like to use a ";" as the delimiter. Typically
this Hive table would be defined as:

CREATE TABLE crawldb (.....)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
....
....

Would I now be able to define my table the same way, using my custom input
format?

CREATE TABLE crawldb (...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS INPUTFORMAT 'MyFlatteningInputFormat'
LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
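
For reference, this is roughly what I imagine the reader behind such an input format doing: wrap the underlying key/value reader and hand back a single delimited string per record, so the key survives even though only the value reaches the SerDe. This is only a sketch under assumptions. `PairReader`, `InMemoryPairReader` and `FlatteningReader` are hypothetical stand-ins so the example runs without Hadoop on the classpath; a real version would wrap an org.apache.hadoop.mapred.RecordReader and write the combined string into the value passed to next(key, value).

```java
// Hypothetical stand-in for a key/value reader; a real implementation would
// wrap org.apache.hadoop.mapred.RecordReader instead.
interface PairReader {
    /** Returns {key, value} for the next record, or null at end of input. */
    String[] next();
}

/** In-memory reader used only to exercise the sketch. */
class InMemoryPairReader implements PairReader {
    private final String[][] pairs;
    private int pos = 0;

    InMemoryPairReader(String[][] pairs) {
        this.pairs = pairs;
    }

    @Override
    public String[] next() {
        return pos < pairs.length ? pairs[pos++] : null;
    }
}

/** Emits one delimited string per record so the key is not lost. */
class FlatteningReader {
    private final PairReader inner;
    private final char delimiter;

    FlatteningReader(PairReader inner, char delimiter) {
        this.inner = inner;
        this.delimiter = delimiter;
    }

    /** Returns key + delimiter + value, or null once the input is exhausted. */
    String nextValue() {
        String[] pair = inner.next();
        if (pair == null) {
            return null;
        }
        return pair[0] + delimiter + pair[1];
    }

    public static void main(String[] args) {
        PairReader source = new InMemoryPairReader(new String[][] {
            {"http://www.cnn.com", "FETCHED;125355734857;1"}
        });
        FlatteningReader reader = new FlatteningReader(source, ';');
        String row;
        while ((row = reader.nextValue()) != null) {
            System.out.println(row);
        }
    }
}
```

With a value shaped like this, the delimited row format would only need to split on ";" and would never have to look at the original key.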

Thanks,
Safdar

On Sun, May 6, 2012 at 4:34 AM, Ali Safdar Kureishy <
[email protected]> wrote:

> Thanks Edward.
>
> What are the Input and Output formats chosen by Hive for the "STORED
> AS SEQUENCEFILE" selection? And if I want to add my own syntactic
> sugar, is there a lookup mechanism where I can register my custom code
> so that it would work with "STORED AS MYCUSTOMSEQUENCEFILE"?
>
> Thanks,
> Safdar
>
>
> On Sun, May 6, 2012 at 1:16 AM, Edward Capriolo <[email protected]>
> wrote:
> > STORED AS SEQUENCEFILE is syntactic sugar. It sets both the input format
> > and the output format.
> >
> > CREATE TABLE x (thing INT)
> > STORED AS INPUTFORMAT 'class.x'
> > OUTPUTFORMAT 'class.y'
> >
> > For the input format you can use your custom class.
> >
> > For the output format you can stick with Hive's
> > HiveIgnoreKeyTextOutputFormat or HiveSequenceFileOutputFormat.
> >
> > To avoid having to write a SerDe, your input format could also change the
> > types and format to something Hive can easily recognize.
> >
> >
> > On Saturday, May 5, 2012, Ali Safdar Kureishy <[email protected]> wrote:
> >> Thanks Edward...I feared this was going to be the case.
> >> If I define a new input format, how do I use it in a hive table
> >> definition?
> >> For the SequenceFileInputFormat, the table definition would read as
> >> "...STORED AS SEQUENCEFILE".
> >> With the new one, how do I specify it in the definition? "STORED AS
> >> 'com.xyz.abc.MyInputFormat'"?
> >> Thanks,
> >> Safdar
> >>
> >> On Sat, May 5, 2012 at 2:44 PM, Edward Capriolo <[email protected]>
> >> wrote:
> >>
> >> This is one of the quirks of Hive: the key is not easily available.
> >> You are going to need an input format that creates a new value which
> >> contains both the key and the value.
> >>
> >> Like this:
> >> <url:Text> <data:CrawlDatum> -> <null-writable>  new
> >> MyKeyValue<<url:Text> <data:CrawlDatum>>
> >>
> >>
> >> On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy
> >> <[email protected]> wrote:
> >>> Hi,
> >>>
> >>> I have attached a Sequence file with the following format:
> >>> <url:Text> <data:CrawlDatum>
> >>>
> >>> (CrawlDatum is a custom Java type, that contains several fields that
> >>> would
> >>> be flattened into several columns by the SerDe).
> >>>
> >>> In other words, what I would like to do is expose this URL+CrawlDatum
> >>> data via a Hive external table, with the following columns:
> >>> || url || status || fetchtime || fetchinterval || modifiedtime || retries ||
> >>> score || metadata ||
> >>>
> >>> So, I was hoping that after defining a custom SerDe, I would just have to
> >>> define the Hive table as follows:
> >>>
> >>> CREATE EXTERNAL TABLE crawldb
> >>> (url STRING, status STRING, fetchtime BIGINT, fetchinterval BIGINT,
> >>> modifiedtime BIGINT, retries INT, score FLOAT, metadata MAP)
> >>> ROW FORMAT SERDE 'NutchCrawlDBSequenceFileSerDe'
> >>> STORED AS SEQUENCEFILE
> >>> LOCATION '/user/training/deepcrawl/crawldb/current/part-00000';
> >>>
> >>> For example, a sample record should look like the following through a
> >>> Hive table:
> >>> || http://www.cnn.com || FETCHED || 125355734857 || 36000 || 12453775834 ||
> >>> 1 || 0.98 || {x=1,y=2,p=3,q=4} ||
> >>>
> >>> I would like this to be possible without having to duplicate/flatten the
> >>> data through a separate transformation. Initially, I thought my custom
> >>> SerDe could have the following definition for deserialize():
> >>>
> >>>         @Override
> >>>         public Object deserialize(Writable obj) throws SerDeException {
> >>>             ...
> >>>         }
> >>>
> >>> But the problem is that the input argument obj above is only the
> >>> VALUE portion of a Sequence record. There seems to be a limitation with
> >>> the way Hive reads Sequence files. Specifically, for each row in a sequence
> >>> file, the KEY is ignored and only the VALUE is used by Hive. This can be
> >>> seen in the org.apache.hadoop.hive.ql.exec.FetchOperator::getNextRow()
> >>> method below, which ignores the KEY when iterating over a RecordReader
> >>> (note the calls to currRecReader.next(key, value) and
> >>> serde.deserialize(value) in the corresponding Hive code):
> >>>
> >>>   /**
> >>>    * Get the next row. The fetch context is modified appropriately.
> >>>    *
> >>>    **/
> >>>   public InspectableObject getNextRow() throws IOException {
> >>>     try {
> >>>       while (true) {
> >>>         if (currRecReader == null) {
> >>>           currRecReader = getRecordReader();
> >>>           if (currRecReader == null) {
> >>>             return null;
> >>>           }
> >>>         }
> >>>
> >>>         boolean ret = currRecReader.next(key, value);
> >>>         if (ret) {
> >>>           if (this.currPart == null) {
> >>>             Object obj = serde.deserialize(value);
> >>>             return new InspectableObject(obj, serde.getObjectInspector());
> >>>           } else {
> >>>             rowWithPart[0] = serde.deserialize(value);
> >>>             return new InspectableObject(rowWithPart, rowObjectInspector);
> >>>           }
> >>>         } else {
> >>>           currRecReader.close();
> >>>           currRecReader = null;
> >>>         }
> >>>       }
> >>>     } catch (Exception e) {
> >>>       throw new IOException(e);
> >>>     }
> >>>   }
> >>>
> >>> As you can see, the "key" variable is ignored and never returned. The
> >>> problem is that in the Nutch crawldb Sequence File, the KEY is the URL,
> >>> and I need it to be displayed in the Hive table along with the fields of
> >>> CrawlDatum. But when writing the custom SerDe, I only see the
> >>> CrawlDatum that comes after the key on each record, which is not sufficient.
> >>>
> >>> One hack could be to write a CustomSequenceFileRecordReader.java that
> >>> returns the offset in the sequence file as the KEY, and an aggregation of
> >>> the (Key+Value) as th
>
