Hi,

I might be able to shed some light on this.

Every row in the HBase webtable is a serialized
org.apache.nutch.storage.WebPage. When you go through the Gora-related
wrappers (as the various Nutch jobs such as ParserJob and FetcherJob do),
the Avro schemas are applied for you, so you don't have to think about how
the data is encoded. You get a WebPage object that has all the fields you
specified on the input of the job.

When you want to bypass this for some reason, you can decode the fields
manually:

The bytes of each column are encoded in a specific manner. I see that your
test code simply tries to interpret every value as a UTF-8 encoded string.
That is what Bytes.toString(byte[]) assumes. This works for certain fields
because they actually are UTF-8 encoded strings, but other values are
encoded differently. HBaseByteInterface (in the Gora project) shows the
different encodings. For example, the status field f:st "status" is an
Integer (as indicated in the WebPage class) that is encoded with
Bytes.toBytes(int) and should be decoded accordingly with
Bytes.toInt(byte[]). The difficult fields are f:prot "ProtocolStatus" and
p:st "ParseStatus", which are Avro records (so they have a schema of their
own). To decode those, you can use the same code as in HBaseByteInterface,
that is, a combination of SpecificDatumReader and BinaryDecoder.
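
To illustrate the integer case, here is a minimal sketch that only uses the
JDK, so it runs without HBase on the classpath. It relies on the fact that
Bytes.toBytes(int) writes 4 bytes in big-endian order, which is also the
default byte order of java.nio.ByteBuffer. The class name, the method names
and the sample value 2 are made up for the illustration; in your scanner
loop you would simply call Bytes.toInt on the value of f:st instead. The
Avro records are not covered here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Nutch, Gora or HBase.
class StatusDecodeSketch {

    // Produces the same layout as Bytes.toBytes(int): 4 bytes, big-endian.
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Counterpart of Bytes.toInt(byte[]) for a 4-byte value.
    static int decodeInt(byte[] raw) {
        if (raw == null || raw.length != 4) {
            throw new IllegalArgumentException("expected exactly 4 bytes");
        }
        return ByteBuffer.wrap(raw).getInt();
    }

    public static void main(String[] args) {
        // Assumed sample value, standing in for the bytes stored under f:st.
        byte[] raw = encodeInt(2);

        // Reading the bytes as UTF-8 yields control characters, which look
        // blank when printed. This is the effect seen with Bytes.toString.
        String asText = new String(raw, StandardCharsets.UTF_8);
        System.out.println("as text: [" + asText + "]");

        // Reading the same bytes as an int recovers the stored value.
        System.out.println("as int: " + decodeInt(raw));
    }
}
```

The same reasoning applies to the long and float fields: each needs the
matching Bytes.toXxx call rather than Bytes.toString.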

Good luck.




On Thu, May 30, 2013 at 4:12 AM, Shah, Nishant <[email protected]> wrote:

> Thanks for the reply. Will look into your suggestions.
>
> -----Original Message-----
> From: Lewis John Mcgibbney [mailto:[email protected]]
> Sent: Wednesday, May 29, 2013 7:09 PM
> To: [email protected]
> Subject: Re: Extracting status code from hbase
>
> OH, BTW I meant to refer you to the test in line 178 of [0]. testPutNested
> hth Lewis
>
>
> On Wed, May 29, 2013 at 7:07 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > This is most certainly better aimed at either Gora or HBase lists.
> > Obtaining better (and consistent) understanding and of course
> > abstracting users from such data structures is what we have been
> > addressing in current Gora development. (See GORA-174) You will want
> > to look specifically at some of the testing we do for this stuff over
> > in Gora, namely in [0-1].
> > Specifically, the Query API in Gora for some data store
> > implementations could `probably` do with some attention... so please
> > voice your opinion over on user@gora if it tickles your fancy.
> > Thanks
> > Lewis
> >
> >
> > [0]
> > http://svn.apache.org/viewvc/gora/trunk/gora-core/src/test/java/org/apache/gora/store/DataStoreTestBase.java?view=markup
> > [1]
> > http://svn.apache.org/viewvc/gora/trunk/gora-core/src/examples/avro/webpage.json?view=markup
> >
> >
> > On Wed, May 29, 2013 at 3:55 PM, Shah, Nishant <[email protected]>
> wrote:
> >
> >> Hi Everyone,
> >>
> >> I got my error. I was trying to use toString for a field which is int
> >> or float or long. But this leads me to another question.
> >> The protocol status is a nested structure. Similar to parseStatus.
> >> How could we parse these to get the individual majorcode, minorcode
> >> and args? Also, how do we detect whether a URL has returned a 404, a
> >> 200 or any other status code?
> >> Thanks.
> >>
> >> -----Original Message-----
> >> From: Shah, Nishant
> >> Sent: Wednesday, May 29, 2013 1:51 PM
> >> To: [email protected]
> >> Subject: Extracting status code from hbase
> >>
> >> Hi Everyone,
> >>
> >> I have my Nutch 2.1 setup with Hbase. Once I am done with the crawl,
> >> I want to extract all the information from the column family 'f'.
> >> For this I do,
> >>
> >> Scan s = new Scan();
> >> ResultScanner scanner = table.getScanner(s);
> >> try {
> >>   // Scanners return Result instances.
> >>   // Now, for the actual iteration. One way is to use a while loop
> >>   // like so:
> >>   for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
> >>     // print out the row we found and the columns we were looking for
> >>     System.out.println("Found row: " + rr);
> >>     String[] rrs = getColumnsInColumnFamily(rr, "f");
> >>     NavigableMap familyMap = rr.getFamilyMap(Bytes.toBytes("f"));
> >>     Iterator entries = familyMap.entrySet().iterator();
> >>     while (entries.hasNext()) {
> >>       Entry thisEntry = (Entry) entries.next();
> >>       Object key = thisEntry.getKey();
> >>       Object val = thisEntry.getValue();
> >>       System.out.println(Bytes.toString((byte[]) key) + "=" + Bytes.toString((byte[]) val));
> >>     }
> >>
> >> The value for status is blank. It's not null, but blank. Same is the
> >> case with headers. 'mtdt' family and rest of the 'f' family is fine.
> >> Can anyone suggest why this is happening ?
> >> Thanks,
> >> Nishant
> >>
> >
> >
> >
> > --
> > *Lewis*
> >
>
>
>
> --
> *Lewis*
>



-- 
*Ferdy Galema*
Kalooga Development
