Just to add: here is how to manually decode ProtocolStatus:

ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
BinaryDecoder bd = new BinaryDecoder(bis);
System.out.print(bd.readInt()); //first value is an Integer
//second value is an array of strings
for(long i = bd.readArrayStart(); i != 0; i = bd.arrayNext()) {
  for (long j = 0; j < i; j++) {
    System.out.print(bd.readString(null).toString());
  }
}
System.out.print(bd.readLong()); //last value is a Long
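
For anyone who wants to see what BinaryDecoder is doing in the snippet above, here is a minimal sketch that reads the same three values (int, array of strings, long) with only the JDK. It follows the Avro binary encoding: zig-zag variable-length integers, length-prefixed UTF-8 strings, and arrays written as counted blocks ending in a zero count. The class name and the sample bytes are made up for illustration; they are not taken from a real webtable row.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class AvroManualDecode {

    // Avro writes int and long as zig-zag encoded variable-length integers.
    static long readLong(ByteBuffer in) {
        long raw = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            raw |= (long) (b & 0x7F) << shift; // low 7 bits carry data
            shift += 7;
        } while ((b & 0x80) != 0);             // high bit set: more bytes follow
        return (raw >>> 1) ^ -(raw & 1);       // undo the zig-zag mapping
    }

    // A string is a length (as a long) followed by that many UTF-8 bytes.
    static String readString(ByteBuffer in) {
        byte[] buf = new byte[(int) readLong(in)];
        in.get(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    // An array is a series of blocks; a block count of 0 ends the array.
    static List<String> readStringArray(ByteBuffer in) {
        List<String> items = new ArrayList<>();
        for (long count = readLong(in); count != 0; count = readLong(in)) {
            if (count < 0) {      // negative count: a byte size follows
                count = -count;
                readLong(in);     // block size in bytes, not needed here
            }
            for (long j = 0; j < count; j++) {
                items.add(readString(in));
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Hypothetical sample: int 5, array ["404"], long 1.
        byte[] bytes = {0x0A, 0x02, 0x06, '4', '0', '4', 0x00, 0x02};
        ByteBuffer in = ByteBuffer.wrap(bytes);
        System.out.println(readLong(in));        // first value is an Integer
        System.out.println(readStringArray(in)); // second value is an array of strings
        System.out.println(readLong(in));        // last value is a Long
    }
}
```

In real code the Avro classes are the better choice; this only shows why reading the bytes as a plain string does not work.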



On Thu, May 30, 2013 at 10:10 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> I might be able to shed some light on this.
>
> Every row in the HBase webtable is a serialized
> org.apache.nutch.storage.WebPage. When you use Gora related wrappers (like
> the various Nutch jobs such as ParserJob, FetcherJob) you use the Avro
> schemas so you don't have to think about how it is encoded. You get a
> WebPage object that has all fields that you specified on the input of a Job.
>
> When you want to bypass this for some reason, you can decode the fields
> manually:
>
> The bytes of the columns are encoded in a specific manner. I see that your
> testcode simply tries to interpret every value as a UTF-8 encoded string.
> That is what Bytes.toString(byte[]) assumes. Although this works for
> certain fields because they actually are UTF-8 encoded strings, some values
> are encoded differently. HBaseByteInterface (in the Gora project) shows the
> different encodings. For example, the status field f:st "status" is an
> Integer (as indicated in WebPage class) that is encoded as
> Bytes.toBytes(int) and should be decoded accordingly with
> Bytes.toInt(byte[]). The difficult fields are f:prot "ProtocolStatus" and
> p:st "ParseStatus", which are Avro records (so they have a schema of their own).
> To decode those, you can use the same code as in HBaseByteInterface. That
> is, a combination of SpecificDatumReader and BinaryDecoder.
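
A small sketch of the f:st point above, using only the JDK. It assumes that Bytes.toBytes(int) writes a 4-byte big-endian value, and uses java.nio.ByteBuffer as a stand-in for the HBase Bytes helper. The class name and the status value 2 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class StatusBytesDemo {

    // Stand-in for Bytes.toBytes(int): 4 bytes, big-endian (ByteBuffer's default order).
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Stand-in for Bytes.toInt(byte[]).
    static int decodeInt(byte[] raw) {
        return ByteBuffer.wrap(raw).getInt();
    }

    // Roughly what Bytes.toString(byte[]) does: interpret the bytes as UTF-8 text.
    static String asUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = encodeInt(2); // hypothetical status value

        // Decoded as an int, the value comes back intact.
        System.out.println(decodeInt(raw)); // prints 2

        // Decoded as a string, the result is four control characters,
        // which show up as a blank value on the console.
        System.out.println(asUtf8(raw).trim().isEmpty()); // prints true
    }
}
```

With the real HBase classes the same check would be Bytes.toInt(value) instead of Bytes.toString(value) for that column.
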
>
> Good luck.
>
>
>
>
> On Thu, May 30, 2013 at 4:12 AM, Shah, Nishant <[email protected]> wrote:
>
>> Thanks for the reply. Will look into your suggestions.
>>
>> -----Original Message-----
>> From: Lewis John Mcgibbney [mailto:[email protected]]
>> Sent: Wednesday, May 29, 2013 7:09 PM
>> To: [email protected]
>> Subject: Re: Extracting status code from hbase
>>
>> Oh, BTW, I meant to refer you to the test on line 178 of [0]:
>> testPutNested.
>> hth
>> Lewis
>>
>>
>> On Wed, May 29, 2013 at 7:07 PM, Lewis John Mcgibbney <
>> [email protected]> wrote:
>>
>> > This is most certainly better aimed at either the Gora or HBase lists.
>> > Getting a better (and more consistent) understanding of such data
>> > structures, and of course abstracting users away from them, is what we
>> > have been addressing in current Gora development (see GORA-174). You
>> > will want to look specifically at some of the testing we do for this
>> > stuff over in Gora, namely in [0-1].
>> > Specifically, the Query API in Gora for some data store
>> > implementations could `probably` do with some attention... so please
>> > voice your opinion over on user@gora if it tickles your fancy.
>> > Thanks
>> > Lewis
>> >
>> >
>> > [0]
>> > http://svn.apache.org/viewvc/gora/trunk/gora-core/src/test/java/org/apache/gora/store/DataStoreTestBase.java?view=markup
>> > [1]
>> > http://svn.apache.org/viewvc/gora/trunk/gora-core/src/examples/avro/webpage.json?view=markup
>> >
>> >
>> > On Wed, May 29, 2013 at 3:55 PM, Shah, Nishant <[email protected]> wrote:
>> >
>> >> Hi Everyone,
>> >>
>> >> I found my error. I was trying to use toString for a field which is an
>> >> int, float, or long. But this leads me to another question.
>> >> The protocol status is a nested structure, similar to parseStatus.
>> >> How could we parse these to get the individual majorcode, minorcode,
>> >> and args?
>> >> Also, how can we detect whether a URL has returned a 404, a 200, or any
>> >> other status code?
>> >> Thanks.
>> >>
>> >> -----Original Message-----
>> >> From: Shah, Nishant
>> >> Sent: Wednesday, May 29, 2013 1:51 PM
>> >> To: [email protected]
>> >> Subject: Extracting status code from hbase
>> >>
>> >> Hi Everyone,
>> >>
>> >> I have my Nutch 2.1 setup with HBase. Once I am done with the crawl,
>> >> I want to extract all the information from the column family 'f'.
>> >> For this I do,
>> >>
>> >> Scan s = new Scan();
>> >> ResultScanner scanner = table.getScanner(s);
>> >> try {
>> >>   // Scanners return Result instances. Iterate until next() returns null.
>> >>   for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
>> >>     // Print out the row we found and the columns we were looking for.
>> >>     System.out.println("Found row: " + rr);
>> >>     String[] rrs = getColumnsInColumnFamily(rr, "f");
>> >>     NavigableMap<byte[], byte[]> familyMap = rr.getFamilyMap(Bytes.toBytes("f"));
>> >>     for (Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
>> >>       System.out.println(Bytes.toString(entry.getKey()) + "="
>> >>           + Bytes.toString(entry.getValue()));
>> >>     }
>> >>   }
>> >> } finally {
>> >>   scanner.close(); // release the scanner's resources
>> >> }
>> >>
>> >> The value for status is blank. It's not null, but blank. The same is
>> >> the case with headers. The 'mtdt' family and the rest of the 'f'
>> >> family are fine. Can anyone suggest why this is happening?
>> >> Thanks,
>> >> Nishant
>> >>
>> >
>> >
>> >
>> > --
>> > *Lewis*
>> >
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
>
> --
> *Ferdy Galema*
> Kalooga Development
>



-- 
*Ferdy Galema*
Kalooga Development
