Hi, I might be able to shed some light on this.
Every row in the HBase webtable is a serialized org.apache.nutch.storage.WebPage. When you use the Gora-related wrappers (such as the Nutch jobs ParserJob and FetcherJob), the Avro schemas handle the encoding, so you don't have to think about it. You get a WebPage object containing all the fields you specified on the input of the job.

If you want to bypass this for some reason, you can decode the fields manually. The bytes of each column are encoded in a specific manner. I see that your test code simply tries to interpret every value as a UTF-8 encoded string, which is what Bytes.toString(byte[]) assumes. This works for some fields because they really are UTF-8 encoded strings, but other values are encoded differently. HBaseByteInterface (in the Gora project) shows the different encodings.

For example, the status field f:st "status" is an Integer (as indicated in the WebPage class) that is encoded with Bytes.toBytes(int) and should be decoded accordingly with Bytes.toInt(byte[]).

The difficult fields are f:prot "ProtocolStatus" and p:st "ParseStatus", which are Avro records (so they have a schema of their own). To decode those, you can use the same code as in HBaseByteInterface, that is, a combination of SpecificDatumReader and BinaryDecoder.

Good luck.

On Thu, May 30, 2013 at 4:12 AM, Shah, Nishant <[email protected]> wrote:

> Thanks for the reply. Will look into your suggestions.
>
> -----Original Message-----
> From: Lewis John Mcgibbney [mailto:[email protected]]
> Sent: Wednesday, May 29, 2013 7:09 PM
> To: [email protected]
> Subject: Re: Extracting status code from hbase
>
> OH, BTW I meant to refer you to the test in line 178 of [0]. testPutNested
> hth Lewis
>
>
> On Wed, May 29, 2013 at 7:07 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > This is most certainly better aimed at either Gora or HBase lists.
> > Obtaining better (and consistent) understanding and of course
> > abstracting users from such data structures is what we have been
> > addressing in current Gora development. (See GORA-174) You will want
> > to look specifically at some of the testing we do for this stuff over
> > in Gora, namely in [0-1].
> > Specifically, the Query API in Gora for some data store
> > implementations could `probably` do with some attention... so please
> > voice your opinion over on user@gora if it tickles your fancy.
> > Thanks
> > Lewis
> >
> > [0]
> > http://svn.apache.org/viewvc/gora/trunk/gora-core/src/test/java/org/apache/gora/store/DataStoreTestBase.java?view=markup
> > [1]
> > http://svn.apache.org/viewvc/gora/trunk/gora-core/src/examples/avro/webpage.json?view=markup
> >
> > On Wed, May 29, 2013 at 3:55 PM, Shah, Nishant <[email protected]> wrote:
> >
> >> Hi Everyone,
> >>
> >> I got my error. I was trying to use toString for a field which is int
> >> or float or long. But this leads me to another question.
> >> The protocol status is a nested structure. Similar to parseStatus.
> >> How could we parse these to get the individual majorcode, minorcode, args?
> >> Also, how to detect if a url has returned a 404, or 200 or any other
> >> status code?
> >> Thanks.
> >>
> >> -----Original Message-----
> >> From: Shah, Nishant
> >> Sent: Wednesday, May 29, 2013 1:51 PM
> >> To: [email protected]
> >> Subject: Extracting status code from hbase
> >>
> >> Hi Everyone,
> >>
> >> I have my Nutch 2.1 setup with Hbase. Once I am done with the crawl,
> >> I want to extract all the information from the column family 'f'.
> >> For this I do,
> >>
> >> Scan s = new Scan();
> >> ResultScanner scanner = table.getScanner(s);
> >> try {
> >>   // Scanners return Result instances.
> >>   // Now, for the actual iteration. One way is to use a while loop
> >>   // like so:
> >>   for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
> >>     // print out the row we found and the columns we were looking for
> >>     System.out.println("Found row: " + rr);
> >>     String[] rrs = getColumnsInColumnFamily(rr, "f");
> >>     NavigableMap familyMap = rr.getFamilyMap(Bytes.toBytes("f"));
> >>     Iterator entries = familyMap.entrySet().iterator();
> >>     while (entries.hasNext()) {
> >>       Entry thisEntry = (Entry) entries.next();
> >>       Object key = thisEntry.getKey();
> >>       Object val = thisEntry.getValue();
> >>       System.out.println(Bytes.toString((byte[]) key) + "=" + Bytes.toString((byte[]) val));
> >>     }
> >>
> >> The value for status is blank. It's not null, but blank. Same is the
> >> case with headers. 'mtdt' family and rest of the 'f' family is fine.
> >> Can anyone suggest why this is happening?
> >> Thanks,
> >> Nishant
> >>
> >
> > --
> > *Lewis*
>
> --
> *Lewis*

--
*Ferdy Galema*
Kalooga Development
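
P.S. To illustrate the f:st point above, here is a minimal sketch of why an int column prints as "blank" when read as a string, and how it reads back as a number. It deliberately uses only the JDK so it runs without HBase on the classpath: Bytes.toBytes(int) writes 4 big-endian bytes, which java.nio.ByteBuffer does by default as well. The class name StatusDecodeSketch, its helper methods, the strict length check and the stored value 2 are all made up for illustration; in real code you would simply call Bytes.toInt(byte[]) on the column value. The Avro record fields (f:prot, p:st) are not covered here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class StatusDecodeSketch {

    // Stand-in for Bytes.toBytes(int): 4 bytes, big-endian.
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Stand-in for Bytes.toInt(byte[]): reads the 4 big-endian bytes back.
    // The exact-length check is a choice made for this sketch.
    static int decodeInt(byte[] raw) {
        if (raw == null || raw.length != 4) {
            throw new IllegalArgumentException("expected exactly 4 bytes");
        }
        return ByteBuffer.wrap(raw).getInt();
    }

    // What the original test code effectively did: treat the bytes as UTF-8 text.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] stored = encodeInt(2); // assumed example value for f:st

        // Read as text, the bytes are control characters: not null, but blank.
        String asText = decodeAsUtf8(stored);
        System.out.println("as text, trimmed, is empty: " + asText.trim().isEmpty());

        // Read as an int, the value comes back intact.
        System.out.println("as int: " + decodeInt(stored));
    }
}
```

The same reasoning applies to any other column that holds a long or float: pick the decoder that matches the type declared in the WebPage class rather than the string decoder.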

