I tried the code below with avro-1.3.3.jar included. BinaryDecoder shows as deprecated. With that code I get 1 for the first value and 0 for the second, which doesn't seem to be correct. So I tried the DecoderFactory instead, as DecoderFactory.get().binaryDecoder(byteValue, null), but it says get is not defined. What do you think is going wrong here? Also, is there a way to decode this other than manually? Can we use Gora for this?

Sorry if this is the wrong place to post. I could post to the Avro, Gora or HBase list if you feel that's more appropriate.
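To make the manual decoding less of a black box, here is a minimal sketch of what BinaryDecoder does at the byte level, written in plain Java so it does not depend on any particular Avro jar. The class name AvroWireSketch and the sample bytes are made up for illustration; the field order (int, array of strings, long) is taken from the ProtocolStatus snippet quoted below. Two hedged observations: DecoderFactory.get() may simply not exist in the 1.3.3 jar (I believe it arrived in a later Avro release, but check the Javadoc for your version), and an output of 1 followed by 0 would be consistent with a code of 1, an empty args array and a last-modified of 0, though that is only a guess without seeing the bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Avro, Gora or Nutch.
// It reads Avro's binary encoding straight from a byte array.
class AvroWireSketch {
    private final byte[] buf;
    private int pos = 0;

    AvroWireSketch(byte[] buf) {
        this.buf = buf;
    }

    private int nextByte() {
        if (pos >= buf.length) {
            throw new IllegalStateException("unexpected end of data");
        }
        return buf[pos++] & 0xFF;
    }

    // Avro int and long: zigzag value stored as a base-128 varint,
    // least significant group first.
    long readLong() {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = nextByte();
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
            if (shift > 63) {
                throw new IllegalStateException("varint too long");
            }
        }
        return (raw >>> 1) ^ -(raw & 1); // undo the zigzag mapping
    }

    int readInt() {
        return (int) readLong();
    }

    // Avro string: a long byte count followed by that many UTF-8 bytes.
    String readString() {
        long len = readLong();
        if (len < 0 || len > buf.length - pos) {
            throw new IllegalStateException("bad string length " + len);
        }
        String s = new String(buf, pos, (int) len, StandardCharsets.UTF_8);
        pos += (int) len;
        return s;
    }

    // Avro array: blocks of (count, items...), terminated by a count of 0.
    // A negative count means "abs(count) items, preceded by a byte size".
    List<String> readStringArray() {
        List<String> items = new ArrayList<>();
        for (long count = readLong(); count != 0; count = readLong()) {
            if (count < 0) {
                count = -count;
                readLong(); // block size in bytes, not needed here
            }
            for (long i = 0; i < count; i++) {
                items.add(readString());
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Made-up sample record: code=1, args=["ok"], lastModified=0
        byte[] sample = {0x02, 0x02, 0x04, 'o', 'k', 0x00, 0x00};
        AvroWireSketch d = new AvroWireSketch(sample);
        int code = d.readInt();
        List<String> statusArgs = d.readStringArray();
        long lastModified = d.readLong();
        // prints: code=1 args=[ok] lastModified=0
        System.out.println("code=" + code + " args=" + statusArgs
                + " lastModified=" + lastModified);
    }
}
```

To try it on real data, pass the raw bytes of the f:prot cell to the constructor instead of the sample array. For anything beyond debugging, the SpecificDatumReader route mentioned in the thread is the safer choice, since it follows the schema rather than a hard-coded field order.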
Thanks.

-----Original Message-----
From: Ferdy Galema [mailto:[email protected]]
Sent: Thursday, May 30, 2013 1:16 AM
To: [email protected]
Subject: Re: Extracting status code from hbase

Just to add how to manually decode ProtocolStatus:

ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
BinaryDecoder bd = new BinaryDecoder(bis);
System.out.print(bd.readInt()); //first value is an Integer
//second value is an array of
for(long i = bd.readArrayStart(); i != 0; i = bd.arrayNext()) {
  for (long j = 0; j < i; j++) {
    System.out.print(bd.readString(null).toString());
  }
}
System.out.print(bd.readLong()); //last value is a Long

On Thu, May 30, 2013 at 10:10 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> I might be able to shed some light on this.
>
> Every row in HBase webtable is a serialized
> org.apache.nutch.storage.WebPage. When you use Gora related wrappers
> (like the various Nutch jobs such as ParserJob, FetcherJob) you use
> the Avro schemas so you don't have to think about how it is encoded.
> You get a WebPage object that has all fields that you specified on the
> input of a Job.
>
> When you want to bypass this for some reason, you can decode the
> fields manually:
>
> The bytes of the columns are encoded in a specific manner. I see that
> your testcode simply tries to interpret every value as a UTF-8 encoded
> string. That is what Bytes.toString(byte[]) assumes. Although this
> works for certain fields because they actually are UTF-8 encoded
> strings, some values are encoded differently. HBaseByteInterface (in
> the Gora project) shows the different encodings. For example, the
> status field f:st "status" is an Integer (as indicated in WebPage
> class) that is encoded as Bytes.toBytes(int) and should be decoded
> accordingly with Bytes.toInt(byte[]). The difficult fields, f:prot
> "ProtocolStatus" and p:st "ParseStatus", are Avro records (so they
> have a schema of their own).
> To decode those, you can use the same code as in HBaseByteInterface.
> That is, a combination of SpecificDatumReader and BinaryDecoder.
>
> Good luck.
>
> On Thu, May 30, 2013 at 4:12 AM, Shah, Nishant <[email protected]> wrote:
>
>> Thanks for the reply. Will look into your suggestions.
>>
>> -----Original Message-----
>> From: Lewis John Mcgibbney [mailto:[email protected]]
>> Sent: Wednesday, May 29, 2013 7:09 PM
>> To: [email protected]
>> Subject: Re: Extracting status code from hbase
>>
>> OH, BTW I meant to refer you to the test in line 178 of [0].
>> testPutNested
>> hth
>> Lewis
>>
>> On Wed, May 29, 2013 at 7:07 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>
>> > This is most certainly better aimed at either Gora or HBase lists.
>> > Obtaining better (and consistent) understanding and of course
>> > abstracting users from such data structures is what we have been
>> > addressing in current Gora development. (See GORA-174) You will
>> > want to look specifically at some of the testing we do for this
>> > stuff over in Gora, namely in [0-1].
>> > Specifically, the Query API in Gora for some data store
>> > implementations could `probably` do with some attention... so
>> > please voice your opinion over on user@gora if it tickles your fancy.
>> > Thanks
>> > Lewis
>> >
>> > [0] http://svn.apache.org/viewvc/gora/trunk/gora-core/src/test/java/org/apache/gora/store/DataStoreTestBase.java?view=markup
>> > [1] http://svn.apache.org/viewvc/gora/trunk/gora-core/src/examples/avro/webpage.json?view=markup
>> >
>> > On Wed, May 29, 2013 at 3:55 PM, Shah, Nishant <[email protected]> wrote:
>> >
>> >> Hi Everyone,
>> >>
>> >> I got my error. I was trying to use toString for a field which is
>> >> int or float or long. But this leads me to another question.
>> >> The protocol status is a nested structure. Similar to parseStatus.
>> >> How could we parse these to get the individual majorcode,
>> >> minorcode, args?
>> >> Also, how to detect if a url has returned a 404, or 200 or any
>> >> other status code?
>> >> Thanks.
>> >>
>> >> -----Original Message-----
>> >> From: Shah, Nishant
>> >> Sent: Wednesday, May 29, 2013 1:51 PM
>> >> To: [email protected]
>> >> Subject: Extracting status code from hbase
>> >>
>> >> Hi Everyone,
>> >>
>> >> I have my Nutch 2.1 setup with Hbase. Once I am done with the
>> >> crawl, I want to extract all the information from the column family 'f'.
>> >> For this I do,
>> >>
>> >> Scan s = new Scan();
>> >> ResultScanner scanner = table.getScanner(s);
>> >> try {
>> >>   // Scanners return Result instances.
>> >>   // Now, for the actual iteration. One way is to use a while loop like so:
>> >>   for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
>> >>     // print out the row we found and the columns we were looking for
>> >>     System.out.println("Found row: " + rr);
>> >>     String[] rrs=getColumnsInColumnFamily(rr,"f");
>> >>     NavigableMap familyMap = rr.getFamilyMap(Bytes.toBytes("f"));
>> >>     Iterator entries = familyMap.entrySet().iterator();
>> >>     while(entries.hasNext()){
>> >>       Entry thisEntry = (Entry) entries.next();
>> >>       Object key = thisEntry.getKey();
>> >>       Object val = thisEntry.getValue();
>> >>       System.out.println(Bytes.toString((byte[]) key)+"="+Bytes.toString((byte[]) val));
>> >>     }
>> >>
>> >> The value for status is blank. It's not null, but blank. Same is
>> >> the case with headers. 'mtdt' family and rest of the 'f' family is fine.
>> >> Can anyone suggest why this is happening?
>> >> Thanks,
>> >> Nishant
>> >>
>> >
>> > --
>> > *Lewis*
>> >
>>
>> --
>> *Lewis*
>>
>
> --
> *Ferdy Galema*
> Kalooga Development
>

--
*Ferdy Galema*
Kalooga Development
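For anyone finding this thread later, here is a small sketch of why Bytes.toString(byte[]) printed a blank for f:st. It uses only java.nio so it runs without HBase on the classpath; the assumption is that Bytes.toBytes(int) writes four big-endian bytes, which is the same layout ByteBuffer uses by default. The class name StatusBytesSketch and the value 2 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for HBase's Bytes.toBytes(int) / Bytes.toInt(byte[]).
class StatusBytesSketch {

    // Assumed to match Bytes.toBytes(int): 4 bytes, most significant first.
    static byte[] intToBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Assumed to match Bytes.toInt(byte[]) for a 4-byte value.
    static int bytesToInt(byte[] bytes) {
        if (bytes.length != 4) {
            throw new IllegalArgumentException("expected 4 bytes, got " + bytes.length);
        }
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        byte[] stored = intToBytes(2); // made-up status value

        // Read as UTF-8 text, the bytes 00 00 00 02 become four control
        // characters, which a terminal shows as nothing at all.
        String asText = new String(stored, StandardCharsets.UTF_8);
        System.out.println("as string: [" + asText + "] length=" + asText.length());

        // Read as an int, the value comes back intact.
        System.out.println("as int: " + bytesToInt(stored));
    }
}
```

In the scan loop from the original question, that would mean checking the qualifier and calling Bytes.toInt on the value when it is "st", rather than calling Bytes.toString on every cell.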

