Re: GSoC: Avro Serialization over HBase

Mihai Soloi Tue, 12 Jun 2012 04:33:15 -0700

On 12.06.2012 11:30, Eric Charles wrote:

Hi Mihai,


Glad to hear your exams are over (I hope they went fine) :)

Hi Eric,

Thanks, they went very well, I got high marks.

As Ioan said, Avro serialization in HBase will be deprecated in favor of Protobuf (if I understand correctly...).

I think Avro could be swapped for Protobuf rather easily, as they're both doing basically the same thing, except that Avro uses JSON schemas and can be used with any other language, which is of no value to the project.

I also like Avro because it gives you serialization & storage format in one box, but is this what we want? The key point here is rather effective access to the persisted data.

If the data is passed through Avro, serialization and deserialization are basically handled by Avro, but we'll always have to interact with the schemas. In Protobuf the objects are compiled into our classes; from what I gather it's mostly useful for RPC. Avro also has a protocol, though, and by using the avro-maven-plugin you can generate your own classes to interact with. I can't say I'm an expert in either, but I fancy Avro.

There have been a few attempts so far to marry HBase and Lucene (see [1], [2], [3] and [4] for example; see also [5] for a more recent article).

Thank you for the GitHub links, I will look thoroughly through the projects. I was already aware of HBasene and Solandra (formerly Lucandra); they have similar approaches.

The questions I am wondering:
1. Will you focus on a 'generic' solution (reusable outside James), or on a very specific one tuned/optimized only for James mailbox needs?

I was thinking of writing generic code so that maybe it could be used outside of James, but the data format would be specific to James mailbox needs, so the answer in the end is that it will be tuned for James.

2. What strategy will you take (custom Directory or custom IndexReader/Writer, usage of Coprocessor or not...)?

I was thinking that a custom Directory was the way to go, but I soon realized that it's not as simple as it sounds, and that overriding the higher-level classes IndexReader and IndexWriter would be more appropriate (as in article [5]). By bypassing the Directory I would have to make use of HBase Coprocessors. As far as I can tell, a RegionObserver would be employed to gather data on frequently performed operations for the Lucene queries and Endpoints.
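To make the "bypass the Directory" idea a bit more concrete, here is a minimal sketch of how term postings might be laid out as rows. It is only an illustration under stated assumptions: a TreeMap stands in for an HBase table (both keep rows sorted by key), and the row key layout (field, a zero separator, then the term) as well as every class and method name are hypothetical, not taken from article [5], Lucene or James.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch only: a sorted in-memory map stands in for an HBase table.
// Hypothetical layout: row key = field + '\0' + term, value = sorted doc ids.
class TermIndexSketch {
    private static final char SEP = '\u0000';

    // HBase keeps rows sorted by key; TreeMap gives a comparable ordering here.
    private final NavigableMap<String, TreeSet<Long>> rows = new TreeMap<>();

    static String rowKey(String field, String term) {
        return field + SEP + term;
    }

    // Roughly what a custom index writer would do for each term of a document.
    void addPosting(String field, String term, long docId) {
        rows.computeIfAbsent(rowKey(field, term), k -> new TreeSet<>()).add(docId);
    }

    // Exact term lookup: a single row read (comparable to a Get).
    List<Long> docsFor(String field, String term) {
        TreeSet<Long> ids = rows.get(rowKey(field, term));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    // Prefix query: a range scan over adjacent row keys (comparable to a Scan).
    // Simplification: terms containing '\uFFFF' right after the prefix are missed.
    List<String> termsWithPrefix(String field, String prefix) {
        String start = rowKey(field, prefix);
        String stop = start + Character.MAX_VALUE;
        List<String> terms = new ArrayList<>();
        for (String key : rows.subMap(start, true, stop, false).keySet()) {
            terms.add(key.substring(field.length() + 1));
        }
        return terms;
    }
}
```

With a layout like this, a term query becomes one row read and a prefix query a short scan, which is the kind of access a custom IndexReader would need. Whether a Coprocessor is required on top of that depends on how much of the query is pushed to the region servers.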



[1] https://github.com/akkumar/hbasene
[2] https://github.com/thkoch2001/lucehbase
[3] https://github.com/jasonrutherglen/HBASE-SEARCH
[4] https://github.com/jasonrutherglen/LUCENE-FOR-HBASE
[5] http://www.infoq.com/articles/LuceneHbase


