Hi Mihai,
Glad to hear your exams are over (I hope they went fine) :)
As Ioan said, Avro serialization in HBase will be deprecated in favor of
Protobuf (if I understand correctly...).
I also like Avro because it gives you a serialization & storage format in
one box, but is this what we want? The key point here is rather efficient
access to the persisted data.
There have been a few attempts so far to marry HBase and Lucene (see
[1], [2], [3] and [4] for example; see also [5] for a more recent article).
The questions I am wondering about:
1. Will you focus on a 'generic' solution (reusable outside James), or
on a very specific one tuned/optimized only for James mailbox needs?
2. What strategy will you take (custom Directory or custom
IndexReader/Writer, usage of Coprocessor or not...)?
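For question 2, the "custom Directory" route usually boils down to mapping each Lucene index file onto fixed-size blocks stored under row keys derived from the file name and block number. Here is a toy sketch of just that block mapping, with an in-memory map standing in for the HBase table (all names are made up for illustration; this is not the Lucene Directory or HBase client API):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of the "custom Directory" strategy: a Lucene index file is
// split into fixed-size blocks, each stored under a row key of the form
// "<fileName>/<blockNumber>". A TreeMap stands in for the HBase table.
class BlockStore {
    static final int BLOCK_SIZE = 4; // tiny for the example; real code would use e.g. 16 KB

    private final Map<String, byte[]> table = new TreeMap<>();

    // Rough equivalent of an IndexOutput: write a whole "file" as blocks (Puts).
    void writeFile(String name, byte[] data) {
        for (int block = 0; block * BLOCK_SIZE < data.length; block++) {
            int from = block * BLOCK_SIZE;
            int to = Math.min(from + BLOCK_SIZE, data.length);
            byte[] chunk = new byte[to - from];
            System.arraycopy(data, from, chunk, 0, to - from);
            table.put(name + "/" + block, chunk);
        }
    }

    // Rough equivalent of IndexInput.seek + readBytes: random access by block (Gets).
    byte[] read(String name, int offset, int length) {
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            int pos = offset + i;
            byte[] chunk = table.get(name + "/" + (pos / BLOCK_SIZE));
            out[i] = chunk[pos % BLOCK_SIZE];
        }
        return out;
    }
}
```

The point of the sketch is the trade-off behind question 2: with this mapping, random reads become HBase Gets, so block size and caching dominate search latency.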
It would be good if you sketched the answers on your MAILBOX-173 with a
little architecture diagram (and also brought back into MAILBOX-173 the
useful information I read on
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/mihaisoloi/1).
Thx,
Eric
[1] https://github.com/akkumar/hbasene
[2] https://github.com/thkoch2001/lucehbase
[3] https://github.com/jasonrutherglen/HBASE-SEARCH
[4] https://github.com/jasonrutherglen/LUCENE-FOR-HBASE
[5] http://www.infoq.com/articles/LuceneHbase
On 06/11/2012 08:01 PM, Mihai Soloi wrote:
On 11.06.2012 20:49, Ioan Eugen Stan wrote:
Hi Mihai,
After a quick look...
2012/6/11 Mihai Soloi <[email protected]>:
Hello Eugen and everybody on the list,
I've completed my exams, but I've also done some work on the project
lately: I've been reading up on the HBase API and Avro API
specifications [1] so that I can get to know them better.
If you need to store Avro objects (basically arrays of bytes) in HBase,
then you need to store a schema with the data, for example in the header
of the file, so that you can still read it later if the schema changes
radically over time. Of course, Avro does support some degree of schema
evolution; if you look at my test code [0] you'll see that I was able to
extend an existing schema and prove that it works with backward
compatibility. I've followed Boris Lublinsky's article [4] on using Avro
to get more familiar with it.
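The backward-compatibility rule exercised in that test is Avro's schema resolution: a field added to the reader's schema with a default value is filled in when it is absent from data written with the old schema. Stripped of the Avro API, the rule can be illustrated by hand like this (a hand-rolled illustration of the resolution rule, not Avro code):

```java
import java.util.HashMap;
import java.util.Map;

// Hand-rolled illustration of Avro's "new field with default" resolution:
// records written with the old schema lack the new field, so the reader
// supplies the default declared in its own (newer) schema.
class SchemaResolution {
    // readerDefaults: field name -> default value from the reader's schema.
    static Map<String, Object> resolve(Map<String, Object> oldRecord,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> resolved = new HashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            Object value = oldRecord.containsKey(field.getKey())
                    ? oldRecord.get(field.getKey())      // value present in old data
                    : field.getValue();                  // fall back to the schema default
            resolved.put(field.getKey(), value);
        }
        return resolved;
    }
}
```

This is why adding a field is backward compatible in Avro only when the new field declares a default.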
Great. It's nice to experiment.
I've encountered a situation in which I do want to store my data through
Avro in HBase (due to lower memory use, a structured format, and HBase
integration), and I see that there is a class in
"org.apache.hadoop.hbase.avro" called AvroServer, which basically starts
up a server through which all sorts of clients can interact with the data
store, along with generated classes (e.g. AColumnValues, APut, AGet, etc.).
As far as I can tell, these classes are used to translate requests to the
server into HBase Puts and Gets, also using AvroUtils, but I don't know
if this is the way to go.
AvroServer is deprecated in 0.94 and scheduled to be removed in 0.96
(https://issues.apache.org/jira/browse/HBASE-5948). AvroServer
handles the RPC service to use Avro instead of Writables.
Serialization = save an object to disk/file/network and load it into
memory again in the same way (deserialization). We need to
serialize/deserialize a Lucene index into HBase in an efficient way (we
care about indexing speed, search speed, and how much disk/RAM it's
going to cost us).
Please read
http://stackoverflow.com/questions/2486721/what-is-a-data-serialization-system.
Another thing I've been considering is using Sam Pullara's HAvroBase
implementation [2] and code on GitHub [3]. Sam proposes storing only a
hash code of the schema with each record, with the schemas themselves
stored separately. HAvroBase is much more than I would need, as it also
supports MySQL, MongoDB, etc., so I could use only the storage part for
the Lucene IndexWriter.
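The schema-hash trick can be sketched with plain JDK classes (just the fingerprinting and lookup; the actual Avro Schema type and the HBase puts are left out, and the class/method names here are made up, not HAvroBase's API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of the HAvroBase idea: store only a fingerprint of the schema
// next to each serialized record, and keep the full schema JSON in a
// separate (small) schema table keyed by that fingerprint.
class SchemaRegistry {
    private final Map<String, String> schemaTable = new HashMap<>();

    // Fingerprint = hex MD5 of the schema JSON. (HAvroBase hashes the
    // schema too; the exact digest algorithm is an implementation choice.)
    static String fingerprint(String schemaJson) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(schemaJson.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    // Register the schema once; every record then carries only the hash.
    String register(String schemaJson) throws Exception {
        String fp = fingerprint(schemaJson);
        schemaTable.put(fp, schemaJson);
        return fp;
    }

    // On read: resolve the fingerprint stored with the record back to the schema.
    String resolve(String fp) {
        return schemaTable.get(fp);
    }
}
```

The payoff is that each record carries a fixed 16-byte hash instead of the full schema JSON, while schema evolution still works because old fingerprints keep resolving to their old schemas.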
I think HAvroBase does a bit more than what we need. It's a bit
generic and I think we can do without adding it as a dependency. The
Lucene index format is not likely to change that much.
Another way to go is to assume that there will never be a change in the
object schemas and just store the data the way it is. This is dangerous
because if there is a change, we would have to change code instead of a
simple JSON schema.
The way Lucene stores the postings list is pretty standard and will
probably not change that much. I think using Avro is enough.
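For reference, the "pretty standard" part is essentially delta encoding of the sorted document numbers plus variable-length integers, the same general scheme as Lucene's VInt. A simplified sketch of the idea (not Lucene's actual file format):

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Simplified postings encoding: sorted doc IDs are stored as deltas,
// each delta written as a variable-length int (7 bits per byte, high
// bit = "more bytes follow"), low-order bits first.
class Postings {
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int delta = docId - previous;
            previous = docId;
            while ((delta & ~0x7F) != 0) {        // more than 7 bits remain
                out.write((delta & 0x7F) | 0x80); // continuation bit set
                delta >>>= 7;
            }
            out.write(delta);                     // final byte, high bit clear
        }
        return out.toByteArray();
    }

    static List<Integer> decode(byte[] bytes) {
        List<Integer> docIds = new ArrayList<>();
        int previous = 0;
        for (int i = 0; i < bytes.length; ) {
            int delta = 0, shift = 0, b;
            do {
                b = bytes[i++] & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            docIds.add(previous);
        }
        return docIds;
    }
}
```

Because the deltas between nearby doc IDs are small, most of them fit in a single byte, which is where the disk/RAM savings come from.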
[0]
http://code.google.com/a/apache-extras.org/p/mailbox-lucene-index-hbase/source/browse/LuceneTest/src/test/java/org/apache/james/mailbox/lucene/avro/AvroInheritanceTest.java
[1] http://avro.apache.org/docs/current/spec.html
[2]
http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/
[3] https://github.com/spullara/havrobase
[4]
http://www.infoq.com/articles/ApacheAvro
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
--
eric | http://about.echarles.net | @echarles