Re: Key factors for production readiness of Hedwig

2010-11-10 Thread Erwin Tam
Thanks for the offer to help, Milind!  We've covered most of the things
that are still missing from Hedwig to make it more production ready.  I
mentioned having an engineering team mainly in the context of finding an
internal team here at work to support and develop it.  But now that it's
open sourced, hopefully we can get enough interest from people using it
to work on it.  Here's a list of what's still missing in Hedwig.

1. Ops tools including monitoring and administration.
2. Cleaner deployment strategies (we just have some hacky bash scripts
for this now).
3. Non-star topologies, including logic for forwarding messages between
regions, tools to redefine your topology on the fly (in case a
datacenter region goes down for some reason), and a topology validation
tool to make sure all regions are connected.
4. Policies on where messages can be sent.  This could be topic based,
such as defining inclusion/exclusion filters on which regions a topic
should exist in.  Additionally, a message-level mechanism might be
useful, e.g. exposing this to clients so that when they publish
messages, they can state where they want the messages to go.
5. More testing to find bugs, stress the system, performance tuning,
etc.

Erwin

On Tue, 2010-11-09 at 12:47 -0800, Milind Parikh wrote:

 What are the other features? How can I help? I would love to experiment with
 Hedwig. So I could help, in my spare time, with the documentation. Why
 would you need an engineering team to support it? You have this open
 sourced. Couldn't you use the power of the community?
 
 As you can see, I am pretty eager to try Hedwig.
 
 Regards
 -- Milind
 
 
 On Tue, Nov 9, 2010 at 12:00 PM, Erwin Tam e...@yahoo-inc.com wrote:
 
  Hi Milind, Hedwig is still in prototype phase so it lacks several
  features to make it production ready.  We are trying to find an
  engineering team to work on and support it.
 
  (a) Monitoring is one of the big missing pieces to make it production
  ready.  We have some thoughts on it.  The simplest solution would be to
  use JMX style bindings for monitoring and admin purposes.
 
  (b) We have designs on non-star topologies and the underlying pieces to
  support this are there (version vectors for messages published/consumed
  in all regions).  This is something we've thought about since the
  beginning.  However, due to lack of resources, we have not implemented
  this yet.  There are a few enhancements that we know about, and we will
  shortly document them either via a wiki page or by opening JIRAs for Hedwig.
 
  (c) We are in the process of converting some internal documents we have
  for Hedwig for third party consumption.  That's a priority since we
  definitely want people to know quickly what Hedwig is all about (large
  scale multi-colo guaranteed delivery pub/sub system) and the
  architecture of it (shared nothing, commodity boxes, ZooKeeper for
  coordination, BookKeeper for persistence).  We also want to emphasize
  that all pieces of Hedwig are modular and you can slot in your own
  implementation for the various parts.
 
  Erwin
 
 
   From: Milind Parikh milindpar...@gmail.com
   Reply-To: zookeeper-user@hadoop.apache.org
   Date: Fri, 5 Nov 2010 20:08:18 -0700
   To: zookeeper-user@hadoop.apache.org
   Subject: Key factors for production readiness of Hedwig
  
   What are the key factors for production readiness of Hedwig?  I would
   really like to use Hedwig in some major projects, but having a production
   readiness stamp is important.  From my perspective:
  
   (a) Monitoring
   (b) Deployment Patterns (non star based)
   (c) Documentation
  
   Regards
   -- Milind
  
 
 
 


Re: Questions about Hedwig architecture

2010-11-04 Thread Erwin Tam
Hi Amit, sorry for the late responses to your Hedwig questions.  I just
joined the mailing list so hopefully I can shed some light on things.
First, here's a response to a question you posted last month that I
don't think has been answered:

 In hedwig, how many publishers can publish to a particular topic
simultaneously. 
 Any concerns / important points on message ordering?

You can have as many concurrent publishers publishing to the same topic
as you like.  The guarantee we make is that each publisher's messages
are delivered in the same relative order they were published in; they
can be interleaved with other publishers' messages though.
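To make that ordering guarantee concrete, here is a minimal Python sketch. It is not Hedwig code: `broker_merge`, `per_publisher_order_preserved`, and the message names are all invented for illustration. It models a hub appending concurrent publishes in arrival order and checks that each publisher's relative order survives the interleaving.

```python
import random

def broker_merge(publisher_batches):
    """Toy model of a hub assigning one global order to concurrent
    publishes.  Each publisher's batch arrives in order, but batches
    from different publishers interleave arbitrarily."""
    pending = {pid: list(msgs) for pid, msgs in publisher_batches.items()}
    log = []
    while any(pending.values()):
        # pick any publisher that still has messages waiting
        pid = random.choice([p for p, m in pending.items() if m])
        log.append((pid, pending[pid].pop(0)))
    return log

def per_publisher_order_preserved(log, publisher_batches):
    """Check the stated guarantee: each publisher's messages keep their
    relative order, even when interleaved with other publishers."""
    for pid, msgs in publisher_batches.items():
        delivered = [m for p, m in log if p == pid]
        if delivered != list(msgs):
            return False
    return True

batches = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
merged = broker_merge(batches)
assert per_publisher_order_preserved(merged, batches)
```

However the random interleaving comes out, the per-publisher check always passes, which is exactly the (and only the) ordering Hedwig promises across concurrent publishers.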

Here are the responses to your other questions.

1. Yes, each of the Hedwig server hubs has a RegionManager module that
implements the SubscriptionEventListener interface.  So when a hub's
SubscriptionManager takes ownership of a topic for the very first time,
it invokes all of the subscribed listeners.  The RegionManager will then
look up the list of all other regions it knows about (stored in the
ServerConfiguration) and subscribe to that topic using the special
topic subscriber name reserved for hub subscribers.

2. As described above in #1, the SubscriptionManager is the one that
knows when the hub takes ownership of a topic.  If subscribers X and Y
both simultaneously subscribe to a topic that the hub was previously not
the master of, the first subscribe call will cause the hub to become the
topic owner/master and trigger all of the registered
SubscriptionEventListeners.  The second subscriber's call will not
trigger this again, since the hub now has at least one local subscriber
for the topic it has just taken ownership of.
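The first-subscriber behavior in #1 and #2 can be sketched as follows. This is a hypothetical Python model: the class names echo the real RegionManager and SubscriptionManager, but the methods, fields, and region names are invented for illustration.

```python
class RegionManager:
    """Toy listener: on the first local subscribe to a topic, record a
    cross-region subscription to that topic in every remote region."""
    def __init__(self):
        self.remote_subscriptions = set()

    def on_first_local_subscribe(self, topic, remote_regions):
        for region in remote_regions:
            self.remote_subscriptions.add((region, topic))

class SubscriptionManager:
    """Toy subscription manager that fires its listeners only when the
    hub gains its first local subscriber for a topic."""
    def __init__(self, listeners, remote_regions):
        self.local_subscribers = {}  # topic -> set of subscriber names
        self.listeners = listeners
        self.remote_regions = remote_regions

    def subscribe(self, topic, subscriber):
        first = topic not in self.local_subscribers
        self.local_subscribers.setdefault(topic, set()).add(subscriber)
        if first:  # only the first local subscriber triggers listeners
            for listener in self.listeners:
                listener.on_first_local_subscribe(topic, self.remote_regions)

rm = RegionManager()
sm = SubscriptionManager([rm], remote_regions=["regionB", "regionC"])
sm.subscribe("T", "X")  # hub takes ownership; cross-region subscribes fire
sm.subscribe("T", "Y")  # no further listener calls
assert rm.remote_subscriptions == {("regionB", "T"), ("regionC", "T")}
```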

3. Region A will have a hub that itself is a subscriber to topics in all
other regions.  So in this case, Subscriber X in Region A is interested
in topic T.  The hub responsible for topic T in Region A will be a
remote subscriber to topic T in Region B.  If messages are published to
topic T in Region B, they are first sent to all the local subscribers
there and then delivered to all remote subscribers (in no particular
order, actually).  The hub in Region A will receive the messages (like any
normal subscriber) and persist them via the PersistenceManager used.
Later on, the hub DeliveryManager will pick up that message and then
push it to all local subscribers.  There is logic to prevent it from
being delivered to the RegionManager's hub subscription since it is the
one that got the message remotely in the first place.
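A minimal sketch of that loop-prevention check, assuming a reserved hub subscriber name; the constant and function here are invented for illustration, not the real Hedwig identifiers.

```python
HUB_SUBSCRIBER = "__hub__"  # invented stand-in for the reserved hub name

def deliver(topic_subscribers, received_remotely):
    """Deliver to local subscribers, skipping the hub's own remote
    subscription when the message already came from another region,
    so it isn't echoed back across regions."""
    delivered = []
    for sub in topic_subscribers:
        if received_remotely and sub == HUB_SUBSCRIBER:
            continue  # the hub subscription fetched this message itself
        delivered.append(sub)
    return delivered

subs = ["X", "Y", HUB_SUBSCRIBER]
# A remotely received message goes only to real local subscribers:
assert deliver(subs, received_remotely=True) == ["X", "Y"]
# A locally published message also reaches the hub subscription,
# which is what forwards it out to the other regions:
assert deliver(subs, received_remotely=False) == ["X", "Y", HUB_SUBSCRIBER]
```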

4. The proxy stuff was an initial attempt to create something akin to a
web service call for C++ clients, before we implemented a native Hedwig
C++ client.  We didn't want to reproduce the logic in the Java client,
so the easiest solution was a proxy type of service that the C++ clients
can call, which in turn just calls the Java client.  This is so that
client apps using Hedwig don't have to run a JVM on their box.

5. When a Hedwig hub dies, the order of operations is this.  First off,
all of the hubs are registered with ZK as ephemeral nodes, so ZK knows
which boxes are live.  If one dies, its ephemeral node goes away and ZK
no longer considers that hub live.  There is logic in the client
that caches where it thinks a certain topic's hub master is.  If it
publishes to that hub directly and gets an error, it will redo the
initial handshake protocol, where it publishes to the default server
host first.  The default server host was designed to be a VIP that
fronts all of the hubs; the idea is that it just picks one of the
live hubs at random.  So the client's publish request goes to a random
hub, which sees a publish request for a topic.  This hub checks
whether it is the topic owner; if not, it looks up who the topic owner
is in ZK.  If nobody is the topic owner (topic is up for grabs now since
the last topic owner hub died), a random one out of all the live hubs is
chosen (weighted by each server's load).  A redirect response is sent
back to the client (unless that particular hub is randomly chosen to be
the topic owner).  The client will then send the publish request to the
redirected hub, and if successful, cache that locally as the hub that is
the owner for the given topic.

6. The current design is that every Hedwig hub knows two things.  First
is where the local ZK server or quorum is (in case we're running
replicated ZK).  Second, it knows where the default server host/VIP for
the hubs in all other regions are.  This is stored in the
ServerConfiguration.  ZK will then be the central place that knows about
all the Hedwig hubs in a given region.  It knows what topics there are,
who is the master of which topic, what load each of the hubs currently
has, etc.  Note that it doesn't know anything about other regions.

The Hedwig client knows only one thing, where the region's default
server host is.  This is so it can contact one of the hubs which in turn
can redirect it to whoever the appropriate hub is that owns