Re: Key factors for production readiness of Hedwig
Thanks for the offer to help, Milind! We've already covered most of the things Hedwig is still missing to make it more production ready. I mentioned having an engineering team mainly in the context of finding an internal team here at work to support and develop it. But it's now open sourced, so hopefully we can get enough people interested in using it who will also work on it.

Here's a list of what's missing in Hedwig:

1. Ops tools, including monitoring and administration.

2. Cleaner deployment strategies (we just have some hacky bash scripts for this now).

3. Non-star topologies, including logic for forwarding messages between regions, tools to redefine your topology on the fly (in case a datacenter region goes down for some reason), and a topology validation tool to make sure all regions are connected.

4. Policies on where messages can be sent. This could be topic based, such as defining a topic with inclusion/exclusion filters on the regions it should exist in. Additionally, a message-level mechanism might be useful, e.g. expose this to clients so that when they publish messages, they can state which regions the messages should go to. (A rough sketch of the topic-based idea follows at the end of this message.)

5. More testing to find bugs, stress the system, performance tuning, etc.

Erwin

On Tue, 2010-11-09 at 12:47 -0800, Milind Parikh wrote:

> What are the other features? How can I help? I would love to experiment
> with Hedwig. So I could help, in my spare time, with the documentation.
> Why would you need an engineering team to support it? You have this open
> sourced. Couldn't you use the power of the community? As you can see, I
> am pretty eager to try Hedwig.
>
> Regards
> -- Milind
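To make item 4 above a bit more concrete, here is a rough sketch of what a topic-level region policy could look like. Nothing like this exists in Hedwig today; the class and method names are invented purely to illustrate the inclusion/exclusion filter idea.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch of a per-topic region policy (item 4 above).
    // None of these names exist in Hedwig; they only illustrate the idea.
    public class TopicRegionPolicy {
        private final Set<String> included;  // empty = all regions allowed
        private final Set<String> excluded;  // exclusion wins over inclusion

        public TopicRegionPolicy(Set<String> included, Set<String> excluded) {
            this.included = included;
            this.excluded = excluded;
        }

        // A hub's RegionManager could consult this before forwarding a
        // published message to a remote region.
        public boolean shouldForwardTo(String region) {
            if (excluded.contains(region)) {
                return false;
            }
            return included.isEmpty() || included.contains(region);
        }

        public static void main(String[] args) {
            TopicRegionPolicy policy = new TopicRegionPolicy(
                new HashSet<String>(Arrays.asList("us-west", "us-east")),
                new HashSet<String>(Arrays.asList("eu-west")));
            System.out.println(policy.shouldForwardTo("us-west"));  // true
            System.out.println(policy.shouldForwardTo("eu-west"));  // false
            System.out.println(policy.shouldForwardTo("ap-south")); // false
        }
    }

A message-level variant could carry the same inclusion/exclusion sets as metadata on the publish request, with the hub checking them at forwarding time.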
Re: Key factors for production readiness of Hedwig
Hi Milind, Hedwig is still in the prototype phase, so it lacks several features needed to make it production ready. We are trying to find an engineering team to work on and support it.

(a) Monitoring is one of the big missing pieces for production readiness. We have some thoughts on it; the simplest solution would be to use JMX-style bindings for monitoring and admin purposes. (A minimal JMX sketch follows at the end of this message.)

(b) We have designs for non-star topologies, and the underlying pieces to support this are there (version vectors for messages published/consumed in all regions). This is something we've thought about since the beginning; however, due to lack of resources, we have not implemented it yet. There are a few enhancements that we know about, and we will shortly document them, either via a wiki page or by opening JIRAs for Hedwig.

(c) We are in the process of converting some internal documents we have for Hedwig for third-party consumption. That's a priority, since we definitely want people to know quickly what Hedwig is all about (a large-scale, multi-colo, guaranteed-delivery pub/sub system) and its architecture (shared nothing, commodity boxes, ZooKeeper for coordination, BookKeeper for persistence). We also want to emphasize that all pieces of Hedwig are modular, and you can slot in your own implementation for the various parts.

Erwin

From: Milind Parikh <milindpar...@gmail.com>
Reply-To: zookeeper-user@hadoop.apache.org
Date: Fri, 5 Nov 2010 20:08:18 -0700
To: zookeeper-user@hadoop.apache.org
Subject: Key factors for production readiness of Hedwig

> What are the key factors for production readiness of Hedwig? I would
> really like to use Hedwig in some major projects, but having a
> production-readiness stamp is important. From my perspective:
>
> (a) Monitoring
> (b) Deployment patterns (non-star based)
> (c) Documentation
>
> Regards
> -- Milind
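To give a flavor of the JMX idea in (a), below is a minimal, self-contained sketch of a hub stats MBean. The bean, its attributes, and the ObjectName domain are all made up for illustration; Hedwig ships nothing like this today.

    // HubStatsMBean.java -- JMX finds the management interface by the
    // *MBean naming convention, and the interface must be public.
    public interface HubStatsMBean {
        long getTopicsOwned();
        long getMessagesPublished();
    }

    // HubStats.java
    import java.lang.management.ManagementFactory;
    import java.util.concurrent.atomic.AtomicLong;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // Hypothetical stats a hub might expose; not part of Hedwig today.
    public class HubStats implements HubStatsMBean {
        private final AtomicLong topicsOwned = new AtomicLong();
        private final AtomicLong messagesPublished = new AtomicLong();

        public long getTopicsOwned() { return topicsOwned.get(); }
        public long getMessagesPublished() { return messagesPublished.get(); }

        // Hub code would call these as topics are acquired and publish
        // requests are handled.
        public void topicAcquired() { topicsOwned.incrementAndGet(); }
        public void messagePublished() { messagesPublished.incrementAndGet(); }

        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            server.registerMBean(new HubStats(),
                    new ObjectName("org.apache.hedwig:type=HubStats"));
            // Keep the JVM alive so jconsole can attach and browse the bean.
            Thread.sleep(Long.MAX_VALUE);
        }
    }

Once something like this is registered, the same bean could grow admin operations (e.g. releasing topic ownership) alongside the read-only attributes.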
Re: Questions about Hedwig architecture
Hi Amit, sorry for the late responses to your Hedwig questions. I just joined the mailing list, so hopefully I can shed some light on things.

First, here's a response to a question you posted last month that I don't think has been answered:

> In Hedwig, how many publishers can publish to a particular topic
> simultaneously? Any concerns / important points on message ordering?

You can have as many concurrent publishers publishing to the same topic as you like. The guarantee we make is that for each publisher, the messages are in the same relative order they were published in. They can be interleaved with messages from other publishers, though.

Here are the responses to your other questions.

1. Yes, each of the Hedwig server hubs has a RegionManager module that implements the SubscriptionEventListener interface. So when a hub's SubscriptionManager takes ownership of a topic for the very first time, it invokes all of the registered listeners. The RegionManager will then look up the list of all other regions it knows about (stored in the ServerConfiguration) and subscribe to that topic using the special topic subscriber name reserved for hub subscribers.

2. As described in #1, the SubscriptionManager is the one that knows when the hub takes ownership of a topic. If subscribers X and Y both simultaneously subscribe to a topic that the hub was previously not the master of, the first subscribe call will cause the hub to become the topic owner/master and trigger all of the registered SubscriptionEventListeners. The second subscriber's call will not trigger this anymore, since the hub now has at least one local subscriber for the topic it has just taken ownership of.

3. Region A will have a hub that is itself a subscriber to topics in all other regions. So in this case, subscriber X in Region A is interested in topic T. The hub responsible for topic T in Region A will be a remote subscriber to topic T in Region B. If messages are published to topic T in Region B, they are first sent to all the local subscribers there and then delivered to all remote subscribers (actually, there is no real ordering between the two). The hub in Region A will receive the messages (like any normal subscriber) and persist them via the PersistenceManager used. Later on, the hub's DeliveryManager will pick up each message and push it to all local subscribers. There is logic to prevent a message from being delivered back to the RegionManager's hub subscription, since that is the subscription that got the message remotely in the first place.

4. The proxy stuff was an initial attempt to create something akin to a web service call for C++ clients, before we implemented a native Hedwig C++ client. We didn't want to reproduce the logic in the Java client, so the easiest solution was a proxy type of service that the C++ clients can call, which in turn just calls the Java client. This way, the client apps that use Hedwig don't have to run a JVM on their box.

5. When a Hedwig hub dies, the order of operations is this. First off, all of the hubs are registered with ZK as ephemeral nodes, so we know which boxes are live. If one dies, its ephemeral node goes away, and ZK no longer considers that hub live. There is logic in the client that caches where it thinks a certain topic's hub master is. If it publishes to that hub directly and gets an error, it will redo the initial handshake protocol, where it publishes to the default server host first. The default server host was designed to be a VIP that fronts all of the hubs; the idea is that it will just pick one of the live hubs at random. So the client's publish request goes to a random hub, which sees a publish request for a topic. This random hub checks if it is the topic owner; if not, it looks up who the topic owner is in ZK. If nobody is the topic owner (the topic is up for grabs now, since the last topic owner hub died), a random one out of all the live hubs is chosen (weighted by each server's load). A redirect response is sent back to the client (unless that particular hub is itself the one chosen to be the topic owner). The client will then send the publish request to the redirected hub and, if successful, cache that hub locally as the owner for the given topic. (A rough sketch of this client-side logic follows at the end of this message.)

6. The current design is that every Hedwig hub knows two things. First, it knows where the local ZK server or quorum is (in case we're running replicated ZK). Second, it knows where the default server host/VIP for the hubs in each of the other regions is. This is stored in the ServerConfiguration. ZK is then the central place that knows about all the Hedwig hubs in a given region: it knows what topics there are, who is the master of each topic, what load each of the hubs currently has, etc. Note that it doesn't know anything about other regions. The Hedwig client knows only one thing: where the region's default server host is. This is so it can contact one of the hubs, which in turn can redirect it to whichever hub owns the topic.
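To make the handshake in point 5 easier to follow, here's a simplified sketch of the client-side publish/redirect loop. PublishResponse and send() are stand-ins invented for this sketch, not the real Hedwig client API, and a real client would likely do this asynchronously rather than in a blocking loop.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch of the publish/redirect handshake in point 5.
    public abstract class RedirectingPublisher {

        interface PublishResponse {
            boolean isSuccess();
            boolean isRedirect();
            String redirectHost();
        }

        private final String defaultServerHost; // VIP fronting the live hubs
        private final Map<String, String> topicOwnerCache =
            new ConcurrentHashMap<String, String>();

        protected RedirectingPublisher(String defaultServerHost) {
            this.defaultServerHost = defaultServerHost;
        }

        public void publish(String topic, byte[] message) {
            // Try the cached owner first, otherwise start at the VIP.
            String host = topicOwnerCache.containsKey(topic)
                    ? topicOwnerCache.get(topic) : defaultServerHost;
            for (int attempt = 0; attempt < 5; attempt++) {
                PublishResponse resp = send(host, topic, message);
                if (resp.isSuccess()) {
                    // Cache this hub as the topic owner for next time.
                    topicOwnerCache.put(topic, host);
                    return;
                } else if (resp.isRedirect()) {
                    // Not the owner: follow the redirect to the real owner.
                    host = resp.redirectHost();
                } else {
                    // The cached hub may have died: drop the cache entry
                    // and redo the handshake through the default host.
                    topicOwnerCache.remove(topic);
                    host = defaultServerHost;
                }
            }
            throw new RuntimeException("could not publish to topic " + topic);
        }

        // Actual wire call elided.
        protected abstract PublishResponse send(String host, String topic,
                                                byte[] message);
    }

The key design point is that the default server host/VIP is only a bootstrap: once a publish succeeds, the client talks directly to the owning hub until it sees an error, which keeps the VIP out of the steady-state path.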