There are a couple of projects inside Yahoo! that use ZooKeeper as an
event manager for feed processing.
I am a little bit unclear on your example below. As I understand it:
1. There are 1 million feeds that will be stored in HBase.
2. A map-reduce job will be run on these feeds to find out which feeds need
to be fetched.
3. This will create queues in ZooKeeper for fetching the feeds.
4. Workers will pull items from these queues and process the feeds.
Did I understand that correctly? Also, if the above is the case, how many
queue items would you anticipate accumulating every hour?
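If that reading is right, the usual ZooKeeper queue pattern is sequential
znodes under a queue path, with workers claiming the lowest-numbered node by
deleting it. Here is a minimal sketch of that pattern plus the
"one znode per bunch of URLs" batching idea from the quoted mail, simulated
with an in-memory dict so it runs without a live ensemble; the znode paths,
batch size, and URLs are illustrative assumptions, not from the thread:

```python
# Simulated ZooKeeper work queue.  A real implementation would call
# create("/feeds/queue/item-", data, sequence=True) against a live ensemble;
# the dict here only mimics the sequential-znode ordering semantics.

class SimulatedQueue:
    """Each item gets a monotonically increasing path such as
    /feeds/queue/item-0000000001, like ZooKeeper sequence nodes."""

    def __init__(self):
        self.znodes = {}   # path -> payload
        self.counter = 0

    def enqueue(self, payload):
        path = "/feeds/queue/item-%010d" % self.counter
        self.counter += 1
        self.znodes[path] = payload
        return path

    def dequeue(self):
        # Workers take the lowest-numbered znode and delete it; in real
        # ZooKeeper the delete is the atomic "claim" of the item.
        if not self.znodes:
            return None
        path = min(self.znodes)
        return self.znodes.pop(path)

def batch_urls(urls, batch_size=100):
    """One queue item per bunch of URLs, keeping the znode count far
    below the feed count (1 million znodes would be heavy for ZooKeeper)."""
    for i in range(0, len(urls), batch_size):
        yield urls[i:i + batch_size]

# Hourly producer (the map-reduce job) enqueues batches; workers drain them.
queue = SimulatedQueue()
due_feeds = ["http://example.org/feed/%d" % i for i in range(250)]
for batch in batch_urls(due_feeds, batch_size=100):
    queue.enqueue(batch)
# 250 URLs in batches of 100 -> 3 queue items instead of 250.
```

With batching, even refetching all 1 million feeds every hour would produce
on the order of 10,000 queue items per hour at 100 URLs per item, rather
than 1 million znode creations.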
On 4/12/10 1:21 AM, "Thomas Koch" <tho...@koch.ro> wrote:
> I'd like to implement a feed loader with Hadoop and most likely HBase. I've
> got around 1 million feeds that should be loaded and checked for new entries.
> However the feeds have different priorities based on their average update
> frequency in the past and their relevance.
> The feeds (url, last_fetched timestamp, priority) are stored in HBase. How
> could I implement the fetch queue for the loaders?
> - An hourly map-reduce job to produce new queues for each node and save them
> on the nodes?
> - but how to know which feeds have been fetched in the last hour?
> - what to do, if a fetch node dies?
> - Store a fetch queue in ZooKeeper and add to the queue with map-reduce each
> hour?
> - Isn't that too much load for ZooKeeper? (I could make one znode for a
> bunch of URLs...?)
> - Use Gearman to store the fetch queue?
> - But the Gearman job server still seems to be a SPOF
>  http://gearman.org
> Thank you!
> Thomas Koch, http://www.koch.ro
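The hourly selection step in the quoted mail (deciding which of the HBase
rows (url, last_fetched, priority) are due for fetching) could be sketched
as below; the priority-to-interval mapping and the row layout are
illustrative assumptions, not something stated in the thread:

```python
# Sketch of the per-hour "which feeds are due" check that the map-reduce
# job would run over the HBase rows.  All intervals are assumed values.

INTERVALS = {              # seconds between fetches, per priority class
    "high": 3600,          # roughly every hour
    "medium": 6 * 3600,    # every six hours
    "low": 24 * 3600,      # daily
}

def due_feeds(rows, now):
    """Return the URLs whose refetch interval has elapsed at time `now`
    (all times are Unix timestamps in seconds)."""
    return [
        row["url"]
        for row in rows
        if now - row["last_fetched"] >= INTERVALS[row["priority"]]
    ]

rows = [
    {"url": "http://a.example/feed", "last_fetched": 0,     "priority": "high"},
    {"url": "http://b.example/feed", "last_fetched": 90000, "priority": "low"},
]
print(due_feeds(rows, now=100000))   # only the high-priority feed is due
```

Writing the new `last_fetched` timestamp back to HBase when a worker
finishes a batch would also answer the "which feeds have been fetched in the
last hour" question without any extra bookkeeping in the queue itself.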