I'd like to implement a feed loader with Hadoop and most likely HBase. I have
around 1 million feeds that should be loaded and checked for new entries.
However, the feeds have different priorities based on their relevance and
their average update frequency in the past.
The feeds (url, last_fetched timestamp, priority) are stored in HBase. How
could I implement the fetch queue for the loaders?
- An hourly map-reduce job to produce new queues for each node and save them
  on the nodes?
    - But how do I know which feeds have been fetched in the last hour?
    - What should happen if a fetch node dies?
- Store a fetch queue in ZooKeeper and add to the queue with map-reduce each
    - Isn't that too much load for ZooKeeper? (I could make one znode for a
      bunch of urls...?)
- Use Gearman to store the fetch queue?
    - But the Gearman job server still seems to be a SPOF.
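
To make the first option above more concrete, here is a minimal sketch in plain Python of what the hourly job might compute. It does not touch HBase, Hadoop, or ZooKeeper. Everything beyond the three stored fields (url, last_fetched, priority) is an assumption made for illustration: the `Feed` class, the `BASE_INTERVAL` constant, the rule that priority 1 is the most urgent and that the refetch interval grows linearly with the priority number, and the hash-based assignment of feeds to nodes.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Feed:
    url: str
    last_fetched: int  # Unix timestamp in seconds
    priority: int      # assumption: 1 = most urgent, larger = less urgent


# Hypothetical base interval: a priority-1 feed is refetched every 15 minutes.
BASE_INTERVAL = 900


def refetch_interval(priority):
    """Assumed rule: the interval grows linearly with the priority number."""
    return BASE_INTERVAL * priority


def due_feeds(feeds, now):
    """Return the feeds whose refetch interval has elapsed, most overdue first."""
    due = [f for f in feeds
           if now - f.last_fetched >= refetch_interval(f.priority)]
    # Order by how overdue a feed is relative to its own interval.
    due.sort(key=lambda f: (now - f.last_fetched) / refetch_interval(f.priority),
             reverse=True)
    return due


def assign_to_nodes(due, nodes):
    """Split the due feeds into one queue per node using a stable URL hash."""
    queues = {node: [] for node in nodes}
    for feed in due:
        index = zlib.crc32(feed.url.encode("utf-8")) % len(nodes)
        queues[nodes[index]].append(feed)
    return queues


if __name__ == "__main__":
    now = 4000
    feeds = [
        Feed("http://example.org/a", last_fetched=0, priority=1),
        Feed("http://example.org/b", last_fetched=0, priority=4),
        Feed("http://example.org/c", last_fetched=3500, priority=1),
    ]
    queues = assign_to_nodes(due_feeds(feeds, now), ["node1", "node2"])
    for node, queue in queues.items():
        print(node, [f.url for f in queue])
```

Under these assumptions, the two sub-questions of the first option might partly answer themselves: if a loader writes last_fetched back only after a successful fetch, then a feed fetched within the last hour is simply not due, and a feed that sat in the queue of a dead node still has its old timestamp and is selected again on the next run.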
Thomas Koch, http://www.koch.ro