On Thu, Jan 9, 2014 at 6:08 PM, Li Li <[email protected]> wrote:
> thanks.
> 1. this is just a url frontier for url duplication and scheduler usage.
>
I have made a few attempts at replying to this note, but I keep running into the fact that I know little about how you are thinking of architecting the crawler: what scale you are hoping to achieve, how the crawlers will be fed and where they will put the crawled content when done, what does the page parse for new URLs to crawl and whether it is inline w/ the crawl or done offline, whether the crawl is continuous or stepped (what is wrong w/ nutch?), how revisit priority is done, how page fingerprinting works and how you will resolve finding the same page via two different URLs, etc. So my response keeps running off into speculation.

If you have a one-pager you'd like to share off-list, I could give some feedback (no problem). I have an interest in this topic.

High level, I'd consider HBase as a candidate repository for all URLs ever seen, keeping all state related to a URL for a distributed crawler. Ditto for domains.

Regarding the already-seen/dup URL lookup, and the queues the crawlers pull from (allowing inserts of higher-priority items, etc.): while you could do this in HBase, and it might be ok to start there, I think you will soon find that you want more purposed implementations. For example, if the already-seen check is done inline w/ the page crawl, a lookup into HBase would be sub-millisecond if served out of cache but milliseconds if not. You do not want your crawler to stall on a lookup of milliseconds per URL discovered.

On 6., try to make it so you can scan rather than do a point get for each URL. Less i/o.

St.Ack
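
To make the "more purposed implementation" point concrete, here is a minimal sketch of an already-seen check that answers most lookups from memory and only goes to the backing store when it has to. Everything in it is an assumption for illustration, not something specified in this thread: the Bloom filter is a swapped-in technique, the names SeenFilter and filter_new_urls and the sizes are hypothetical, and a plain Python set stands in for the HBase table.

```python
import hashlib


class SeenFilter:
    """Tiny Bloom filter: says 'definitely new' or 'maybe seen' from memory."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        # Sizes are illustrative only; a real crawl would size these
        # from the expected URL count and target false-positive rate.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # Derive the bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_seen(self, url):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


def filter_new_urls(urls, seen_filter, store):
    """Return URLs not seen before, plus how many store lookups were needed.

    `store` is a stand-in for the authoritative table (a set here).
    """
    new_urls = []
    store_lookups = 0
    for url in urls:
        if seen_filter.maybe_seen(url):
            # Possible duplicate (or a false positive): confirm in the store.
            store_lookups += 1
            if url in store:
                continue
        # Definitely new, or a false positive the store ruled out.
        seen_filter.add(url)
        store.add(url)
        new_urls.append(url)
    return new_urls, store_lookups


if __name__ == "__main__":
    store = set()
    seen = SeenFilter()
    batch = ["http://a.example/", "http://b.example/", "http://a.example/"]
    fresh, lookups = filter_new_urls(batch, seen, store)
    print(fresh, lookups)
```

The design choice is that a URL the filter has never seen costs no store round trip at all; only probable duplicates pay for the lookup. In a real crawler the store writes would presumably be batched rather than done one at a time as in this sketch.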
