Re: is this rowkey schema feasible?

Stack Fri, 10 Jan 2014 09:42:15 -0800

On Thu, Jan 9, 2014 at 6:08 PM, Li Li <[email protected]> wrote:

> thanks.
> 1. this is just a url frontier for url duplication and scheduler usage.
>



I have made a few attempts at replying to this note, but I keep
running into the fact that I know little about how you are thinking of
architecting the crawler: what scale you are hoping to achieve, how the
crawlers will be fed and where they will put the crawled content when done,
what does the page parse for new URLs to crawl and whether that is inline w/
the crawl or done offline, whether the crawl is continuous or stepped (what
is wrong w/ nutch?), how revisit priority is done, how pages are
fingerprinted and how you will resolve finding the same page via two
different URLs, etc. So my response keeps running off into speculation.

If you have a one pager you'd like to share off-list, I could give some
feedback (no problem).  I have an interest in this topic.

At a high level, I'd consider HBase as a candidate repository for all URLs
ever seen, keeping all state related to a URL for a distributed crawler.
Ditto for domains.

Regarding the already-seen/dup URL lookup, and the queues the crawlers pull
from (which need to allow inserts of higher priority items, etc.): while you
could do this in HBase -- and it might be ok to start here -- I think you
will soon find that you want more purpose-built implementations.  For
example, if the already-seen check is being done inline w/ the page crawl, a
lookup into hbase would be sub-millisecond if served from cache but
milliseconds if not.  You do not want your crawler to stall on a
millisecond lookup per URL discovered, etc.
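
To make the "purpose-built" idea concrete, here is a minimal sketch of one
possible approach: an in-memory Bloom filter in front of the slower store, so
that most never-seen URLs are answered without a remote lookup.  The Bloom
filter is my illustration, not something settled in this thread, and the
names (BloomFilter, SeenUrls, check_and_add) and sizes are hypothetical.  A
plain Python set stands in for the HBase table.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # Sizes are illustrative assumptions, not tuned for any real crawl.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url)
        )


class SeenUrls:
    """Already-seen check that only consults the slow store on a filter hit."""

    def __init__(self, store):
        # Hypothetical stand-in for the HBase table: anything supporting
        # `in` and `.add()`. Assumed empty here; a real filter would have to
        # be rebuilt from the store on startup.
        self.store = store
        self.filter = BloomFilter()
        self.store_lookups = 0  # counts the expensive reads we could not avoid

    def check_and_add(self, url):
        """Return True if url was already seen; otherwise record it."""
        if self.filter.might_contain(url):
            # Possible hit (or a false positive): confirm against the store.
            self.store_lookups += 1
            if url in self.store:
                return True
        # Definitely new: remember it. A real crawler would batch this write.
        self.filter.add(url)
        self.store.add(url)
        return False
```

With this shape the crawler only pays for a store read when the filter says
"maybe", which for a crawl dominated by new URLs should be the minority of
checks.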

On 6., try to arrange it so you can scan rather than do a point get for
each URL.  Less i/o.
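
A rough sketch of the difference, under assumptions of mine: the row key
layout below (reversed host followed by path) is hypothetical and is not the
schema from your mail, and SortedTable is a toy in-memory stand-in for a
table with sorted rows, not the HBase client API.  It counts calls as a crude
proxy for i/o.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by key."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.round_trips = 0  # one per get() or scan() call

    def get(self, key):
        # Point lookup: one round trip per key.
        self.round_trips += 1
        i = bisect.bisect_left(self.keys, key)
        return i < len(self.keys) and self.keys[i] == key

    def scan(self, start, stop):
        # Range read over [start, stop): one round trip for the whole range.
        self.round_trips += 1
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return self.keys[lo:hi]


def row_key(host, path):
    # Hypothetical layout: reversed host then path, so that all URLs of one
    # host sort next to each other ("example.com" + "/a" -> "com.example/a").
    return ".".join(reversed(host.split("."))) + path


def unseen_by_scan(table, host, paths):
    """Return the paths of `host` not yet in the table, using a single scan."""
    prefix = row_key(host, "/")
    # Every key starting with `prefix` sorts below prefix + U+10FFFF.
    seen = set(table.scan(prefix, prefix + "\U0010ffff"))
    return [p for p in paths if row_key(host, p) not in seen]
```

Checking three URLs from one host this way costs one scan, where the
equivalent point gets cost three calls.  Whether the scan wins in practice
depends on how many rows the host has relative to how many URLs you are
checking.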

St.Ack
