Hi,

That's some really good information about your use case. Here's an idea
I had for your similar question on the HBase mailing list:
http://mail-archives.apache.org/mod_mbox/hbase-user/201401.mbox/%3CCAG_TOPDpp%2BTWm0wZ3es6dPFwtn5x1grzrrXA-YsSHdXzRnVf%2BQ%40mail.gmail.com%3E
Thanks,
James

On Wed, Jan 15, 2014 at 3:37 AM, Li Li <[email protected]> wrote:
> 1. goal
>     I want to implement a distributed vertical (topical) crawler. It
> will only store webpages of a certain topic; I will have a classifier
> to do this.
>     I estimate the number of webpages that need to be stored at tens
> of millions (maybe hundreds of millions as time goes on).
>     A vertical crawler should crawl the pages most likely related to
> my target topics, so I need a frontier that can dispatch tasks by
> priority.
>     For now the priority scheme is simple, but we hope it can handle
> more complicated priority algorithms later.
>     1.1 host priority
>         we should crawl many hosts rather than only one single host
> at the same time. Initially, each host should be crawled equally, but
> over time we can calculate host priority dynamically.
>         e.g. we can control the crawl speed for a given host from its
> crawl history (some sites will ban our crawler if we hit them with
> too many concurrent threads), or we can adjust the priority of a host
> by whether it is relevant to our topic (we can calculate the
> relevance of the pages already crawled from it).
>     1.2 enqueue time
>         webpages enqueued earlier should get higher priority
>     1.3 depth
>         webpages with a small depth should get higher priority
> (something like a BFS traversal)
>     1.4 other page priorities
>         e.g. PageRank, list page vs. detail page ...
>
> 2. architecture
>     see picture: http://www.flickr.com/photos/114261973@N07/
>     2.1 Seed Discover
>         use Google or other websites to find some seed urls
>     2.2 Url DB
>         a distributed DB to store all metadata about urls (that's the
> most HBase-related part)
>     2.3 Task Scheduler
>         as described before, the task scheduler selects the top N
> priority webpages and dispatches them to the fetcher cluster
>     2.4 Message Queues
>         we use ActiveMQ to decouple the modules and also for load
> balancing
>     2.5 Fetchers
>         download webpages
>     2.6 WebPageDB
>         stores the crawled webpages and the metadata extracted from
> them (title, content, pub_time, author, etc.). We are considering
> HBase for this too.
>     2.7 Extractors
>         use the classifier to judge whether a page is related to our
> topics, extract metadata from it, and store it in WebPageDB
>
> 3. main challenges
>     3.1 Url DB
>         as described before, this store (maybe HBase) should support
> sophisticated priority algorithms, and we also use it to avoid
> crawling a webpage more than once.
>     3.2 task scheduler
>         how to achieve the goal above
>
> 4. current solution
>     4.1 use HBase (maybe together with Phoenix) to store the urls
> (we have not done the schema design yet, hoping to get some advice
> here)
>     4.2 scheduler algorithm
>         int batchSize = 10000;
>         // dispatch batchSize tasks across hosts by host priority
>         Map<String, Integer> hostCount = ...
>         // select the top-priority urls from each host
>         List<String> toBeCrawledUrls = new ArrayList<String>(batchSize);
>         for (Entry<String, Integer> entry : hostCount.entrySet()) {
>             // select the top N priority urls from a given host
>             List<String> urls = selectTopNUrlsFromHost(
>                     entry.getKey(), entry.getValue());
>             toBeCrawledUrls.addAll(urls);
>         }
>         // dispatch these urls to the message queue
>         // monitor the message queue status
>         // when the queue is fully (or 3/4) consumed, go back to the
>         // top and dispatch another batch of urls
>
> 5. table columns
>     5.1 url        varchar   e.g. http://www.google.com/index.html
>     5.2 status     tinyint   0: not_crawled 1: crawling 2: success 3: fail
>     5.3 host       varchar   e.g. www.google.com
>     5.4 depth      tinyint
>     5.5 crawl_time date      (what's the difference between date, time and timestamp?)
>     5.6 add_time   date
>     5.7 priority   int
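
To make the schema discussion concrete, here is a rough sketch of how the
column list above could be expressed as a Phoenix table (the name url_db
and the exact types are only placeholders, not a recommendation). On the
date question: if I remember correctly, Phoenix's DATE and TIME are both
millisecond-precision epoch values under the hood, while TIMESTAMP adds
sub-millisecond precision, but it's worth double-checking the type docs.

    -- hypothetical Phoenix DDL mirroring the 5.1-5.7 column list
    CREATE TABLE IF NOT EXISTS url_db (
        url        VARCHAR NOT NULL PRIMARY KEY,  -- 5.1, also the row key
        status     TINYINT,                       -- 5.2
        host       VARCHAR,                       -- 5.3
        depth      TINYINT,                       -- 5.4
        crawl_time DATE,                          -- 5.5
        add_time   DATE,                          -- 5.6
        priority   INTEGER                        -- 5.7
    );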
>
> 6. table schema 1
>     primary key: url
>     index (host, status) including (priority, add_time, url)
>     6.1 host
>         select distinct host from url_db;
>     6.2 dedup
>         select url from url_db where url='http://www.google.com/index.html'
>         btw, how can I use upsert without overwriting the previous
> version? something like the mysql statement:
>         insert into ... on duplicate key ignore
>     6.3 select
>         select url from url_db where (host, status)=('www.google.com', 0)
>             order by priority, add_time limit 100;
>     6.4 update status
>         update url_db set status=1 where
> url='http://www.google.com/index.html'
>     performance analysis:
>         1. can "select distinct host" use a rowkey scan?
>         2. dedup by url is very quick because url is the rowkey
>         3. the select is quick because of the secondary index
>         4. the update will not remove a row (compared with schema 2)
>            but Put new values into the main table
>         5. but I guess the update will still cause a row deletion and
>            insertion
>
> 7. table schema 2
>     primary key: url
>     index(host), column_family(status) including (priority, add_time, url)
>     is this valid in phoenix? I mean, create an index on host and
> create a column family by status?
>
> 8. table schema 3
>     primary key: (host, status, priority, add_time)
>     index(url)
>     all sql statements are the same as above.
>
>     performance analysis:
>         1. can "select distinct host" use a rowkey scan?
>         2. the select is fast because of the rowkey
>         3. dedup goes through the secondary index
>         4. an update will cause a deletion and insertion in the main table
>         5. I guess an update will also cause an index table update
>
> 9. schema 4
>     use a dedicated host table that stores only hosts; the host count
> is far smaller than the url count.
>     but using another table will require client code to keep the two
> consistent, and also maybe in the future we need queries like:
>     select host, count(host) as cc from url_db group by host order by
> cc desc;
>
>     hope you can give me some advice, thanks.
>
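
For schema 1, the covered index could look roughly like the sketch below
(the index name is a placeholder). One note: since url is the table's
primary key, Phoenix carries it in the index rows automatically, so it
should not need to be listed in the INCLUDE clause. Running EXPLAIN on
the actual queries would confirm whether they are served from the index.

    -- hypothetical covered index for schema 1
    CREATE INDEX IF NOT EXISTS url_host_status_idx
        ON url_db (host, status)
        INCLUDE (priority, add_time);

    -- the 6.3 "select" written against that index
    SELECT url FROM url_db
    WHERE host = 'www.google.com' AND status = 0
    ORDER BY priority, add_time
    LIMIT 100;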

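On the 6.2 dedup question: as far as I know, Phoenix's UPSERT always
writes the row, so there is no direct equivalent of MySQL's
"insert ... on duplicate key ignore". A simple (but non-atomic) pattern
is to check for the row first and only upsert when it is absent; doing
the check-and-write atomically would mean dropping down to HBase's
checkAndPut. A sketch, assuming the url_db table above:

    -- existence check (same query as 6.2)
    SELECT url FROM url_db
    WHERE url = 'http://www.google.com/index.html';

    -- only issue the UPSERT when the SELECT returned no row
    UPSERT INTO url_db (url, status, host, depth, add_time, priority)
    VALUES ('http://www.google.com/index.html', 0, 'www.google.com', 1,
            CURRENT_DATE(), 0);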