On Thu, Jan 16, 2014 at 12:54 AM, Li Li <[email protected]> wrote:
> I have read this. But my question is slightly different from the previous
> one. I have four table operations:
> 1. check whether a url already exists. if it exists, do nothing;
> else insert it. I can implement this with a get first and then a
> checkAndPut in hbase.

You can accomplish this by doing an UPSERT that sets a status field on your url_db row and then, for (3) below, only selecting urls with a given status. Phoenix does not support check-and-put functionality.
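For example, a minimal sketch of that flow, assuming the url_db columns proposed later in this thread (url as the row key plus host/status/priority/add_time, with 0 = not_crawled; the priority value is a placeholder):

    -- UPSERT writes the row whether or not it already exists, so no
    -- up-front read is needed (note it would also rewrite the status
    -- of a url that has already been crawled):
    UPSERT INTO url_db (url, host, status, priority, add_time)
    VALUES ('http://www.google.com/index.html', 'www.google.com',
            0, 10, CURRENT_DATE());

    -- (3) then only picks up rows still in the not_crawled state:
    SELECT url FROM url_db
    WHERE host = 'www.google.com' AND status = 0
    ORDER BY priority, add_time
    LIMIT 100;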
> 2. select distinct hosts (maybe in the future group them and count them)
> 3. select urls where host=? and status=? order by priority, add_time limit ?
> 4. update table set status=1 where url=?
>
> If I follow your advice, I should create the table and index like this:
>
> CREATE TABLE url_db (
>     status TINYINT NOT NULL,
>     priority INTEGER NOT NULL,
>     added_time DATE NOT NULL,
>     host VARCHAR NOT NULL,
>     url VARCHAR NOT NULL
>     CONSTRAINT pk PRIMARY KEY (host, status, priority, added_time, url));
>
> CREATE INDEX my_index ON url_db (url);
>
> performance analysis:
> 1. checking whether a url exists uses the secondary index, so it's fast
> 2. select distinct host is fast as a rowkey range scan?
> 3. selecting urls by host and status, ordered by priority and
> add_time, is fast with a carefully designed rowkey range scan
> 4. update status goes through the secondary index
>
> My question is: what will Phoenix actually do?
> My guess is that updating the status will first look up the rowkey in the
> secondary index table to find the original rowkey in url_db, then do a
> Delete and a Put on url_db, and update the secondary index with a Put.
> Am I right?
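For concreteness, a sketch of that status change at the SQL level under this schema: because status is part of the primary key, the new status yields a new row, so at the statement level it is an UPSERT plus a DELETE, and Phoenix maintains my_index alongside both (the priority and added_time literals stand in for the row's current values):

    UPSERT INTO url_db (host, status, priority, added_time, url)
    VALUES ('www.google.com', 1, 10, TO_DATE('2014-01-16 00:00:00'),
            'http://www.google.com/index.html');

    DELETE FROM url_db
    WHERE url = 'http://www.google.com/index.html' AND status = 0;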
>
> On Thu, Jan 16, 2014 at 4:19 PM, James Taylor <[email protected]> wrote:
> > Hi,
> > That's some really good information about your use case. Here's an idea I
> > had for your similar question on the HBase mailing list:
> > http://mail-archives.apache.org/mod_mbox/hbase-user/201401.mbox/%3CCAG_TOPDpp%2BTWm0wZ3es6dPFwtn5x1grzrrXA-YsSHdXzRnVf%2BQ%40mail.gmail.com%3E
> >
> > Thanks,
> > James
> >
> > On Wed, Jan 15, 2014 at 3:37 AM, Li Li <[email protected]> wrote:
> >> 1. goal
> >>     I want to implement a distributed vertical (topical) crawler. It
> >> will only store webpages on a certain topic; I will have a classifier
> >> to decide this.
> >>     I estimate the number of webpages that need to be stored at about
> >> tens of millions (maybe hundreds of millions as time goes on).
> >>     A vertical crawler should crawl the pages most likely related to
> >> the target topics, so I need a frontier that can dispatch tasks by
> >> priority.
> >>     For now the priority is simple, but I hope it can support
> >> complicated priority algorithms later.
> >> 1.1 host priority
> >>     We should crawl many hosts at the same time rather than only a
> >> single host. Initially, all hosts should be crawled equally, but over
> >> time we can calculate host priorities dynamically,
> >> e.g. we can control the crawl speed for a certain host based on its
> >> crawl history (some sites will ban our crawler if we send too many
> >> concurrent requests), or we can adjust the priority of a host by
> >> whether it is relevant to our topic (computed from the relevance of
> >> pages already crawled from it).
> >> 1.2 enqueue time
> >>     Webpages enqueued first should get higher priority.
> >> 1.3 depth
> >>     Webpages with a small depth get higher priority (something like a
> >> BFS traversal).
> >> 1.4 other page priorities
> >>     e.g. PageRank, list page vs. detail page, ...
> >>
> >> 2. architecture
> >>     see picture: http://www.flickr.com/photos/114261973@N07/
> >> 2.1 Seed Discover
> >>     use google or other websites to find some seed urls
> >> 2.2 Url DB
> >>     a distributed DB to store all metadata about urls (this is the
> >> most hbase-related part)
> >> 2.3 Task Scheduler
> >>     as described before, the task scheduler selects the top N priority
> >> webpages and dispatches them to the fetcher cluster
> >> 2.4 Message Queues
> >>     we use ActiveMQ to decouple the modules and for load balancing
> >> 2.5 Fetchers
> >>     download webpages
> >> 2.6 WebPageDB
> >>     stores the crawled webpages and the metadata extracted from them
> >> (such as title, content, pub_time, author, etc.); we are considering
> >> hbase for this too
> >> 2.7 Extractors
> >>     use the classifier to judge whether a page is related to our
> >> topics, extract metadata from it, and store the results in WebPageDB
> >>
> >> 3. main challenges
> >> 3.1 Url DB
> >>     as described before, this store (maybe hbase) should support
> >> sophisticated priority algorithms, and we also use it to avoid
> >> crawling a webpage more than once
> >> 3.2 task scheduler
> >>     how to achieve the goal above
> >>
> >> 4. current solution
> >> 4.1 use hbase (maybe together with phoenix) to store urls (we have not
> >> done the schema design yet and hope to get some advice here)
> >> 4.2 scheduler algorithm (the per-host select is sketched below)
> >>     int batchSize = 10000;
> >>     // dispatch batchSize tasks across hosts according to host priorities
> >>     Map<String, Integer> hostCount = ...
> >>     // select the top-priority urls from each host
> >>     List<String> toBeCrawledUrls = new ArrayList<String>(batchSize);
> >>     for (Entry<String, Integer> entry : hostCount.entrySet()) {
> >>         // select the top N urls of a given host, by priority
> >>         List<String> urls = selectTopNUrlsFromHost(
> >>                 entry.getKey(), entry.getValue());
> >>         toBeCrawledUrls.addAll(urls);
> >>     }
> >>     // dispatch these urls to the message queue
> >>     // monitor the message queue status
> >>     // once the queue is fully (or 3/4) consumed, go back to the top
> >>     // and dispatch another batch of urls
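A sketch of the query the selectTopNUrlsFromHost helper above might issue, assuming Phoenix over the url table and the column names from (5) below (0 = not_crawled, per 5.2):

    SELECT url FROM url_db
    WHERE host = ? AND status = 0
    ORDER BY priority, add_time
    LIMIT ?  -- entry.getValue(), this host's share of the batch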
> >>
> >> 5. table columns
> >> 5.1 url varchar, e.g. http://www.google.com/index.html
> >> 5.2 status tinyint; 0: not_crawled 1: crawling 2: success 3: fail
> >> 5.3 host varchar, e.g. www.google.com
> >> 5.4 depth tinyint
> >> 5.5 crawl_time date (what's the difference between date, time and timestamp?)
> >> 5.6 add_time date
> >> 5.7 priority int
> >>
> >> 6. table schema 1
> >>     primary key: url
> >>     index (host, status) including (priority, add_time, url)
> >> 6.1 host
> >>     select distinct host from url_db;
> >> 6.2 dedup
> >>     select url from url_db where url='http://www.google.com/index.html';
> >>     btw, how can I use upsert without overwriting the previous
> >> version? something like the mysql statement:
> >>     insert ... on duplicate key ignore
> >> 6.3 select
> >>     select url from url_db where (host,
> >> status)=('www.google.com', 0) order by priority, add_time limit 100;
> >> 6.4 update status
> >>     update url_db set status=1 where
> >> url='http://www.google.com/index.html'
> >> performance analysis:
> >> 1. select host can use a rowkey scan?
> >> 2. dedup by url is very quick because url is the rowkey
> >> 3. the select is quick because of the secondary index
> >> 4. the update will not remove a row (compare with schema 2) but will
> >> Put new values into the main table
> >> 5. but I guess the update will still cause a row deletion and
> >> insertion in the index table
> >>
> >> 7. table schema 2
> >>     primary key: url
> >>     index (host) column_family(status) including (priority, add_time, url)
> >>     is this valid in phoenix? I mean, create an index on host and a
> >> column family per status?
> >>
> >> 8. table schema 3
> >>     primary key: (host, status, priority, add_time)
> >>     index (url)
> >>     all sql statements are the same as above.
> >>
> >>     performance analysis:
> >> 1. select host can use a rowkey scan?
> >> 2. the select is fast because of the rowkey
> >> 3. dedup goes through the secondary index
> >> 4. the update will cause a deletion and insertion in the main table
> >> 5. I guess the update will also cause an index table update
> >>
> >> 9. schema 4
> >>     use a dedicated host table to store only hosts; the host count is
> >> far smaller than the url count,
> >> but using another table will require client code to keep the two
> >> consistent. Also, in the future we may need queries like:
> >>     select host, count(host) as cc from url_db group by host order by
> >> cc desc;
> >>
> >> hope you can give me some advice, thanks.
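A closing sketch for (9), with hypothetical names host_db, url_count and host_priority; since the refresh below is a full aggregation over url_db, it would run periodically rather than on every update:

    CREATE TABLE host_db (
        host VARCHAR NOT NULL,
        url_count BIGINT,
        host_priority INTEGER
        CONSTRAINT pk PRIMARY KEY (host));

    -- rebuild the per-host counts from url_db in one statement
    UPSERT INTO host_db (host, url_count)
    SELECT host, COUNT(*) FROM url_db GROUP BY host;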
