On Thu, Jan 16, 2014 at 12:54 AM, Li Li <[email protected]> wrote:
> I have read this. But my question is slightly different from the previous
> one. I have four table operations:
> 1. check whether a url already exists. if it exists, do nothing;
> else insert it. I can implement this with a get first and then a
> checkAndPut in hbase.

You can accomplish this by doing an UPSERT that sets a status field on your url_db row and then, for (3) below, only selecting urls with a given status. Phoenix does not support check-and-put functionality.
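For example, a minimal sketch of that flow, assuming the url_db columns proposed later in this thread (url as the row key plus host/status/priority/add_time, with 0 = not_crawled; the priority value is a placeholder):

    -- UPSERT writes the row whether or not it already exists, so no
    -- up-front read is needed (note it would also rewrite the status
    -- of a url that has already been crawled):
    UPSERT INTO url_db (url, host, status, priority, add_time)
    VALUES ('http://www.google.com/index.html', 'www.google.com',
            0, 10, CURRENT_DATE());

    -- (3) then only picks up rows still in the not_crawled state:
    SELECT url FROM url_db
    WHERE host = 'www.google.com' AND status = 0
    ORDER BY priority, add_time
    LIMIT 100;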
> 2. select distinct hosts (maybe in the future group them and count them)
> 3. select urls where host=? and status=? order by priority, add_time limit ?
> 4. update table set status=1 where url=?
>
> If I follow your advice, I should create the table and index like this:
>
> CREATE TABLE url_db (
>     status TINYINT NOT NULL,
>     priority INTEGER NOT NULL,
>     added_time DATE NOT NULL,
>     host VARCHAR NOT NULL,
>     url VARCHAR NOT NULL
>     CONSTRAINT pk PRIMARY KEY (host, status, priority, added_time, url));
>
> CREATE INDEX my_index ON url_db (url);
>
> performance analysis:
> 1. checking whether a url exists uses the secondary index, so it's fast
> 2. select distinct host is fast as a rowkey range scan?
> 3. selecting urls by host and status, ordered by priority and
> add_time, is fast with a carefully designed rowkey range scan
> 4. update status goes through the secondary index
>
> My question is: what will Phoenix actually do?
> My guess is that updating the status will first look up the rowkey in the
> secondary index table to find the original rowkey in url_db, then do a
> Delete and a Put on url_db, and update the secondary index with a Put.
> Am I right?
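For concreteness, a sketch of that status change at the SQL level under this schema: because status is part of the primary key, the new status yields a new row, so at the statement level it is an UPSERT plus a DELETE, and Phoenix maintains my_index alongside both (the priority and added_time literals stand in for the row's current values):

    UPSERT INTO url_db (host, status, priority, added_time, url)
    VALUES ('www.google.com', 1, 10, TO_DATE('2014-01-16 00:00:00'),
            'http://www.google.com/index.html');

    DELETE FROM url_db
    WHERE url = 'http://www.google.com/index.html' AND status = 0;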
>
> On Thu, Jan 16, 2014 at 4:19 PM, James Taylor <[email protected]> wrote:
> > Hi,
> > That's some really good information about your use case. Here's an idea I
> > had for your similar question on the HBase mailing list:
> > http://mail-archives.apache.org/mod_mbox/hbase-user/201401.mbox/%3CCAG_TOPDpp%2BTWm0wZ3es6dPFwtn5x1grzrrXA-YsSHdXzRnVf%2BQ%40mail.gmail.com%3E
> >
> > Thanks,
> > James
> >
> > On Wed, Jan 15, 2014 at 3:37 AM, Li Li <[email protected]> wrote:
> >> 1. goal
> >>     I want to implement a distributed vertical (topical) crawler. It
> >> will only store webpages on a certain topic; I will have a classifier
> >> to decide this.
> >>     I estimate the number of webpages that need to be stored at about
> >> tens of millions (maybe hundreds of millions as time goes on).
> >>     A vertical crawler should crawl the pages most likely related to
> >> the target topics, so I need a frontier that can dispatch tasks by
> >> priority.
> >>     For now the priority is simple, but I hope it can support
> >> complicated priority algorithms later.
> >> 1.1 host priority
> >>     We should crawl many hosts at the same time rather than only a
> >> single host. Initially, all hosts should be crawled equally, but over
> >> time we can calculate host priorities dynamically,
> >> e.g. we can control the crawl speed for a certain host based on its
> >> crawl history (some sites will ban our crawler if we send too many
> >> concurrent requests), or we can adjust the priority of a host by
> >> whether it is relevant to our topic (computed from the relevance of
> >> pages already crawled from it).
> >> 1.2 enqueue time
> >>     Webpages enqueued first should get higher priority.
> >> 1.3 depth
> >>     Webpages with a small depth get higher priority (something like a
> >> BFS traversal).
> >> 1.4 other page priorities
> >>     e.g. PageRank, list page vs. detail page, ...
> >>
> >> 2. architecture
> >>     see picture: http://www.flickr.com/photos/114261973@N07/
> >> 2.1 Seed Discover
> >>     use google or other websites to find some seed urls
> >> 2.2 Url DB
> >>     a distributed DB to store all metadata about urls (this is the
> >> most hbase-related part)
> >> 2.3 Task Scheduler
> >>     as described before, the task scheduler selects the top N priority
> >> webpages and dispatches them to the fetcher cluster
> >> 2.4 Message Queues
> >>     we use ActiveMQ to decouple the modules and for load balancing
> >> 2.5 Fetchers
> >>     download webpages
> >> 2.6 WebPageDB
> >>     stores the crawled webpages and the metadata extracted from them
> >> (such as title, content, pub_time, author, etc.); we are considering
> >> hbase for this too
> >> 2.7 Extractors
> >>     use the classifier to judge whether a page is related to our
> >> topics, extract metadata from it, and store the results in WebPageDB
> >>
> >> 3. main challenges
> >> 3.1 Url DB
> >>     as described before, this store (maybe hbase) should support
> >> sophisticated priority algorithms, and we also use it to avoid
> >> crawling a webpage more than once
> >> 3.2 task scheduler
> >>     how to achieve the goal above
> >>
> >> 4. current solution
> >> 4.1 use hbase (maybe together with phoenix) to store urls (we have not
> >> done the schema design yet and hope to get some advice here)
> >> 4.2 scheduler algorithm (the per-host select is sketched below)
> >>     int batchSize = 10000;
> >>     // dispatch batchSize tasks across hosts according to host priorities
> >>     Map<String, Integer> hostCount = ...
> >>     // select the top-priority urls from each host
> >>     List<String> toBeCrawledUrls = new ArrayList<String>(batchSize);
> >>     for (Entry<String, Integer> entry : hostCount.entrySet()) {
> >>         // select the top N urls of a given host, by priority
> >>         List<String> urls = selectTopNUrlsFromHost(
> >>                 entry.getKey(), entry.getValue());
> >>         toBeCrawledUrls.addAll(urls);
> >>     }
> >>     // dispatch these urls to the message queue
> >>     // monitor the message queue status
> >>     // once the queue is fully (or 3/4) consumed, go back to the top
> >>     // and dispatch another batch of urls
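A sketch of the query the selectTopNUrlsFromHost helper above might issue, assuming Phoenix over the url table and the column names from (5) below (0 = not_crawled, per 5.2):

    SELECT url FROM url_db
    WHERE host = ? AND status = 0
    ORDER BY priority, add_time
    LIMIT ?  -- entry.getValue(), this host's share of the batch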
> >>
> >> 5. table columns
> >> 5.1 url varchar, e.g. http://www.google.com/index.html
> >> 5.2 status tinyint; 0: not_crawled 1: crawling 2: success 3: fail
> >> 5.3 host varchar, e.g. www.google.com
> >> 5.4 depth tinyint
> >> 5.5 crawl_time date (what's the difference between date, time and timestamp?)
> >> 5.6 add_time date
> >> 5.7 priority int
> >>
> >> 6. table schema 1
> >>     primary key: url
> >>     index (host, status) including (priority, add_time, url)
> >> 6.1 host
> >>     select distinct host from url_db;
> >> 6.2 dedup
> >>     select url from url_db where url='http://www.google.com/index.html';
> >>     btw, how can I use upsert without overwriting the previous
> >> version? something like the mysql statement:
> >>     insert ... on duplicate key ignore
> >> 6.3 select
> >>     select url from url_db where (host,
> >> status)=('www.google.com', 0) order by priority, add_time limit 100;
> >> 6.4 update status
> >>     update url_db set status=1 where
> >> url='http://www.google.com/index.html'
> >> performance analysis:
> >> 1. select host can use a rowkey scan?
> >> 2. dedup by url is very quick because url is the rowkey
> >> 3. the select is quick because of the secondary index
> >> 4. the update will not remove a row (compare with schema 2) but will
> >> Put new values into the main table
> >> 5. but I guess the update will still cause a row deletion and
> >> insertion in the index table
> >>
> >> 7. table schema 2
> >>     primary key: url
> >>     index (host) column_family(status) including (priority, add_time, url)
> >>     is this valid in phoenix? I mean, create an index on host and a
> >> column family per status?
> >>
> >> 8. table schema 3
> >>     primary key: (host, status, priority, add_time)
> >>     index (url)
> >>     all sql statements are the same as above.
> >>
> >>     performance analysis:
> >> 1. select host can use a rowkey scan?
> >> 2. the select is fast because of the rowkey
> >> 3. dedup goes through the secondary index
> >> 4. the update will cause a deletion and insertion in the main table
> >> 5. I guess the update will also cause an index table update
> >>
> >> 9. schema 4
> >>     use a dedicated host table to store only hosts; the host count is
> >> far smaller than the url count,
> >> but using another table will require client code to keep the two
> >> consistent. Also, in the future we may need queries like:
> >>     select host, count(host) as cc from url_db group by host order by
> >> cc desc;
> >>
> >> hope you can give me some advice, thanks.
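A closing sketch for (9), with hypothetical names host_db, url_count and host_priority; since the refresh below is a full aggregation over url_db, it would run periodically rather than on every update:

    CREATE TABLE host_db (
        host VARCHAR NOT NULL,
        url_count BIGINT,
        host_priority INTEGER
        CONSTRAINT pk PRIMARY KEY (host));

    -- rebuild the per-host counts from url_db in one statement
    UPSERT INTO host_db (host, url_count)
    SELECT host, COUNT(*) FROM url_db GROUP BY host;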
