is this rowkey schema feasible?

Li Li Thu, 09 Jan 2014 02:42:57 -0800

Hi all,
    I want to use HBase to store all URLs for a distributed crawler.
A central scheduler schedules all uncrawled URLs by priority. Below
are my rowkey design and my common data access patterns. Is there a
better rowkey design for my use case?


    The row key is: reverse_host--status--priority--MD5(path). Some examples:
    com.google.www/-0-10-MD5(path1)
    com.google.www/-0-9-MD5(path2)
    ...
    com.google.www/-1-10-MD5(path3)
    Status 0 means not crawled and 1 means crawled.
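
For reference, a minimal sketch of how such a key could be assembled, following the layout of the examples above. The class and method names (RowKeySketch, reverseHost, md5Hex, rowKey) are hypothetical, and hex-encoding the MD5 digest is an assumption; the post does not say how the digest is encoded.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class RowKeySketch {
    // Reverse the dot-separated labels: "www.google.com" -> "com.google.www".
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    // Hex-encoded MD5 of the path (assumption: hex keeps the key printable).
    static String md5Hex(String path) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(path.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Layout taken from the examples: reverse_host/-status-priority-MD5(path).
    static String rowKey(String host, int status, int priority, String path) {
        return reverseHost(host) + "/-" + status + "-" + priority + "-" + md5Hex(path);
    }
}
```

With this layout, all keys for one host and one status share a common prefix, which is what the range scan in the scheduler relies on.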
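    My scheduler (table is the HTable that holds the urls):
    int batchSize = 10000;
    Map<String,Integer> hostCount = calcHostPriority(batchSize);
    List<String> toBeCrawledUrls = ..
    for (Map.Entry<String,Integer> entry : hostCount.entrySet()) {
        // select the top N priority uncrawled urls for this host
        byte[] startRow = Bytes.toBytes(reverse(entry.getKey()) + "/-0");
        byte[] stopRow = Bytes.toBytes(reverse(entry.getKey()) + "/-1");
        Scan s = new Scan(startRow, stopRow);
        // Scan.setMaxResultSize() caps bytes, not rows, so count rows here
        int n = 0;
        ResultScanner scanner = table.getScanner(s);
        for (Result r : scanner) {
            if (n++ >= entry.getValue()) break;
            // adds the row key; the url itself would be read from a column
            toBeCrawledUrls.add(Bytes.toString(r.getRow()));
        }
        scanner.close();
    }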
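
The range scan above can be simulated without a cluster. The following is a rough sketch only: a sorted in-memory set stands in for the table, ScanRangeSketch and its methods are hypothetical names, and the MD5 parts are shortened to placeholders. For ASCII keys, Java string ordering matches the lexicographic byte ordering HBase uses for row keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class ScanRangeSketch {
    // Return up to `limit` keys in [startRow, stopRow), in sorted order,
    // mimicking a bounded scan with a client-side row cap.
    static List<String> scan(TreeSet<String> keys, String startRow, String stopRow, int limit) {
        List<String> out = new ArrayList<>();
        for (String key : keys.subSet(startRow, true, stopRow, false)) {
            if (out.size() >= limit) break;
            out.add(key);
        }
        return out;
    }

    // Sample keys; the trailing "aaaa" etc. are placeholders for MD5(path).
    static TreeSet<String> sampleKeys() {
        TreeSet<String> keys = new TreeSet<>();
        keys.add("com.google.www/-0-10-aaaa");
        keys.add("com.google.www/-0-9-bbbb");
        keys.add("com.google.www/-0-2-cccc");
        keys.add("com.google.www/-1-10-dddd");
        keys.add("com.yahoo.www/-0-10-eeee");
        return keys;
    }
}
```

Calling scan(sampleKeys(), "com.google.www/-0", "com.google.www/-1", 2) returns the priority 10 key followed by the priority 2 key, because the priority is compared as text: "10" sorts before "2", and "2" before "9". If the scan is meant to return rows strictly by priority, a fixed-width (and possibly inverted) priority field might be worth considering; that is a suggestion, not part of the design above.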

    //update after crawling
    for(String url:crawledUrls){
         delete url //com.google.www/-0-10-MD5(path)
         put url //com.google.www/-1-10-MD5(path)
    }
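
Because the status is part of the row key, marking a url as crawled means writing a new row and deleting the old one. A small sketch of deriving the new key from the old one follows; StatusFlipSketch and crawledKey are hypothetical names, and the key layout is the one from the examples above.

```java
final class StatusFlipSketch {
    // Derive the "crawled" key from an "uncrawled" key by swapping the status field.
    // Neither the reversed host nor a hex MD5 contains '/', so the first "/-0-"
    // is the status separator.
    static String crawledKey(String uncrawledKey) {
        int i = uncrawledKey.indexOf("/-0-");
        if (i < 0) {
            throw new IllegalArgumentException("not an uncrawled key: " + uncrawledKey);
        }
        return uncrawledKey.substring(0, i) + "/-1-" + uncrawledKey.substring(i + 4);
    }
}
```

The delete and the put touch two different rows, and HBase only guarantees atomicity within a single row, so a failure between the two steps could leave the url under both keys or under neither, depending on the order chosen.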

    //check whether a url exists
    Is there any better method than this?
    Assuming priorities are only 1-10, try a get on each of:
        com.google.www/-0-10-MD5(path)
        com.google.www/-1-10-MD5(path)
        com.google.www/-0-9-MD5(path)
        ....
        com.google.www/-1-1-MD5(path)
    If any of them exists, return true;
    otherwise return false.
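
To make the cost of this check concrete, here is a sketch that enumerates the candidate keys for one url: 2 statuses times 10 priorities, so 20 lookups. ExistsCheckSketch and candidateKeys are hypothetical names, and the priority range 1-10 is the assumption stated above.

```java
import java.util.ArrayList;
import java.util.List;

final class ExistsCheckSketch {
    // Enumerate every key under which the url could be stored:
    // each priority from 10 down to 1, with status 0 and then status 1.
    static List<String> candidateKeys(String reversedHost, String pathMd5Hex) {
        List<String> keys = new ArrayList<>();
        for (int priority = 10; priority >= 1; priority--) {
            for (int status = 0; status <= 1; status++) {
                keys.add(reversedHost + "/-" + status + "-" + priority + "-" + pathMd5Hex);
            }
        }
        return keys;
    }
}
```

The resulting keys could be turned into a list of Get objects and sent as one batched call rather than 20 separate round trips, though each key is still a separate lookup on the server side.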

Is this rowkey schema feasible?
