Hi all,
I want to use HBase to store all the URLs for a distributed crawler.
A central scheduler schedules all uncrawled URLs by
priority. Below are my row key design and the common data access
patterns. Is there a better row key design for my use case?
The row key is: reverse_host--status--priority--MD5(path). Some examples:
com.google.www/-0-10-MD5(path1)
com.google.www/-0-9-MD5(path2)
...
com.google.www/-1-10-MD5(path3)
Status 0 means not crawled and 1 means crawled.
My scheduler:
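To make the key layout concrete, here is a minimal sketch of how such a key could be assembled, using only the JDK. The class and method names (`RowKeySketch`, `reverseHost`, `md5Hex`, `rowKey`) are made up for illustration, and hex-encoding the MD5 digest is an assumption; the raw 16 digest bytes would work as well.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of HBase: builds keys shaped like the examples above.
class RowKeySketch {
    // "www.google.com" -> "com.google.www"
    static String reverseHost(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    // MD5 of the path as 32 lowercase hex characters (the encoding is an assumption).
    static String md5Hex(String path) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(path.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // every Java platform must provide MD5
        }
    }

    // reversed host, then "/-", status, "-", priority, "-", MD5(path)
    static String rowKey(String host, int status, int priority, String path) {
        return reverseHost(host) + "/-" + status + "-" + priority + "-" + md5Hex(path);
    }

    public static void main(String[] args) {
        System.out.println(rowKey("www.google.com", 0, 10, "/index.html"));
    }
}
```

The resulting string would be converted with `Bytes.toBytes(...)` before being used as an HBase row key.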
int batchSize = 10000;
Map<String,Integer> hostCount = calcHostPriority(batchSize);
List<String> toBeCrawledUrls = ..
for (Map.Entry<String,Integer> entry : hostCount.entrySet()) {
    // select top N priority uncrawled urls for this host
    byte[] startRow = Bytes.toBytes(reverse(entry.getKey()) + "/-0");
    byte[] stopRow = Bytes.toBytes(reverse(entry.getKey()) + "/-1");
    Scan s = new Scan(startRow, stopRow);
    // cap the number of rows by counting (Scan.setMaxResultSize limits bytes, not rows)
    int remaining = entry.getValue();
    for (String url : scanResult) { // scanResult: rows returned for s
        if (remaining-- <= 0) break;
        toBeCrawledUrls.add(url);
    }
}
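The per-host range scan in the loop above can be tried out without a cluster. The sketch below is not HBase code: it uses an in-memory `TreeMap` as a stand-in for the table, relying on the fact that both keep keys in sorted order. `PrefixScanSketch` and `topUncrawled` are hypothetical names, and the short hash suffixes are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for the table: a TreeMap keeps keys sorted, like HBase rows.
class PrefixScanSketch {
    // Returns at most `limit` row keys in [reversedHost + "/-0", reversedHost + "/-1"),
    // i.e. the uncrawled rows of one host, in key order.
    static List<String> topUncrawled(TreeMap<String, String> table,
                                     String reversedHost, int limit) {
        String startRow = reversedHost + "/-0"; // inclusive, like a scan start row
        String stopRow = reversedHost + "/-1";  // exclusive, like a scan stop row
        List<String> out = new ArrayList<>();
        for (String key : table.subMap(startRow, true, stopRow, false).keySet()) {
            if (out.size() >= limit) break;     // row-count cap
            out.add(key);
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("com.google.www/-0-10-aaa", "url1");
        table.put("com.google.www/-0-9-bbb", "url2");
        table.put("com.google.www/-1-10-ccc", "url3");
        System.out.println(topUncrawled(table, "com.google.www", 1));
    }
}
```

Keys compare as plain strings here, so "-0-10-" comes before "-0-9-", which matches the order of the example keys listed earlier.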
//update after crawling
for(String url:crawledUrls){
delete url //com.google.www/-0-10-MD5(path)
put url //com.google.www/-1-10-MD5(path)
}
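A sketch of the update step above, again with a `TreeMap` standing in for the table (in HBase this would be a Delete on the status-0 row followed by a Put on the status-1 row). `MarkCrawledSketch` and `markCrawled` are hypothetical names, and the code assumes the key contains the "/-0-" marker exactly as in the examples.

```java
import java.util.TreeMap;

// In-memory stand-in: moves a row from its status-0 key to the matching status-1 key.
class MarkCrawledSketch {
    static String markCrawled(TreeMap<String, String> table, String uncrawledKey) {
        String marker = "/-0-";
        int i = uncrawledKey.indexOf(marker); // a host name cannot contain '/'
        if (i < 0) {
            throw new IllegalArgumentException("not an uncrawled key: " + uncrawledKey);
        }
        String crawledKey = uncrawledKey.substring(0, i) + "/-1-"
                + uncrawledKey.substring(i + marker.length());
        String value = table.remove(uncrawledKey);       // the "delete url" step
        if (value != null) table.put(crawledKey, value); // the "put url" step
        return crawledKey;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("com.google.www/-0-10-abc", "some url");
        System.out.println(markCrawled(table, "com.google.www/-0-10-abc"));
        System.out.println(table.keySet());
    }
}
```

Because the status is part of the row key, every status change touches two different rows.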
//check whether a url exists
Is there a better method than this?
Assuming priorities are only 1-10, try a get on each of:
com.google.www/-0-10-MD5(path)
com.google.www/-1-10-MD5(path)
com.google.www/-0-9-MD5(path)
....
com.google.www/-1-1-MD5(path)
If any of them exists, the url exists (true);
otherwise it does not (false).
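The check above amounts to probing every status/priority combination. A minimal sketch, with a `Set` of row keys standing in for the table (against HBase each probe would be a Get); `UrlExistsSketch` and `urlExists` are hypothetical names, and `pathHash` is whatever encoding of MD5(path) the keys use:

```java
import java.util.Set;
import java.util.TreeSet;

// In-memory stand-in: probes all 2 x 10 candidate keys for one url.
class UrlExistsSketch {
    static boolean urlExists(Set<String> rowKeys, String reversedHost, String pathHash) {
        for (int status = 0; status <= 1; status++) {
            for (int priority = 1; priority <= 10; priority++) {
                String key = reversedHost + "/-" + status + "-" + priority + "-" + pathHash;
                if (rowKeys.contains(key)) return true; // found under some status/priority
            }
        }
        return false; // none of the 20 candidate keys exists
    }

    public static void main(String[] args) {
        Set<String> rowKeys = new TreeSet<>();
        rowKeys.add("com.google.www/-1-3-abc");
        System.out.println(urlExists(rowKeys, "com.google.www", "abc"));
        System.out.println(urlExists(rowKeys, "com.google.www", "def"));
    }
}
```

So one existence check costs up to 20 lookups, which is the part I would like to improve.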