how to get the depth of url in nutch

atawfik Sat, 09 Aug 2014 15:33:06 -0700

I am trying to crawl and index URLs based on their depth levels. In my
scenario, I am interested in two content types: HTML and images. For images,
I need to index every image URL regardless of its depth. For HTML content,
however, I only need to index pages that come directly from my seed
list (depth 1).


I am thinking of writing a custom indexing filter plugin that returns an empty
document when the parsed content does not satisfy the conditions above.
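
To make the rule concrete, here is a minimal sketch of the decision on its
own, kept independent of the Nutch API. The class name DepthIndexRule and its
method names are hypothetical, and it assumes the content type and the depth
are both available as plain strings. Skipping other content types, and HTML
pages whose depth is missing or unparsable, is also an assumption on my part.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: decides whether a URL should be
// indexed, given its content type and its depth as a string.
public final class DepthIndexRule {

    // Assumption from the scenario above: seed URLs have depth 1.
    static final int SEED_DEPTH = 1;

    private DepthIndexRule() {
    }

    // Parses the depth string; returns the fallback if it is missing or invalid.
    static int parseDepth(String depthString, int fallback) {
        if (depthString == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(depthString.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    // Images are indexed at any depth; HTML only at seed depth.
    // Assumption: every other content type is skipped.
    static boolean shouldIndex(String contentType, String depthString) {
        String type = contentType == null
                ? ""
                : contentType.toLowerCase(Locale.ROOT);
        if (type.startsWith("image/")) {
            return true;
        }
        if (type.startsWith("text/html")) {
            // An unknown depth falls back to a value that never equals SEED_DEPTH.
            return parseDepth(depthString, Integer.MAX_VALUE) == SEED_DEPTH;
        }
        return false;
    }
}
```

The filter would then drop the document whenever shouldIndex returns false.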

However, I do not know how to get the depth of a URL. I looked into the
scoring-depth plugin, and it seems I can get the depth using:

String depthString = parseData.getMeta(DEPTH_KEY);

Can I do that, or is there a better way?

Thanks in advance




--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-get-the-depth-of-url-in-nutch-tp4152122.html
Sent from the Nutch - User mailing list archive at Nabble.com.
