Dear all,

I want to adapt Nutch to crawl a single large text-based website, and 
therefore to develop plugins and tune settings for the best crawling 
performance.

Specifically, the website has three categories: A, B, C. The URLs are 
therefore website/A/itemN, website/B/articleN, website/C/descriptionN.
For example, category A contains web-shop style pages with price, 
ratings, etc.; B has article pages including a header, text, author, and so on.

1) How do I write an HTML parser that produces different key-value pairs for 
different URL patterns (different HTML layouts), e.g. NameOfItem and Price for 
children of website/A/, Header and Text for children of website/B/?
Should I implement an HTML parser from scratch, or can I add my parsing on 
top of the existing one? Where is the best place to do this, and how should I 
distinguish between the different URL categories?
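To make the question concrete, here is a minimal sketch of the dispatch logic I have in mind. This is plain Java, not actual Nutch API: the URL patterns and field names are just my assumptions from the three URL shapes above. In a real plugin I imagine this would live in an HtmlParseFilter implementation, with the extracted values stored in the parse metadata.

```java
import java.util.regex.Pattern;

/**
 * Sketch only (not Nutch code): decide per-URL which category a page
 * belongs to, and which field names we would want to extract for it.
 */
public class CategoryDispatch {

    // Hypothetical patterns matching the three URL shapes in the question.
    private static final Pattern CAT_A = Pattern.compile(".*/A/item\\d+$");
    private static final Pattern CAT_B = Pattern.compile(".*/B/article\\d+$");
    private static final Pattern CAT_C = Pattern.compile(".*/C/description\\d+$");

    /** Return the category letter for a URL, or "?" if none matches. */
    public static String categorize(String url) {
        if (CAT_A.matcher(url).matches()) return "A";
        if (CAT_B.matcher(url).matches()) return "B";
        if (CAT_C.matcher(url).matches()) return "C";
        return "?";
    }

    /** Field names we would extract for each category (assumed names). */
    public static String[] fieldsFor(String category) {
        switch (category) {
            case "A": return new String[] {"NameOfItem", "Price", "Rating"};
            case "B": return new String[] {"Header", "Text", "Author"};
            case "C": return new String[] {"Description"};
            default:  return new String[0];
        }
    }

    public static void main(String[] args) {
        String url = "http://website/A/item17";
        String cat = categorize(url);
        System.out.println(url + " -> category " + cat
                + ", fields: " + String.join(", ", fieldsFor(cat)));
    }
}
```

Is extending an existing parse filter with something like this the intended approach, or is there a better extension point?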

2) Assume I have turned off external links and crawl only internally. I would 
like to crawl each category to a different depth: for example, 50,000 pages 
in category A, 10,000 in B, and only 100 in C. What is the best way to 
achieve this?
There is a URL filter plugin, but I don't know how to use it based on a URL 
pattern or on the parent URL's metadata.
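For the pattern part I assume something like the following in conf/regex-urlfilter.txt would restrict the crawl to the three category paths (the URLs here are placeholders for the real site; the first matching rule wins):

```
# Hypothetical entries: keep only the three category paths...
+^https?://website/A/item\d+$
+^https?://website/B/article\d+$
+^https?://website/C/description\d+$
# ...and reject everything else.
-.
```

But as far as I can tell, a static filter like this cannot enforce per-category page counts (50,000 / 10,000 / 100), since it only sees one URL at a time with no counter. Would that require a custom URLFilter or scoring plugin that keeps per-category state, or is there an existing mechanism for it?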

Thank you.
Semyon.
