Dear all,

I want to adapt Nutch to crawl a single large text-based website, and to develop the plugins and settings that give the best crawling performance for that one site.
To be precise, the website has three categories, A, B, and C, with URLs of the form website/A/itemN, website/B/articleN, and website/C/descriptionN. For example, category A contains web-shop-like pages with a price, ratings, etc., while B contains article pages with a header, body text, author, and so on.

1) How do I write an HTML parser that produces different key-value pairs for different URL patterns (i.e. different HTML layouts), e.g. NameOfItem and Price for children of website/A, but Header and Text for children of website/B? Should I implement an HTML parser from scratch, or can I add this extraction on top of the existing parsing? Where is the best place to do it, and how should I distinguish between the different URL categories?

2) Assume I have turned off external links and crawl only internally. I would like to crawl each category to a different depth: for example, 50000 pages in category A, 10000 in B, and only 100 in C. What is the best way to do this? I know there is a URL filter plugin, but I don't know how to apply it based on a URL pattern or on the parent URL's metadata.

Thank you,
Semyon
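To make question 1 concrete, here is a minimal sketch of how I imagine classifying URLs into my three categories before deciding which key-value pairs to extract. The host name "website" and the category prefixes come from my example above; the class and method names are just placeholders, not anything from Nutch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryClassifier {
    // One capturing group for the category segment right after the host.
    // Matches e.g. http://website/A/item42 -> "A".
    private static final Pattern CATEGORY =
        Pattern.compile("https?://[^/]+/([ABC])/.*");

    /** Returns "A", "B", or "C", or null if the URL is in no known category. */
    public static String categoryOf(String url) {
        Matcher m = CATEGORY.matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(categoryOf("http://website/A/item42"));   // A
        System.out.println(categoryOf("http://website/B/article7")); // B
        System.out.println(categoryOf("http://website/other/page")); // null
    }
}
```

My idea is that a parse filter could call something like this and then pick the extraction rules (price fields vs. article fields) based on the returned category, but I don't know where in Nutch's plugin chain this belongs.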
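For question 2, the part I do understand is restricting the crawl to the three category trees with the regex URL filter; a sketch of what I have in mind for regex-urlfilter.txt follows (the host name is hypothetical). What this cannot express, as far as I can tell, is the per-category page limit (50000 / 10000 / 100), which is exactly what I am asking about:

```
# Accept only the three category trees of the one site being crawled.
+^https?://website/A/
+^https?://website/B/
+^https?://website/C/
# Reject everything else.
-.
```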