Hello,
I am a new user of Nutch and, although I have looked through several manuals on the web, I still have questions. I hope you can answer them or point me to the right documentation. My questions:

1. I intend to use Nutch to crawl several specific sites and, since I know the structure of the data inside them, extract particular pieces of information (fields). After some processing, I want to load the data into Elasticsearch/SQL. Can you recommend a solution here? Plugins? I know there is a plugin that extracts pieces of data based on XPath, but I am not sure it will be flexible enough for my needs. My idea was to dump the raw HTML into Solr and use a batch parser, in Python or another language, that queries for the HTML, processes it, and then loads it into Elasticsearch/SQL.

2. How can I get the raw HTML? This is a top question according to a Google search, and the answers include: write your own plugin, grab the data from the crawldb, or grab the links and download the HTML with additional software. What is your recommended way of doing this?

3. In your presentation you mentioned that you cannot guarantee low latency. Can a new page be crawled within, for example, one day? Is that doable? I am targeting 10 sites, each with more than 100k pages, that are updated constantly.

Thank you for sharing your experience.

Best Regards,
Tigran.
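P.S. To make question 1 concrete, here is a rough sketch of the batch step I have in mind. The XPaths, field names, and index name are only placeholders for illustration; the actual Solr query and Elasticsearch client calls are omitted, and a sample HTML string stands in for a fetched document:

```python
import json
import xml.etree.ElementTree as ET

def extract_fields(html_doc):
    """Extract known fields from one (well-formed) HTML document.

    In the real pipeline html_doc would come from a Solr query over the
    stored raw HTML; here the XPaths are placeholders for site-specific ones.
    """
    root = ET.fromstring(html_doc)
    return {
        "title": root.findtext(".//title"),
        "price": root.findtext(".//span[@class='price']"),
    }

def to_bulk(docs, index="products"):
    """Render extracted documents as an Elasticsearch _bulk request body.

    The index name "products" is a placeholder; the body would be POSTed
    to the Elasticsearch _bulk endpoint (or written to SQL instead).
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Sample document standing in for one raw HTML page pulled from Solr.
sample = ("<html><head><title>Widget</title></head>"
          "<body><span class='price'>9.99</span></body></html>")
print(to_bulk([extract_fields(sample)]))
```

Would this kind of external batch step be reasonable, or is there a Nutch-native way (e.g. a parse filter plugin) that you would recommend instead?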

