Nutch: how to create a database or other storage for scraped data other than the URL?
I'm new to Nutch and am trying to develop Nutch plugins that parse the HTML content of the crawled URLs and scrape it for certain data (in my case, Bitcoin addresses). However, I learned that the Nutch lifecycle works on batches of URLs, so my question is: is it possible to store the scraped data separately from the URLs, and if so, how? (All the solutions I have found are about adding new fields to the URL records, not about storing the data I want to gather.) I'm looking for a way to store the addresses such that, when new Bitcoin addresses are discovered during crawling, they can be checked against the existing data and added only if they are not already there.
-- Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
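To make the requirement concrete, here is a minimal sketch of the "add only if not already stored" behaviour I have in mind. It is not Nutch code: `AddressStore` and `addIfAbsent` are hypothetical names, the store is just a local text file with one address per line, and it assumes everything runs in a single JVM (which ignores that Nutch jobs may run as distributed tasks).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): a file-backed set of scraped addresses. */
public class AddressStore {
    private final Path file;
    private final Set<String> seen = new HashSet<>();

    /** Loads previously stored addresses, if the file already exists. */
    public AddressStore(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            seen.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    /**
     * Stores the address only if it has not been stored before.
     * Returns true if the address was new and was appended to the file.
     */
    public synchronized boolean addIfAbsent(String address) throws IOException {
        if (!seen.add(address)) {
            return false; // already known, nothing is written
        }
        Files.write(file, List.of(address), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return true;
    }

    /** Number of distinct addresses currently known. */
    public synchronized int size() {
        return seen.size();
    }
}
```

A parse plugin would then call something like `store.addIfAbsent(address)` for each address it extracts from a page. What I don't know is where such a store should live in Nutch so that it persists across crawl batches.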