Nutch: how to create a database or other storage for scraped data other than the URL?

2019-03-23 Thread hxdariux
I'm new to Nutch and am trying to develop Nutch plugins that parse the HTML
content of the crawled URLs and scrape it for certain data (in my case,
bitcoin addresses). However, I learned that the Nutch lifecycle produces
batches of URLs, so my question is: is it possible to store the scraped data
separately, and if so, how? (All the solutions I found are mostly concerned
with adding new fields to URLs, not with storing the data I want to gather.)

I'm looking for a way to store the addresses such that, when new bitcoin
addresses are discovered during the crawl, they can be checked against the
existing data and added only if they don't exist yet.
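
To make the requirement concrete, here is a minimal sketch of the "check against existing data, add only if new" behaviour I have in mind. It is not tied to any Nutch API: the class name AddressStore, the method addIfAbsent, and the choice of a plain text file with one address per line are all hypothetical, chosen only for illustration. A real setup would more likely use a database with a unique constraint on the address.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Nutch: keeps one address per line in a
// text file and remembers every address it has already stored.
class AddressStore {
    private final Path file;
    private final Set<String> seen = new HashSet<>();

    // Loads any addresses saved by earlier crawl rounds, if the file exists.
    AddressStore(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    seen.add(trimmed);
                }
            }
        }
    }

    // Returns true and appends the address to the file only if it is new;
    // returns false and writes nothing if it was already stored.
    synchronized boolean addIfAbsent(String address) throws IOException {
        if (!seen.add(address)) {
            return false;
        }
        byte[] bytes = (address + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        Files.write(file, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return true;
    }
}
```

A parse plugin would call addIfAbsent once for each address it extracts. Because the set is reloaded from the file on construction, addresses found in earlier batches are still recognised as duplicates. This sketch assumes a single process writes to the file; parse tasks running in parallel on several machines would need a shared store instead.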



--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html