Hi Lewis,

I realise I was thinking about NUTCH-880, not NUTCH-932, which is indeed about retrieving crawl results as JSON.
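Incidentally, the kind of JSON record I'd expect to come out of it would be something along these lines. A minimal sketch using Jackson, with made-up field names rather than whatever NUTCH-932 actually produces:

    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal sketch of serializing one crawl record as JSON.
    // The field names are invented for the example; they are not
    // the actual Nutch schema or the output format of NUTCH-932.
    public class CrawlRecordAsJson {
        public static void main(String[] args) throws Exception {
            Map<String, Object> record = new LinkedHashMap<>();
            record.put("url", "http://www.example.com/products/123");
            record.put("title", "Example product page");
            record.put("status", "fetched");
            record.put("fetchTime", 1341957240000L); // epoch millis

            // Jackson renders the map as a JSON object string, e.g.
            // {"url":"http://www.example.com/products/123","title":...}
            System.out.println(new ObjectMapper().writeValueAsString(record));
        }
    }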
> From my own pov it appears that Nutch 2.X is 'closer' to the model
> required for a multiple backends implementation although there is
> still quite a bit of work to do here.

Backend for crawl storage != target of the exporter/indexer.

> What I am slightly confused about, which hasn't been mentioned on
> this particular issue, is whether individual Gora modules would make
> up part of the stack or whether the abstraction would somehow be
> written @Nutch side... of course this then gets a bit more tricky
> when we begin thinking about current 1.X and how to progress with a
> suitable long term vision.

This is definitely on the Nutch side and applies in the same way for 1.x and 2.x. Think of it as a pluggable indexer: regardless of which backend is used for storing the crawl table, you might want to send some of the content (with possible transformations) to e.g. SOLR, ElasticSearch, a text file, a database, etc. At the moment we are limited to SOLR, which is OK as most people use Nutch for indexing / searching, but the point is that we should have more flexibility.

I have used the term 'pluggable indexer' before, as well as 'pluggable exporter'. I suppose the difference is whether we take care of finding which URLs should be deleted (indexer) or just dump a snapshot of the content (exporter). See comments on https://issues.apache.org/jira/browse/NUTCH-1047
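As a rough sketch of what I have in mind (all names below are illustrative, none of this is an existing Nutch API), the contract for such a pluggable backend might look something like this, with each target (SOLR, ElasticSearch, a text file, a database...) providing its own implementation:

    import java.io.Closeable;
    import java.io.IOException;
    import java.util.Map;

    // Hypothetical contract for a pluggable indexer/exporter backend.
    // None of these names exist in Nutch today; this is just a sketch.
    public interface IndexBackend extends Closeable {

        // Connect to the target, e.g. a Solr URL, an ES cluster or a file path.
        void open(Map<String, String> params) throws IOException;

        // Add or update one document; the fields have already gone through
        // whatever transformations were configured (e.g. conversion to JSON).
        void write(Map<String, Object> fields) throws IOException;

        // Indexer case only: remove documents for URLs that should be deleted.
        // A pure exporter that just dumps a snapshot would leave this a no-op.
        void delete(String url) throws IOException;

        // Flush any buffered documents to the backend.
        void commit() throws IOException;
    }

The indexer/exporter distinction above then mostly boils down to whether delete() ever gets called.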
> On Wed, Jul 11, 2012 at 8:54 AM, Julien Nioche
> <[email protected]> wrote:
> > I'd think that this would be more a case for the universal exporter
> > (a.k.a. multiple indexing backends) that we mentioned several times.
> > The REST API is more a way of piloting a crawl remotely. It could
> > certainly be twisted into doing all sorts of things but I am not sure
> > it would be very practical when dealing with very large data. Instead,
> > having a pluggable exporter would allow you to define what backend
> > you want to send the data to and what transformations to do on the
> > way (e.g. convert to JSON). Alternatively, a good old custom map
> > reduce job is the way to go.
> >
> > HTH
> >
> > Jul
> >
> > On 10 July 2012 22:42, Lewis John Mcgibbney <[email protected]> wrote:
> >> Hi,
> >>
> >> I am looking to create a dataset for use in an example scenario
> >> where I want to create all the products you would typically find in
> >> the online Amazon store, e.g. loads of products with different
> >> categories, different prices, titles, availability, condition, etc.
> >> One way I was thinking of doing this was using the above API written
> >> into Nutch 2.X to get the results as JSON; these could then
> >> hopefully be loaded into my product table in my datastore and we
> >> could begin to build up the database of products.
> >>
> >> Having never used the REST API directly, I wonder if anyone has any
> >> information on this and whether I can obtain some direction relating
> >> to producing my crawl results as JSON. I'm also going to look into
> >> Andrzej's patch in NUTCH-932, so I'll try to update this thread once
> >> I make some progress with it.
> >>
> >> Thanks in advance for any sharing of experiences with this one.
> >>
> >> Best
> >> Lewis
> >>
> >> --
> >> Lewis
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
>
> --
> Lewis

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

