Hi Julien,

Just to share our experiences with using Nutch 2.0:
Indexing in Nutch actually has nothing to do with indexing itself. It just
selects some fields from a WebPage, does some very minimal processing (both
typically in the indexing filter plugins), and sends the result to a writer.

What I notice is that we tend to develop IndexingFilter/IndexingWriter
combinations for exporting/pushing data to different external systems
(Solr, ElasticSearch, ...), because not only do these systems use different
formats/interfaces (handled by the IndexingWriter), they may also support
different use cases and thus require different fields (done in the
IndexingFilter).

Since indexing is the obvious use case here, I can understand the naming of
this process, but again, the data can be pushed anywhere. Currently we need
to call a different IndexingJob (which uses a different writer) and change
the NutchConfiguration (to include the right IndexingFilters) in order to
push data to another sink. It would be great if Nutch could support
different target systems with one configuration (see the sketch further
down).

Mathijs

On Jul 11, 2012, at 13:34, Julien Nioche wrote:

> Hi Lewis
>
> I realise I was thinking about NUTCH-880, not NUTCH-932, which is indeed
> about retrieving crawl results as JSON.
>
>> From my own pov it appears that Nutch 2.X is 'closer' to the model
>> required for a multiple backends implementation, although there is
>> still quite a bit of work to do here.
>
> backend for crawl storage != target of the exporter/indexer
>
>> What I am slightly confused about, which hasn't been mentioned on this
>> particular issue, is whether individual Gora modules would make up part
>> of the stack or whether the abstraction would somehow be written @Nutch
>> side... of course this then gets a bit more tricky when we begin
>> thinking about current 1.X and how to progress with a suitable long
>> term vision.
>
> This is definitely on the Nutch side and applies in the same way for 1.x
> and 2.x. Think of it as a pluggable indexer: regardless of what backend
> is used for storing the crawl table, you might want to send some of the
> content (with possible transformations) to e.g. SOLR, ElasticSearch, a
> text file, a database, etc. At the moment we are limited to SOLR - which
> is OK, as most people use Nutch for indexing/searching - but the point is
> that we should have more flexibility. I have used the term 'pluggable
> indexer' before, as well as 'pluggable exporter'; I suppose the
> difference is whether we take care of finding which URLs should be
> deleted (indexer) or just dump a snapshot of the content (exporter).
>
> See comments on https://issues.apache.org/jira/browse/NUTCH-1047
>
>> On Wed, Jul 11, 2012 at 8:54 AM, Julien Nioche
>> <[email protected]> wrote:
>>> I'd think that this would be more a case for the universal exporter
>>> (a.k.a. multiple indexing backends) that we mentioned several times.
>>> The REST API is more a way of piloting a crawl remotely. It could
>>> certainly be twisted into doing all sorts of things, but I am not sure
>>> it would be very practical when dealing with very large data. Instead,
>>> having a pluggable exporter would allow you to define what backend you
>>> want to send the data to and what transformations to do on the way
>>> (e.g. convert to JSON). Alternatively, a good old custom MapReduce job
>>> is the way to go.
>>>
>>> HTH
>>>
>>> Jul
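A minimal sketch, in plain Java, of the split Mathijs describes and the
pluggable indexer/exporter Julien mentions above: the filter decides which
fields a given sink gets, the writer knows how to talk to that sink. Every
name below is a hypothetical illustration, not an actual Nutch API.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch only -- these names are not real Nutch APIs.
    public class PluggableExporterSketch {

        /** A document built from a WebPage: field name -> value. */
        public static class NutchDocument {
            public final Map<String, Object> fields =
                new HashMap<String, Object>();
        }

        /** Per-sink field selection (the IndexingFilter role). */
        public interface IndexingFilter {
            NutchDocument filter(NutchDocument doc, String url)
                throws IOException;
        }

        /** Per-sink transport (the pluggable indexer/exporter role). */
        public interface IndexWriter {
            void open(Map<String, String> conf) throws IOException; // connect to the sink
            void write(NutchDocument doc) throws IOException;       // add or update
            void delete(String url) throws IOException;             // indexer case: remove gone URLs
            void commit() throws IOException;
            void close() throws IOException;
        }

        /** Trivial "exporter" sink: dump every document to stdout. */
        public static class StdoutWriter implements IndexWriter {
            public void open(Map<String, String> conf) { }
            public void write(NutchDocument doc) { System.out.println(doc.fields); }
            public void delete(String url) { /* an exporter only snapshots */ }
            public void commit() { }
            public void close() { }
        }
    }

With one configuration mapping each writer to its own set of filters, a
single job could push data to several sinks at once - the single-config
support Mathijs asks for, and the indexer/exporter distinction Julien draws
(delete support vs. plain snapshot) reduces to whether delete() does
anything.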
>>> On 10 July 2012 22:42, Lewis John Mcgibbney
>>> <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking to create a dataset for use in an example scenario where
>>>> I want to create all the products you would typically find in the
>>>> online Amazon store, e.g. loads of products with different categories,
>>>> prices, titles, availability, condition, etc. One way I was thinking
>>>> of doing this was to use the above API written into Nutch 2.X to get
>>>> the results as JSON; these could then hopefully be loaded into my
>>>> product table in my datastore and we could begin to build up the
>>>> database of products.
>>>>
>>>> Having never used the REST API directly, I wonder if anyone has any
>>>> information on this and whether I can obtain some direction relating
>>>> to producing my crawl results as JSON. I'm also going to look into
>>>> Andrzej's patch in NUTCH-932, so I'll try to update this thread once
>>>> I make some progress with it.
>>>>
>>>> Thanks in advance for any sharing of experiences with this one.
>>>>
>>>> Best
>>>> Lewis
>>>>
>>>> --
>>>> Lewis
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>
>> --
>> Lewis
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
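For Lewis's JSON question, the "good old custom MapReduce job" route Julien
mentions might look roughly like the sketch below. It assumes (url, content)
pairs already sit in a SequenceFile of Text/Text; reading the actual Nutch
2.x WebPage store would go through Gora, which is elided here, and all class
names are made up for illustration.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Map-only job: one JSON object per input record, one per output line.
    public class JsonDumpJob {

        public static class JsonMapper
                extends Mapper<Text, Text, NullWritable, Text> {
            @Override
            protected void map(Text url, Text content, Context ctx)
                    throws IOException, InterruptedException {
                // Very naive JSON escaping -- a real job would use a JSON
                // library and also handle newlines and control characters.
                String json = String.format(
                        "{\"url\":\"%s\",\"content\":\"%s\"}",
                        escape(url.toString()), escape(content.toString()));
                ctx.write(NullWritable.get(), new Text(json));
            }

            private static String escape(String s) {
                return s.replace("\\", "\\\\").replace("\"", "\\\"");
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "json-dump");
            job.setJarByClass(JsonDumpJob.class);
            job.setMapperClass(JsonMapper.class);
            job.setNumReduceTasks(0);                  // map-only snapshot
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Each output line is then one JSON object, which could be bulk-loaded into
the product table Lewis describes.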

