Hi,

Thanks for your comments. This confirms, if need be, that https://issues.apache.org/jira/browse/NUTCH-1047 would be a useful thing to have.
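To make it concrete, below is a very rough sketch of the shape such a pluggable writer could take. Every name in it is hypothetical - this is not the NUTCH-1047 design, just an illustration of the contract we keep describing:

    import java.io.IOException;
    import java.util.Map;

    /** One unit of crawl output: a URL plus the fields selected by the filters. */
    class ExportDoc {
        final String url;
        final Map<String, String> fields;

        ExportDoc(String url, Map<String, String> fields) {
            this.url = url;
            this.fields = fields;
        }
    }

    /** Target-agnostic sink: SOLR, ElasticSearch, a flat file, a database... */
    interface ExportWriter {
        void open(Map<String, String> config) throws IOException; // connect to the backend
        void write(ExportDoc doc) throws IOException;              // add or update a document
        void delete(String url) throws IOException;                // used by 'indexer'-style jobs
        void close() throws IOException;                           // flush and release resources
    }

The point is that open/write/delete/close is all a target needs to expose; whether the job driving it behaves as an indexer (tracks deletions) or as an exporter (dumps a snapshot) then becomes independent of the backend.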
J

On 11 July 2012 13:22, Mathijs Homminga <[email protected]> wrote:

> Hi Julien,
>
> Just to share our experiences with using Nutch 2.0:
>
> Indexing in Nutch actually has nothing to do with indexing itself. It just
> selects some fields from a WebPage, does some very minimal processing (both
> typically in the indexing filter plugins) and sends the result to a writer.
> What I notice is that we tend to develop IndexingFilter/IndexingWriter
> combinations for exporting/pushing data to different external systems
> (Solr, ElasticSearch, ...), because not only do these systems use different
> formats/interfaces (handled by the IndexingWriter), they may also support
> different use cases, and thus may require different fields (done in the
> IndexingFilter).
>
> Since indexing is the obvious use case here, I can understand the naming
> of this process, but again, the data can be pushed anywhere.
>
> Currently, we need to call a different IndexingJob (which uses a different
> Writer) and change the NutchConfiguration (to include the right
> IndexingFilters) to push data to another sink. It would be great if Nutch
> could support different target systems with one configuration.
>
> Mathijs
>
> On Jul 11, 2012, at 13:34, Julien Nioche wrote:
>
>> Hi Lewis
>>
>> I realise I was thinking about NUTCH-880, not NUTCH-932, which is indeed
>> about retrieving crawl results as JSON.
>>
>>> From my own pov it appears that Nutch 2.X is 'closer' to the model
>>> required for a multiple backends implementation although there is
>>> still quite a bit of work to do here.
>>
>> backend for crawl storage != target of the exporter/indexer
>>
>>> What I am slightly confused about, which hasn't been mentioned on this
>>> particular issue, is whether individual Gora modules would make up part
>>> of the stack or whether the abstraction would somehow be written @Nutch
>>> side... of course this then gets a bit more tricky when we begin
>>> thinking about current 1.X and how to progress with a suitable long
>>> term vision.
>>
>> This is definitely on the Nutch side and applies in the same way to 1.x
>> and 2.x. Think of it as a pluggable indexer: regardless of which backend
>> is used for storing the crawl table, you might want to send some of the
>> content (with possible transformations) to e.g. SOLR, ElasticSearch, a
>> text file, a database, etc. At the moment we are limited to SOLR - which
>> is OK, as most people use Nutch for indexing/searching - but the point is
>> that we should have more flexibility. I have used the term 'pluggable
>> indexer' before, as well as 'pluggable exporter'; I suppose the
>> difference is whether we take care of finding which URLs should be
>> deleted (indexer) or just dump a snapshot of the content (exporter).
>>
>> See comments on https://issues.apache.org/jira/browse/NUTCH-1047
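To illustrate the exporter half of that distinction, here is a minimal - again entirely hypothetical - implementation of the ExportWriter sketched at the top of this mail, covering the 'text file' target: one JSON object per line. The "path" config key and the hand-rolled escaping are purely illustrative; a real version would use a proper JSON library.

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Map;

    /** Hypothetical "text file" target: one JSON object per line. */
    class JsonFileWriter implements ExportWriter {
        private BufferedWriter out;

        public void open(Map<String, String> config) throws IOException {
            // "path" is an illustrative config key, not an actual Nutch property
            String path = config.containsKey("path") ? config.get("path") : "export.json";
            out = new BufferedWriter(new FileWriter(path));
        }

        public void write(ExportDoc doc) throws IOException {
            StringBuilder sb = new StringBuilder("{\"url\":\"").append(esc(doc.url)).append('"');
            for (Map.Entry<String, String> e : doc.fields.entrySet()) {
                sb.append(",\"").append(esc(e.getKey()))
                  .append("\":\"").append(esc(e.getValue())).append('"');
            }
            out.write(sb.append('}').toString());
            out.newLine();
        }

        public void delete(String url) {
            // a snapshot exporter has nothing to delete
        }

        public void close() throws IOException {
            out.close();
        }

        // minimal escaping, for illustration only
        private static String esc(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"");
        }
    }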
>>> On Wed, Jul 11, 2012 at 8:54 AM, Julien Nioche
>>> <[email protected]> wrote:
>>>
>>>> I'd think that this would be more a case for the universal exporter
>>>> (a.k.a. multiple indexing backends) that we mentioned several times.
>>>> The REST API is more a way of piloting a crawl remotely. It could
>>>> certainly be twisted into doing all sorts of things, but I am not sure
>>>> it would be very practical when dealing with very large data. Instead,
>>>> having a pluggable exporter would allow you to define which backend you
>>>> want to send the data to and what transformations to do on the way
>>>> (e.g. convert to JSON). Alternatively, a good old custom MapReduce job
>>>> is the way to go.
>>>>
>>>> HTH
>>>>
>>>> Jul
>>>>
>>>> On 10 July 2012 22:42, Lewis John Mcgibbney <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am looking to create a dataset for use in an example scenario where
>>>>> I want to create all the products you would typically find in the
>>>>> online Amazon store, e.g. loads of products with different categories,
>>>>> prices, titles, availability, condition, etc. One way I was thinking
>>>>> of doing this was to use the above API written into Nutch 2.X to get
>>>>> the results as JSON; these could then hopefully be loaded into my
>>>>> product table in my datastore and we could begin to build up the
>>>>> database of products.
>>>>>
>>>>> Having never used the REST API directly, I wonder if anyone has any
>>>>> information on this and whether I can obtain some direction relating
>>>>> to producing my crawl results as JSON. I'm also going to look into
>>>>> Andrzej's patch in NUTCH-932, and I'll try to update this thread once
>>>>> I make some progress with it.
>>>>>
>>>>> Thanks in advance for any sharing of experiences with this one.
>>>>>
>>>>> Best
>>>>> Lewis

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
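PS for Lewis: if you do go for the custom MapReduce job mentioned above, a JSON dump really is a small amount of code. A rough sketch follows - it assumes, purely for brevity, that the input is a SequenceFile of <Text url, Text content> pairs; with real crawl data you would read Nutch's Content or WebPage records instead, so the value type and field extraction would differ.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class JsonExportJob {

        /** Map-only job: each record becomes one JSON object on its own line. */
        static class JsonMapper extends Mapper<Text, Text, NullWritable, Text> {
            private final Text json = new Text();

            @Override
            protected void map(Text url, Text content, Context ctx)
                    throws IOException, InterruptedException {
                json.set("{\"url\":\"" + esc(url.toString())
                        + "\",\"content\":\"" + esc(content.toString()) + "\"}");
                // NullWritable key => TextOutputFormat writes the value only
                ctx.write(NullWritable.get(), json);
            }

            private static String esc(String s) {
                return s.replace("\\", "\\\\").replace("\"", "\\\"");
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "json-export");
            job.setJarByClass(JsonExportJob.class);
            job.setMapperClass(JsonMapper.class);
            job.setNumReduceTasks(0); // no reduce: this is just a format conversion
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Run it with 'hadoop jar', giving the input and output paths as arguments; each output part file then holds one JSON object per line, ready to load into your product table.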

