Hi Julian,

Just to share our experiences with using Nutch 2.0:

Indexing in Nutch actually does no indexing itself. It just selects some
fields from a WebPage, does some very minimal processing (both typically in
the indexing filter plugins) and sends the result to a writer. What I notice
is that we tend to develop IndexingFilter/IndexingWriter combinations for
exporting/pushing data to different external systems (Solr, Elasticsearch,
...), because not only do these systems use different formats/interfaces
(handled by the IndexingWriter), they may also support different use cases
and thus require different fields (handled by the IndexingFilter).
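
To make the split concrete, here is a minimal sketch of such a filter
against the Nutch 2.x plugin API (signatures quoted from memory, so treat
them as approximate; the title field is just an example):

    import java.util.Collection;
    import java.util.HashSet;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.storage.WebPage;

    // Sketch: copy one field off the WebPage onto the NutchDocument.
    // The IndexingWriter, not this class, decides where the document goes.
    public class TitleIndexingFilter implements IndexingFilter {

      private Configuration conf;

      @Override
      public NutchDocument filter(NutchDocument doc, String url, WebPage page)
          throws IndexingException {
        if (page.getTitle() != null) {
          doc.add("title", page.getTitle().toString());
        }
        return doc;
      }

      @Override
      public Collection<WebPage.Field> getFields() {
        // Declare which Gora fields must be loaded for this filter to run.
        Collection<WebPage.Field> fields = new HashSet<WebPage.Field>();
        fields.add(WebPage.Field.TITLE);
        return fields;
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }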

Since indexing is the obvious use case here, I can understand the naming of 
this process, but again, the data can be pushed anywhere.

Currently, we need to call a different IndexingJob (which uses a different
Writer) and change the NutchConfiguration (to include the right
IndexingFilters) to push data to another sink. It would be great if Nutch
could support different target systems with one configuration.
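
For instance, each target system could implement one writer contract and be
listed in a single configuration, with the IndexingJob fanning each document
out to all of them. The interface below is purely hypothetical (names
invented for illustration), not an existing Nutch API:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.nutch.indexer.NutchDocument;

    // Hypothetical contract for a pluggable sink. An "indexer" backend
    // would implement delete() for gone URLs; a plain "exporter" backend
    // could leave it as a no-op.
    public interface SinkWriter extends Configurable {
      void open() throws IOException;                   // connect to the sink
      void write(NutchDocument doc) throws IOException; // add or update
      void delete(String key) throws IOException;       // remove a gone URL
      void commit() throws IOException;
      void close() throws IOException;
    }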

Mathijs

On Jul 11, 2012, at 13:34, Julien Nioche wrote:

> Hi Lewis
> 
> I realise I was thinking about NUTCH-880, not NUTCH-932, which is indeed
> about retrieving crawl results as JSON
> 
> 
>> From my own pov it appears that Nutch 2.X is 'closer' to the model
>> required for a multiple backends implementation although there is
>> still quite a bit of work to do here.
> 
> 
> backend for crawl storage != target of the exporter/indexer
> 
> 
>> What I am slightly confused
>> about, which hasn't been mentioned on this particular issue, is whether
>> individual Gora modules would make up part of the stack or whether the
>> abstraction would somehow be written @Nutch side... of course this
>> then gets a bit more tricky when we begin thinking about current 1.X
>> and how to progress with a suitable long term vision.
>> 
> 
> this is definitely on the Nutch side and applies in the same way for 1.x
> and 2.x. Think about it as a pluggable indexer: regardless of what backend
> is used for storing the crawl table, you might want to send some of the
> content (with possible transformations) to e.g. SOLR, ElasticSearch, a
> text file, a database, etc. At the moment we are limited to SOLR - which
> is OK as most people use Nutch for indexing / searching - but the point
> is that we should have more flexibility. I have used the terms 'pluggable
> indexer' as well as 'pluggable exporter' before; I suppose the difference
> is whether we take care of finding which URLs should be deleted (indexer)
> or just dump a snapshot of the content (exporter).
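>
> As a concrete illustration of the exporter case, a snapshot writer for the
> text file target could look roughly like this (a hypothetical sketch with
> invented names, not an existing Nutch class):
>
>     import java.io.IOException;
>     import java.io.PrintWriter;
>
>     // Exporter flavour: no deletes, just a snapshot of the content as
>     // one JSON object per line.
>     public class JsonFileExporter {
>
>       private final PrintWriter out;
>
>       public JsonFileExporter(String path) throws IOException {
>         this.out = new PrintWriter(path, "UTF-8");
>       }
>
>       public void write(String url, String title) {
>         // A real implementation would use a JSON library to escape values.
>         out.printf("{\"url\": \"%s\", \"title\": \"%s\"}%n", url, title);
>       }
>
>       public void close() {
>         out.close();
>       }
>     }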
> 
> See comments on https://issues.apache.org/jira/browse/NUTCH-1047
> 
> 
>> On Wed, Jul 11, 2012 at 8:54 AM, Julien Nioche
>> <[email protected]> wrote:
>>> I'd think that this would be more a case for the universal exporter
>>> (a.k.a. multiple indexing backends) that we mentioned several times.
>>> The REST API
>>> is more a way of piloting a crawl remotely. It could certainly be twisted
>>> into doing all sorts of things but I am not sure it would be very
>>> practical when dealing with very large data. Instead, having a pluggable
>>> exporter would allow you to define what backend you want to send the data
>>> to and what transformations to do on the way (e.g. convert to JSON).
>>> Alternatively, a good old custom map reduce job is the way to go.
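>>>
>>> A rough outline of such a job, assuming the Gora mapper API of the 2.x
>>> line (an untested sketch; a real driver would still have to set up the
>>> Gora query and the output format):
>>>
>>>     import java.io.IOException;
>>>
>>>     import org.apache.gora.mapreduce.GoraMapper;
>>>     import org.apache.hadoop.io.NullWritable;
>>>     import org.apache.hadoop.io.Text;
>>>     import org.apache.nutch.storage.WebPage;
>>>
>>>     // Read each WebPage row straight from the crawl storage and emit
>>>     // one JSON-ish line per URL; values are not escaped here.
>>>     public class JsonExportMapper
>>>         extends GoraMapper<String, WebPage, Text, NullWritable> {
>>>
>>>       @Override
>>>       protected void map(String key, WebPage page, Context context)
>>>           throws IOException, InterruptedException {
>>>         String title =
>>>             page.getTitle() == null ? "" : page.getTitle().toString();
>>>         String json =
>>>             "{\"url\": \"" + key + "\", \"title\": \"" + title + "\"}";
>>>         context.write(new Text(json), NullWritable.get());
>>>       }
>>>     }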
>>> 
>>> HTH
>>> 
>>> Jul
>>> 
>>> On 10 July 2012 22:42, Lewis John Mcgibbney <[email protected]>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I am looking to create a dataset for use in an example scenario where
>>>> I want to create all the products you would typically find in the
>>>> online Amazon store, e.g. loads of products with different categories,
>>>> prices, titles, availability, condition, etc. One way I was thinking
>>>> of doing this was using the above API written into Nutch 2.X to get
>>>> the results as JSON; these could then hopefully be loaded into my
>>>> product table in my datastore and we could begin to build up the
>>>> database of products.
>>>> 
>>>> Having never used the REST API directly, I wonder if anyone has any
>>>> information on this and whether I can obtain some direction relating
>>>> to producing my crawl results as JSON. I'm also going to look into
>>>> Andrzej's patch in NUTCH-932, so I'll try to update this thread
>>>> once I make some progress with it.
>>>> 
>>>> Thanks in advance for sharing any experiences with this one.
>>>> 
>>>> Best
>>>> Lewis
>>>> 
>>>> --
>>>> Lewis
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Open Source Solutions for Text Engineering
>>> 
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>> 
>> 
>> 
>> --
>> Lewis
>> 
> 
> 
> 
> -- 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
