Hi

Thanks for your comments. This confirms, if need be, that
https://issues.apache.org/jira/browse/NUTCH-1047 would be a useful thing to
have.
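For anyone following along, the kind of pluggable writer NUTCH-1047 asks for could be sketched roughly as below. The names are illustrative only (this is not the actual Nutch API): a single indexing pass fans each document out to whatever writers the configuration registers, instead of being hard-wired to one sink.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical writer abstraction: each sink (Solr, a file, ...) would
// implement this, and the IndexingJob would not care which is behind it.
interface IndexWriter {
    void write(Map<String, String> doc);
}

// Toy in-memory sink standing in for a real backend.
class InMemoryWriter implements IndexWriter {
    final List<Map<String, String>> docs = new ArrayList<>();
    public void write(Map<String, String> doc) { docs.add(doc); }
}

public class PluggableIndexerSketch {
    public static void main(String[] args) {
        // Several writers registered under one configuration.
        List<IndexWriter> writers = new ArrayList<>();
        InMemoryWriter solrLike = new InMemoryWriter();
        InMemoryWriter fileLike = new InMemoryWriter();
        writers.add(solrLike);
        writers.add(fileLike);

        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://example.com/");
        doc.put("title", "Example");

        // One indexing pass fans the document out to every sink.
        for (IndexWriter w : writers) w.write(doc);

        System.out.println(solrLike.docs.size() + " " + fileLike.docs.size());
        // prints "1 1"
    }
}
```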

J

On 11 July 2012 13:22, Mathijs Homminga <[email protected]> wrote:

> Hi Julian,
>
> Just to share our experiences with using Nutch 2.0:
>
> Indexing in Nutch actually has nothing to do with indexing itself. It just
> selects some fields from a WebPage, does some very minimal processing (both
> typically in the indexing filter plugins) and sends the result to a writer.
> What I notice is that we tend to develop IndexingFilter/IndexingWriter
> combinations for exporting/pushing data to different external systems
> (Solr, Elasticsearch, ...) because not only do these systems use different
> formats/interfaces (handled by the IndexingWriter), but they may also
> support different use cases, and thus require different fields (done in
> the IndexingFilter).
>
> Since indexing is the obvious use case here, I can understand the naming
> of this process, but again, the data can be pushed anywhere.
>
> Currently, we need to call a different IndexingJob (which uses a different
> Writer) and change the NutchConfiguration (to include the right
> IndexingFilters) to push data to another sink. It would be great if Nutch
> could support different target systems with one configuration.
>
> Mathijs
>
>
>
>
>
> On Jul 11, 2012, at 13:34 , Julien Nioche wrote:
>
> > Hi Lewis
> >
> > I realise I was thinking about NUTCH-880, not NUTCH-932, which is indeed
> > about retrieving crawl results as JSON.
> >
> >
> >> From my own pov it appears that Nutch 2.X is 'closer' to the model
> >> required for a multiple backends implementation although there is
> >> still quite a bit of work to do here.
> >
> >
> > backend for crawl storage != target of the exporter/indexer
> >
> >
> >> What I am slightly confused
> >> about, which hasn't been mentioned on this particular issue, is whether
> >> individual Gora modules would make up part of the stack or whether the
> >> abstraction would somehow be written @Nutch side... of course this
> >> then gets a bit more tricky when we begin thinking about current 1.X
> >> and how to progress with a suitable long-term vision.
> >>
> >
> > this is definitely on the Nutch side and applies in the same way for 1.x
> > and 2.x. Think about it as a pluggable indexer: regardless of what
> > backend is used for storing the crawl table, you might want to send some
> > of the content (with possible transformations) to e.g. SOLR,
> > ElasticSearch, a text file, a database, etc. At the moment we are limited
> > to SOLR - which is OK as most people use Nutch for indexing / searching -
> > but the point is that we should have more flexibility. I have used the
> > term 'pluggable indexer' before, as well as 'pluggable exporter'; I
> > suppose the difference is whether we take care of finding which URLs
> > should be deleted (indexer) or just dump a snapshot of the content
> > (exporter).
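To illustrate the "dump a snapshot of the content" side of that, an exporter's per-record transformation could look roughly like this. This is illustrative code only, not the actual Nutch API: it takes a few fields from a flat crawl record and emits them as a JSON line, regardless of which backend held the record.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExporterSketch {
    // Minimal JSON serialisation for flat, string-valued records.
    // A real exporter would use a proper JSON library.
    static String toJson(Map<String, String> record) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : record.entrySet()) {
            if (!first) sb.append(",");
            sb.append("\"").append(e.getKey()).append("\":\"")
              .append(e.getValue().replace("\"", "\\\"")).append("\"");
            first = false;
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        // A record standing in for a row of the crawl table.
        Map<String, String> page = new LinkedHashMap<>();
        page.put("url", "http://example.com/");
        page.put("title", "Example");
        System.out.println(toJson(page));
        // prints {"url":"http://example.com/","title":"Example"}
    }
}
```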
> >
> > See comments on https://issues.apache.org/jira/browse/NUTCH-1047
> >
> >
> >> On Wed, Jul 11, 2012 at 8:54 AM, Julien Nioche
> >> <[email protected]> wrote:
> >>> I'd think that this would be more a case for the universal exporter
> >>> (a.k.a. multiple indexing backends) that we mentioned several times.
> >>> The REST API is more a way of piloting a crawl remotely. It could
> >>> certainly be twisted into doing all sorts of things but I am not sure
> >>> it would be very practical when dealing with very large data. Instead,
> >>> having a pluggable exporter would allow you to define which backend you
> >>> want to send the data to and what transformations to do on the way
> >>> (e.g. convert to JSON). Alternatively, a good old custom MapReduce job
> >>> is the way to go.
> >>>
> >>> HTH
> >>>
> >>> Jul
> >>>
> >>> On 10 July 2012 22:42, Lewis John Mcgibbney <[email protected]
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I am looking to create a dataset for use in an example scenario where
> >>>> I want to create all the products you would typically find in the
> >>>> online Amazon store, e.g. loads of products with different categories,
> >>>> different prices, titles, availability, condition, etc. One way I was
> >>>> thinking of doing this was using the above API written into Nutch 2.X
> >>>> to get the results as JSON; these could then hopefully be loaded into
> >>>> my product table in my datastore and we could begin to build up the
> >>>> database of products.
> >>>>
> >>>> Having never used the REST API directly, I wonder if anyone has any
> >>>> information on this and whether I can obtain some direction relating
> >>>> to producing my crawl results as JSON. I'm also going to look into
> >>>> Andrzej's patch in NUTCH-932, so I'll try to update this thread once
> >>>> I make some progress with it.
> >>>>
> >>>> Thanks in advance for any sharing of experiences with this one.
> >>>>
> >>>> Best
> >>>> Lewis
> >>>>
> >>>> --
> >>>> Lewis
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Open Source Solutions for Text Engineering
> >>>
> >>> http://digitalpebble.blogspot.com/
> >>> http://www.digitalpebble.com
> >>> http://twitter.com/digitalpebble
> >>
> >>
> >>
> >> --
> >> Lewis
> >>
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
