Hi Lewis

I realise I was thinking about NUTCH-880, not NUTCH-932, which is indeed
about retrieving crawl results as JSON.


> From my own pov it appears that Nutch 2.X is 'closer' to the model
> required for a multiple backends implementation although there is
> still quite a bit of work to do here.


backend for crawl storage != target of the exporter/indexer. In 2.x the
storage backend is picked via Gora (e.g. gora.datastore.default in
conf/gora.properties) whereas the indexing target is configured separately
(e.g. solr.server.url), so the two can vary independently.


> What I am slightly confused
> about, which hasn't been mentioned on this particular issue is whether
> individual Gora modules would make up part of the stack or whether the
> abstraction would somehow be written @Nutch side... of course this
> then gets a bit more tricky when we begin thinking about current 1.X
> and how to progress with a suitable long term vision.
>

this is definitely on the Nutch side and applies in the same way for 1.x
and 2.x. Think of it as a pluggable indexer: regardless of which backend is
used for storing the crawl table, you might want to send some of the
content (with possible transformations) to e.g. SOLR, ElasticSearch, a
text file, a database, etc. At the moment we are limited to SOLR, which is
OK as most people use Nutch for indexing / searching, but the point is
that we should have more flexibility. I have used the term 'pluggable
indexer' before, as well as 'pluggable exporter'; I suppose the difference
is whether we take care of finding which URLs should be deleted (indexer)
or just dump a snapshot of the content (exporter). A rough sketch of what
such a pluggable backend might look like is below.
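
To make this concrete, here is a minimal sketch of the kind of interface
the pluggable backend could implement on the Nutch side. NB this is purely
hypothetical -- the IndexBackend name and its methods are made up for
illustration, not an existing Nutch API:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.NutchDocument;

// Hypothetical sketch only -- not an existing Nutch interface.
public interface IndexBackend {
  // Open a connection to the target (SOLR, ElasticSearch, a file...).
  void open(Configuration conf) throws IOException;
  // Send one (possibly transformed) document to the target.
  void write(NutchDocument doc) throws IOException;
  // Indexer case only: remove a URL that is gone or redirected.
  void delete(String url) throws IOException;
  // Flush and release any resources.
  void close() throws IOException;
}

An exporter implementation would simply never have its delete() called,
which is what makes it the simpler of the two.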

See comments on https://issues.apache.org/jira/browse/NUTCH-1047
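
For Lewis's JSON question specifically, the 'good old custom map reduce
job' mentioned in the thread below could be a simple map-only job over a
segment's content directory, which stores <Text, Content> pairs. A rough
sketch -- the paths are illustrative and the JSON escaping is deliberately
naive:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.protocol.Content;

public class JsonExport {

  // Emits one JSON object per fetched URL, one per line.
  public static class ExportMapper extends MapReduceBase
      implements Mapper<Text, Content, NullWritable, Text> {
    public void map(Text url, Content content,
        OutputCollector<NullWritable, Text> output, Reporter reporter)
        throws IOException {
      // Naive escaping -- use a real JSON library in practice.
      String json = "{\"url\":\"" + url + "\",\"type\":\""
          + content.getContentType() + "\"}";
      output.collect(NullWritable.get(), new Text(json));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(new Configuration(), JsonExport.class);
    job.setJobName("json-export");
    // args[0]: segment content dir, e.g. crawl/segments/<timestamp>/content
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(ExportMapper.class);
    job.setNumReduceTasks(0); // map-only: no sorting / shuffle needed
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    JobClient.runJob(job);
  }
}

The same skeleton would load the records into a product table instead of a
text file by swapping the output format.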


> On Wed, Jul 11, 2012 at 8:54 AM, Julien Nioche <[email protected]> wrote:
> > I'd think that this would be more a case for the universal exporter
> > (a.k.a. multiple indexing backends) that we mentioned several times. The
> > REST API is more a way of piloting a crawl remotely. It could certainly
> > be twisted into doing all sorts of things but I am not sure it would be
> > very practical when dealing with very large data. Instead, having a
> > pluggable exporter would allow you to define what backend you want to
> > send the data to and what transformations to do on the way (e.g. convert
> > to JSON). Alternatively, a good old custom map reduce job is the way to
> > go.
> >
> > HTH
> >
> > Jul
> >
> > On 10 July 2012 22:42, Lewis John Mcgibbney <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I am looking to create a dataset for use in an example scenario where
> >> I want to create all the products you would typically find in the
> >> online Amazon store, e.g. loads of products with different categories,
> >> prices, titles, availability, condition, etc. One way I was thinking of
> >> doing this was using the above API written into Nutch 2.X to get the
> >> results as JSON; these could then hopefully be loaded into my product
> >> table in my datastore and we could begin to build up the database of
> >> products.
> >>
> >> Having never used the REST API directly, I wonder if anyone has any
> >> information on this and whether I can obtain some direction relating
> >> to producing my crawl results as JSON. I'm also going to look into
> >> Andrzej's patch in NUTCH-932, so I'll try to update this thread once I
> >> make some progress with it.
> >>
> >> Thanks in advance for any sharing of experiences with this one.
> >>
> >> Best
> >> Lewis
> >>
> >> --
> >> Lewis
> >>
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
>
>
>
> --
> Lewis
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
