Re: Crawling relation database

lewis john mcgibbney Tue, 05 Jul 2011 16:20:05 -0700

Thanks for this, Markus. It had occurred to me that DIH was a very plausible
solution to progress with, and I think you have just confirmed it, given the
flexibility it offers, amongst other attributes.

I'm looking at creating a context-aware web application which would use
geographical search to obtain results based on location. This is required as
the data will contain (amongst others) fields with integer values which vary
depending upon a building location cost index. Similarity is directly linked
through a geographical location factor. I wanted the data stored within the
N distributed RDBs available in a cloud environment to be searchable as a
whole, as opposed to the non-trivial task of searching across a fragmented,
distributed set of DBs.

As you mention, it does make more sense to save documents in a doc (or
column) oriented DB.
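
To make "flat" concrete for my own case, here is a minimal sketch of how
joined relational rows could be turned into one flat document per building
before indexing. The table and column names (building, location_cost,
cost_index) are purely hypothetical, and SQLite merely stands in for
whichever RDBMS is actually in use; this is not how DIH itself is
implemented.

```python
import sqlite3


def flatten_rows(conn):
    """Join the hypothetical tables and return one flat dict per building."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT b.id, b.name, b.lat, b.lon, c.cost_index "
        "FROM building b JOIN location_cost c ON c.building_id = b.id "
        "ORDER BY b.id"
    ).fetchall()
    # Each Row becomes a plain dict: a flat, document-like record.
    return [dict(row) for row in rows]


def demo():
    """Build a tiny in-memory database (assumed schema) and flatten it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE building (id INTEGER PRIMARY KEY, name TEXT,
                               lat REAL, lon REAL);
        CREATE TABLE location_cost (building_id INTEGER, cost_index INTEGER);
        INSERT INTO building VALUES (1, 'Office A', 55.86, -4.25);
        INSERT INTO location_cost VALUES (1, 104);
        """
    )
    return flatten_rows(conn)
```

Each dict returned by demo() carries the location and the cost index side
by side, which is roughly the shape a search index expects.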

Essentially, using the DIH tool would remove the requirement for Nutch?

I think to progress with this, I'm best moving the thread to Solr-user@ if I
have further questions.

Thank you
On Tue, Jul 5, 2011 at 3:53 PM, Markus Jelsma <[email protected]> wrote:

> Hi Lewis,
>
> It sounds to me like you'd be better off using Solr's very advanced
> DataImportHandler [1]. It can (delta) import data from your RDBMSs and
> offers much flexibility in how to transform entities.
>
> Besides crawling, you also mention you'd like to push results (of what?) to
> another structured data store. But why would you want that? Handling,
> processing and serving search results is done by Solr (and ES in the
> future), and since our entities are flat (just a document) it makes more
> sense to me to save documents in a document (or column) oriented DB.
>
> [1]: http://wiki.apache.org/solr/DataImportHandler
>
> Cheers,
>
> > Hi,
> >
> > I'm curious to hear if anyone has information on configuring Nutch to
> > crawl an RDB such as MySQL. In my hypothetical example there are N
> > databases residing in various distributed geographical locations; to
> > make it a worst-case scenario, say that they are NOT all the same type,
> > and I wish to use Nutch trunk 2.0 to push the results to some other
> > structured data store which I can then connect to in order to serve
> > search results.
> >
> > Does anyone have any information such as an overview of database crawling
> > and serving using Nutch? I have been unsuccessful obtaining info on the
> > Web, as query results are ambiguous and usually refer to crawldb or
> > linkdb.
> >
> > If I can get this it would be a really nice entry for inclusion in our
> > wiki.
> >
> > Thanks for any suggestions or info.
>

-- 
*Lewis*
