I was once tasked with doing something like this. We never built it, but
my idea was to use a JDBC-style URL that paths out to the objects in the
database.
The two ideas we had were the following:
1. jdbc://dbserver.domain.com/tableName/schemaName/tableName
2. jdbc://dbserver.domain.com/${SQLQuery}
Then you write a protocol handler for this scheme. Given a URL of the
first form, it generates outlinks, one per row, where each URL is
derived from that row's primary key. For the second form, you'd have to
handle that by assembling the appropriate primary keys yourself; we
didn't finish that part.
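
To make the first form concrete, here is a minimal sketch of the outlink
step only. It is not Nutch API and not what we actually wrote: the class
and method names (RowOutlinks, rowUrls) are made up for illustration, and
it assumes the primary-key values have already been read from the table.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: turns a form-one table URL plus
// the table's primary-key values into one outlink URL per row.
final class RowOutlinks {

    static List<String> rowUrls(String tableUrl, List<String> primaryKeys) {
        // Drop a trailing slash so we never emit "table//key".
        String base = tableUrl.endsWith("/")
                ? tableUrl.substring(0, tableUrl.length() - 1)
                : tableUrl;
        List<String> urls = new ArrayList<>();
        for (String key : primaryKeys) {
            // Form-style encoding (a space becomes '+', a slash becomes %2F),
            // so each key stays inside a single path segment.
            urls.add(base + "/" + URLEncoder.encode(key, StandardCharsets.UTF_8));
        }
        return urls;
    }
}
```

A real handler would get the keys by querying the table over JDBC and
would hand the resulting URLs to Nutch as outlinks; both of those steps
are left out here.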
That is, at least, if you are searching for 'rows'. We primarily wanted
to integrate Nutch into a federated search system, where it could
search both the public web and a local document store, so we didn't get
too concerned about in/out links or the like. There is nothing to stop
you from following the foreign keys and treating them like links,
though. The problem is that doing so will likely lead to nasty scoring
skew.
So if an internal row had text that matched the core values, it was
imported and would likely be found. The big problem with that is
ensuring that lookup tables get applied. In a normalized database, the
author's name likely isn't attached to a document; it is referenced via
a foreign key. Searching by author became problematic: 'John Doe' and
'1234' were associated in the author table, and '1234' and the full text
of the document were associated, but 'John Doe' and the full text
weren't. That's where the second idea came from: we'd enumerate queries
that were representative of what we thought the results from querying
the database should be.
Kirby
On Tue, Jul 5, 2011 at 5:44 PM, lewis john mcgibbney
<[email protected]> wrote:
> Hi,
>
> I'm curious to hear if anyone has information for configuring Nutch to crawl
> a RDB such as MySQL. In my hypothetical example there are N number of
> databases residing in various distributed geographical locations, to make a
> worst case scenario, say that they are NOT all the same type, and I wish to
> use Nutch trunk 2.0 to push the results to some other structured data store
> which I can then connect to to serve search results.
>
> Does anyone have any information such as an overview of database crawling
> and serving using Nutch? I have been unsuccessful obtaining info on the Web
> as query results are ambiguous and usually refer to crawldb or linkdb.
>
> If I can get this it would be a real nice entry for inclusion in our wiki.
>
> Thanks for any suggestions or info.
>
> --
> *Lewis*
>