Re: Are there any web crawlers based on database?

Scott Gonyea Tue, 26 Oct 2010 14:48:27 -0700

Yep, GORA will be a huge boon.  One future problem to be dealt with is
going to be versioning many iterations of the same content.  My
application of Nutch is geared more towards compliance, and is less
interested in the analytics beyond high-level statistics and drilling
down into web content.


For my purposes, there will be a growing need to compare many
iterations of the same (or similar) content, over a given time period.
More than that, retaining original data is very important (and is
something I already do, in Amazon's S3).

I'm really interested in GORA, and wish I had more time to really look
at it and contribute.  Such is life :(

Scott

On Tue, Oct 26, 2010 at 2:15 PM, Andrzej Bialecki <[email protected]> wrote:
> On 2010-10-26 22:39, Scott Gonyea wrote:
>> I love relational databases, but their many features are (in my
>> opinion) wasted on what you find in Nutch.  Row-locking and
>> transactional integrity are great for lots of applications, but become
>> a whole lot of overhead when they're of next-to-no value to whatever
>> you're doing.
>>
>> RE: counting URLs:  Have you looked at Solr's facets, etc?  I use them
>> like they're going out of style--and it's very powerful.
>>
>> For my application, Solr *is* my database.  Nutch crawls data, stores
>
> .. then you may be interested in the upcoming Gora feature:
> http://issues.apache.org/jira/browse/GORA-9 . When this is committed you
> will be able to keep all your data in Solr.
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
