Yep, GORA will be a huge boon. One future problem to be dealt with is going to be versioning many iterations of the same content. My application of Nutch is geared more towards compliance, and is less interested in the analytics beyond high-level statistics and drilling down into web content.

For my purposes, there will be a growing need to compare many iterations of the same (or similar) content, over a given time period. More than that, retaining original data is very important (and is something I already do, in Amazon's S3).

I'm really interested in GORA, and wish I had more time to really look at it and contribute. Such is life :(

Scott

On Tue, Oct 26, 2010 at 2:15 PM, Andrzej Bialecki <[email protected]> wrote:
> On 2010-10-26 22:39, Scott Gonyea wrote:
>> I love relational databases, but their many features are (in my
>> opinion) wasted on what you find in Nutch. Row-locking and
>> transactional integrity is great for lots of applications, but becomes
>> a whole lot of overhead when it's of next-to-no-value to whatever
>> you're doing.
>>
>> RE: counting URLs: Have you looked at Solr's facets, etc? I use them
>> like they're going out of style--and it's very powerful.
>>
>> For my application, Solr *is* my database. Nutch crawls data, stores
>
> .. then you may be interested in the upcoming Gora feature:
> http://issues.apache.org/jira/browse/GORA-9 . When this is committed you
> will be able to keep all your data in Solr.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com

