Thanks Markus, this sounds like a good starting point.
Yes, I'll probably want to delete older posts.
On 21/10/10 18:09, Markus Jelsma wrote:
Fetch and parse the feeds and store the newly discovered URLs in the CrawlDB.
Then generate a new fetch list, then fetch, parse, and index the most recent
item.
The remaining problem is how to know which is the most recent. Maybe you
should create a plugin that will only add the most recent URL to the CrawlDB.
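For the recency check itself: if the feeds are plain RSS 2.0, comparing the
items' pubDate values is enough. Below is a rough standalone sketch of that
comparison, using only JDK XML and java.time APIs. The class and method names
are made up for illustration, and Atom feeds are not handled.

import java.io.InputStream;
import java.net.URL;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: given an RSS 2.0 feed URL, return the <link>
 *  of the <item> with the newest <pubDate>. */
public class NewestItemFinder {

    public static String newestItemLink(String feedUrl) throws Exception {
        try (InputStream in = new URL(feedUrl).openStream()) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            NodeList items = doc.getElementsByTagName("item");

            String newestLink = null;
            ZonedDateTime newestDate = null;

            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String pubDate = textOf(item, "pubDate");
                String link = textOf(item, "link");
                if (pubDate == null || link == null) continue;

                // RSS pubDate uses the RFC 1123 format, e.g.
                // "Thu, 21 Oct 2010 17:45:00 GMT". Skip items whose
                // date does not parse instead of failing the feed.
                ZonedDateTime date;
                try {
                    date = ZonedDateTime.parse(
                            pubDate, DateTimeFormatter.RFC_1123_DATE_TIME);
                } catch (java.time.format.DateTimeParseException e) {
                    continue;
                }
                if (newestDate == null || date.isAfter(newestDate)) {
                    newestDate = date;
                    newestLink = link;
                }
            }
            return newestLink;
        }
    }

    // Text content of the first child element with the given tag, or null.
    private static String textOf(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null
                : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(newestItemLink(args[0]));
    }
}

Inside Nutch, the same comparison would presumably live in the parse-time
plugin suggested above, so that only the newest link from each feed ends up
in the CrawlDB.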
Hi Robert,
True, but then the problem is how to get a corpus of feeds. I thought I
had no choice, since I don't have a list of feeds to fetch.
On 21/10/10 17:49, Robert Douglass wrote:
Don't most blogs have RSS feeds these days? Sounds like you'd save a lot
of trouble by using a feed reader instead.
On 10/21/2010 05:45 PM, Alberto wrote:
Hi everybody,
I'm using Nutch to analyze trends in the blogosphere. That's why I'm
only interested in the last post of every blog I crawl. The problem is
that if I want to do a good crawl I need to crawl the URLs of the
entire blog, not just the last post. But if I do this, then I'll have a
c