Re: Crawl the whole blog, but store just the last post

2010-10-22 Thread Alberto
Thanks Markus, this sounds like a good starting point. Yes, I'll probably want to delete older posts. On 21/10/10 18:09, Markus Jelsma wrote: Fetch and parse the feeds and store the newly discovered URLs in the CrawlDB. Then generate a new fetch list, fetch and parse and index the most recent item.

Re: Crawl the whole blog, but store just the last post

2010-10-21 Thread Markus Jelsma
Fetch and parse the feeds and store the newly discovered URLs in the CrawlDB. Then generate a new fetch list, fetch and parse and index the most recent item. The remaining problem is how to know which is the most recent. Maybe you should create a plugin that will only add the most recent URL t
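Deciding "which is the most recent" usually comes down to comparing the items' `pubDate` values. A minimal standalone Python sketch of that comparison follows; it is not Nutch plugin code, and the feed contents and URLs are invented for illustration:

```python
# Sketch: pick the most recent item from an RSS feed using only the
# standard library. The feed below is a made-up example.
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item><link>http://example.com/post-1</link>
        <pubDate>Mon, 18 Oct 2010 09:00:00 +0000</pubDate></item>
  <item><link>http://example.com/post-2</link>
        <pubDate>Thu, 21 Oct 2010 12:30:00 +0000</pubDate></item>
</channel></rss>"""

def most_recent_link(rss_text):
    """Return the <link> of the item with the newest RFC-822 pubDate."""
    root = ET.fromstring(rss_text)
    items = root.findall(".//item")
    newest = max(items,
                 key=lambda i: parsedate_to_datetime(i.findtext("pubDate")))
    return newest.findtext("link")

print(most_recent_link(RSS))  # http://example.com/post-2
```

A Nutch plugin doing the same would apply this comparison per feed when deciding which discovered URL to admit to the fetch list.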

Re: Crawl the whole blog, but store just the last post

2010-10-21 Thread Alberto
Hi Robert, true, but then the problem is how to get a corpus of feeds. I thought I had no choice but to crawl, since I don't have a list of feeds to fetch. On 21/10/10 17:49, Robert Douglass wrote: Don't most blogs have RSS feeds these days? Sounds like you'd s
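One common answer to "how do I get a corpus of feeds" is feed autodiscovery: crawl the blog homepages once and read the `<link rel="alternate">` tags that most blogs publish. A simplified sketch using the standard library (the HTML and URL are invented, and it ignores details like case-variant `rel` values and relative URLs):

```python
# Sketch: feed autodiscovery -- extract RSS/Atom feed URLs from a page's
# <link rel="alternate" type="application/rss+xml"> tags.
from html.parser import HTMLParser

class FeedLinkFinder(HTMLParser):
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type") in self.FEED_TYPES):
            self.feeds.append(a["href"])

page = """<html><head>
<link rel="alternate" type="application/rss+xml"
      href="http://example.com/feed.xml">
</head><body>...</body></html>"""

finder = FeedLinkFinder()
finder.feed(page)
print(finder.feeds)  # ['http://example.com/feed.xml']
```

Run once over the crawled homepages, this turns a corpus of blogs into the list of feeds the thread assumes.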

Re: Crawl the whole blog, but store just the last post

2010-10-21 Thread Robert Douglass
Don't most blogs have RSS feeds these days? Sounds like you'd save a lot of trouble by using a feed reader instead. On 10/21/2010 05:45 PM, Alberto wrote: > Hi everybody, > > I'm using Nutch to analyze trends in the blogosphere. That's why I'm > only

Crawl the whole blog, but store just the last post

2010-10-21 Thread Alberto
Hi everybody, I'm using Nutch to analyze trends in the blogosphere. That's why I'm only interested in the last post of every blog I crawl. The problem is that if I want to do a good crawl I need to crawl the URLs of the entire blog, not just the last post. But if I do this, then I'll have a c