Fetch and parse the feeds and store the newly discovered URLs in the CrawlDB. Then generate a new fetch list, and fetch, parse and index only the most recent item.
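Picking "the most recent item" depends on being able to order the feed entries by date. Here is a minimal sketch of that step outside Nutch itself (plain Python, standard library only), assuming an RSS 2.0 feed whose items carry a `pubDate`. The function name `most_recent_item` and the sample feed are made up for illustration; a Nutch plugin would need the equivalent logic in Java.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feed, only for illustration.
SAMPLE = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <item><title>Older</title><link>http://blog.example/older</link>
    <pubDate>Wed, 20 Oct 2010 09:00:00 GMT</pubDate></item>
  <item><title>Newer</title><link>http://blog.example/newer</link>
    <pubDate>Thu, 21 Oct 2010 17:45:00 +0200</pubDate></item>
</channel></rss>"""


def most_recent_item(rss_xml):
    """Return (link, published) for the newest dated <item>, or None."""
    root = ET.fromstring(rss_xml)
    newest = None
    for item in root.iter("item"):
        link = item.findtext("link")
        pub = item.findtext("pubDate")
        if not link or not pub:
            continue  # no link or no date: cannot be ranked
        try:
            published = parsedate_to_datetime(pub.strip())
        except (TypeError, ValueError):
            continue  # unparseable date: skip the item
        if published.tzinfo is None:
            # Treat dates without a usable offset as UTC so that
            # all values can be compared with each other.
            published = published.replace(tzinfo=timezone.utc)
        if newest is None or published > newest[1]:
            newest = (link.strip(), published)
    return newest


print(most_recent_item(SAMPLE))
```

Items without a parseable `pubDate` are simply skipped in this sketch, and Atom feeds (which use `<entry>` and `<updated>`) are not handled at all, so undated feeds still need some other rule.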
The remaining problem is how to know which item is the most recent. Maybe you should create a plugin that adds only the most recent URL to the CrawlDB in the first place. Then there is the issue of revisiting: there will be newer blog posts, so will you delete the earlier `most recent` posts then?

> Hi Robert,
>
> true, but then the problem is how to get a corpus of feeds? I thought I
> had no choice if I didn't have a list of feeds to fetch.
>
> On 21/10/10 17:49, Robert Douglass wrote:
> > Don't most blogs have RSS feeds these days? Sounds like you'd save a lot
> > of trouble by using a feed reader instead.
> >
> > On 10/21/2010 05:45 PM, Alberto wrote:
> >> Hi everybody,
> >>
> >> I'm using Nutch to analyze trends in the blogosphere. That's why I'm
> >> only interested in the last post of every blog I crawl. The problem is
> >> that if I want to do a good crawl I need to crawl the URLs of the
> >> entire blog, not just the last post. But if I do this, then I'll have a
> >> corpus of blogs with N posts each, instead of a corpus of blogs with 1
> >> post each.
> >>
> >> How do you think I should do this?
> >>
> >> (a) Crawling normally and then removing the posts I don't want? That
> >> would waste a lot of time (and I don't know whether it is possible to
> >> remove posts, I guess it can be done with Lucene)
> >>
> >> (b) Or maybe with some "online" filtering I can apply to Nutch for it to
> >> do what I want properly?
> >>
> >> (c) Others?
> >>
> >> Thanks in advance,
> >>
> >> Alberto

