Don't most blogs have RSS feeds these days? Sounds like you'd save a lot of trouble by using a feed reader instead.
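If the blogs do expose feeds, pulling just the latest entry takes only a few lines of standard-library code. A minimal sketch (RSS 2.0 only, no Atom handling, no HTTP fetching — a real crawler would download the feed XML first, e.g. with urllib):

```python
import xml.etree.ElementTree as ET

def latest_post(rss_xml):
    """Return (title, link) of the first <item> in an RSS 2.0 feed.

    Most feeds list entries newest-first, so the first item is
    usually the latest post -- an assumption worth checking per feed.
    """
    root = ET.fromstring(rss_xml)
    item = root.find("./channel/item")
    if item is None:
        return None
    return (item.findtext("title"), item.findtext("link"))

# A tiny hand-made feed to illustrate; the URLs are made up.
sample = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>Newest post</title><link>http://example.com/3</link></item>
  <item><title>Older post</title><link>http://example.com/2</link></item>
</channel></rss>"""

print(latest_post(sample))  # ('Newest post', 'http://example.com/3')
```

That sidesteps the whole corpus-cleanup problem: you only ever fetch one post per blog.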
On 10/21/2010 05:45 PM, Alberto wrote:
> Hi everybody,
>
> I'm using Nutch to analyze trends in the blogosphere. That's why I'm
> only interested in the last post of every blog I crawl. The problem is
> that if I want to do a good crawl, I need to crawl the URLs of the
> entire blog, not just the last post. But if I do this, then I'll have a
> corpus of blogs with N posts each, instead of a corpus of blogs with 1
> post each.
>
> How do you think I should do this?
>
> (a) Crawl normally and then remove the posts I don't want? That would
> waste a lot of time (and I don't know whether it is possible to remove
> posts; I guess it can be done with Lucene)
>
> (b) Or maybe there is some "online" filtering I can apply to Nutch for
> it to do what I want?
>
> (c) Something else?
>
> Thanks in advance,
>
> Alberto
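As for option (b): Nutch's standard hook for this is the regex URL filter, which applies +/- patterns from conf/regex-urlfilter.txt to every discovered URL before it is fetched. It can only match URL patterns, though — it has no way of knowing which post is newest — so it only helps if the blogs use predictable archive/pagination URLs. The patterns below are purely illustrative, not real blog URL schemes:

```
# conf/regex-urlfilter.txt (illustrative patterns only)
# Skip archive pages and paginated listings
-/archive/
-[?&]page=
# Accept everything else
+.
```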

