Crawl the whole blog, but store just the last post

Alberto Thu, 21 Oct 2010 08:46:08 -0700

Hi everybody,

I'm using Nutch to analyze trends in the blogosphere, so I'm only interested in the last (most recent) post of every blog I crawl. The problem is that, to crawl well, I need to follow the URLs across the entire blog, not just the last post. But if I do that, I'll end up with a corpus of blogs with N posts each, instead of a corpus of blogs with one post each.


How do you think I should do this?

(a) Crawl normally and then remove the posts I don't want? That would waste a lot of time (and I don't know whether it is possible to remove posts; I guess it can be done with Lucene).

(b) Or maybe apply some "online" filtering to Nutch so that it does what I want during the crawl?


(c) Others?
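
To make option (a) concrete, here is a rough sketch of the post-processing step I have in mind. It is plain Python and does not use any Nutch or Lucene API. It assumes the crawled posts have already been exported as records with hypothetical fields "blog", "url" and "published" (those names are my own invention, not something Nutch produces).

```python
from datetime import datetime


def keep_latest_post(posts):
    """Return one record per blog: the one with the newest 'published' date.

    'posts' is an iterable of dicts with the hypothetical keys
    'blog', 'url' and 'published' (a datetime).
    """
    latest = {}
    for post in posts:
        blog = post["blog"]
        # Keep this post if the blog is new or the post is more recent.
        if blog not in latest or post["published"] > latest[blog]["published"]:
            latest[blog] = post
    return list(latest.values())


# Made-up sample data: two blogs, three posts in total.
posts = [
    {"blog": "a.example.com", "url": "http://a.example.com/1",
     "published": datetime(2010, 10, 1)},
    {"blog": "a.example.com", "url": "http://a.example.com/2",
     "published": datetime(2010, 10, 20)},
    {"blog": "b.example.com", "url": "http://b.example.com/1",
     "published": datetime(2010, 9, 15)},
]

for post in keep_latest_post(posts):
    print(post["blog"], post["url"])
```

This would still mean fetching and parsing every post first, which is the wasted time I mention in (a).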

Thanks in advance,

Alberto
