Hi everybody,

I'm using Nutch to analyze trends in the blogosphere, so I'm only interested in the latest post of every blog I crawl. The problem is that to crawl well I need to follow the URLs across the entire blog, not just the latest post. But if I do that, I end up with a corpus of blogs with N posts each, instead of a corpus of blogs with one post each.

How do you think I should do this?

(a) Crawl normally and then remove the posts I don't want? That would waste a lot of time (and I don't know whether posts can be removed afterwards; I guess it can be done with Lucene).

(b) Or maybe apply some "online" filtering in Nutch so that it does what I want in the first place?

(c) Something else?

Thanks in advance,

Alberto
