Hi everybody,
I'm using Nutch to analyze trends in the blogosphere, so I'm only
interested in the latest post of every blog I crawl. The problem is
that to crawl well I need to follow the URLs across the entire blog,
not just the latest post. But if I do that, I end up with a corpus of
blogs with N posts each, instead of a corpus of blogs with 1 post
each.
How do you think I should do this?
(a) Crawl normally and then remove the posts I don't want? That would
waste a lot of time (and I don't know whether it is possible to remove
posts; I guess it can be done with Lucene).
(b) Or is there some "online" filtering I can apply in Nutch so that it
does what I want in the first place?
(c) Others?
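
To make (a) concrete, here is a rough sketch of the post-filtering I have
in mind. It is plain Python, not Nutch or Lucene API: the function name
latest_post_per_blog is made up, and it assumes that each crawled post
comes with a publication date and that one host corresponds to one blog,
neither of which I have verified for my crawl.

```python
from datetime import datetime
from urllib.parse import urlparse


def latest_post_per_blog(posts):
    """Keep only the newest post for each blog.

    posts: iterable of (url, published) tuples, published being a datetime.
    Returns a dict mapping blog host -> (url, published).
    """
    latest = {}
    for url, published in posts:
        # Assumption: the host name identifies the blog (one blog per host).
        blog = urlparse(url).netloc
        if blog not in latest or published > latest[blog][1]:
            latest[blog] = (url, published)
    return latest


# Hypothetical crawl output, for illustration only.
crawled = [
    ("http://blog-a.example.com/2007/01/first", datetime(2007, 1, 5)),
    ("http://blog-a.example.com/2007/03/third", datetime(2007, 3, 9)),
    ("http://blog-a.example.com/2007/02/second", datetime(2007, 2, 1)),
    ("http://blog-b.example.com/only-post", datetime(2007, 2, 20)),
]

corpus = latest_post_per_blog(crawled)
```

With something like this the crawl itself stays unchanged and everything
except one post per blog is thrown away afterwards, which is exactly the
wasted work I would like to avoid if (b) is possible.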
Thanks in advance,
Alberto