Hi Robert,

True, but then the problem becomes how to get a corpus of feeds in the first place. I thought I had no choice, since I don't have a list of feeds to fetch.
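One common way to bootstrap such a list is RSS/Atom autodiscovery: most blog front pages advertise their feed with a <link rel="alternate"> tag in the page head, so a list of blog URLs can be turned into a list of feed URLs. A minimal sketch using only the Python standard library (the HTML below is a made-up example page):

```python
from html.parser import HTMLParser

class FeedLinkFinder(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate"> tags."""
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link"
                and "alternate" in (a.get("rel") or "").split()
                and a.get("type") in self.FEED_TYPES
                and a.get("href")):
            self.feeds.append(a["href"])

# Hypothetical page source; in practice this would come from an HTTP fetch.
html = """<html><head>
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
</head><body>...</body></html>"""

finder = FeedLinkFinder()
finder.feed(html)
print(finder.feeds)  # -> ['/feed.xml']
```

Relative hrefs like the one above would still need to be resolved against the page URL (e.g. with urllib.parse.urljoin) before fetching.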

On 21/10/10 17:49, Robert Douglass wrote:

Don't most blogs have RSS feeds these days? Sounds like you'd save a lot
of trouble by using a feed reader instead.
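Once the feeds are in hand, pulling only the newest post of each blog is cheap, since no full-site crawl is needed. A sketch against a made-up RSS 2.0 feed, assuming the usual newest-first item ordering (in practice pubDate should be checked rather than trusted):

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; in practice this would come from an HTTP fetch.
rss = """<rss version="2.0"><channel>
<title>Example blog</title>
<item><title>Newest post</title><link>http://example.com/p2</link></item>
<item><title>Older post</title><link>http://example.com/p1</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
# RSS 2.0 feeds conventionally list items newest-first,
# so the first <item> is the latest post.
latest = root.find("channel/item")
print(latest.findtext("title"))  # -> Newest post
print(latest.findtext("link"))   # -> http://example.com/p2
```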

On 10/21/2010 05:45 PM, Alberto wrote:
Hi everybody,

I'm using Nutch to analyze trends in the blogosphere. That's why I'm
only interested in the latest post of every blog I crawl. The problem is
that if I want to do a good crawl I need to fetch the URLs of the
entire blog, not just the latest post. But if I do that, I'll end up with
a corpus of blogs with N posts each, instead of a corpus of blogs with
one post each.

How do you think I should do this?

(a) Crawl normally and then remove the posts I don't want? That
would waste a lot of time (and I don't know whether it is possible to
remove posts; I guess it can be done with Lucene)

(b) Or maybe there is some "online" filtering I can apply to Nutch so
that it does what I want directly?

(c) Others?
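Regarding (b): Nutch filters URLs at fetch time through conf/regex-urlfilter.txt, so if the blogs' permalink structure is predictable, old posts can be excluded before they are ever fetched. A hypothetical fragment (the host and URL patterns here are invented and would need adapting per blog platform; rules are applied in order, first match wins):

```
# Hypothetical regex-urlfilter.txt fragment.
# Keep blog front pages, where the latest post usually appears...
+^http://([a-z0-9-]+\.)?example\.com/?$
# ...and drop archive/permalink pages so older posts are never fetched.
-^http://([a-z0-9-]+\.)?example\.com/.+
```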

Thanks in advance,

Alberto
