Don't most blogs have RSS feeds these days? Sounds like you'd save a lot of trouble by using a feed reader instead.
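If the blogs do expose feeds, pulling just the latest entry takes only a few lines of standard-library code. A minimal sketch (RSS 2.0 only, no Atom handling, no HTTP fetching — a real crawler would download the feed XML first, e.g. with urllib):

```python
import xml.etree.ElementTree as ET

def latest_post(rss_xml):
    """Return (title, link) of the first <item> in an RSS 2.0 feed.

    Most feeds list entries newest-first, so the first item is
    usually the latest post -- an assumption worth checking per feed.
    """
    root = ET.fromstring(rss_xml)
    item = root.find("./channel/item")
    if item is None:
        return None
    return (item.findtext("title"), item.findtext("link"))

# A tiny hand-made feed to illustrate; the URLs are made up.
sample = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>Newest post</title><link>http://example.com/3</link></item>
  <item><title>Older post</title><link>http://example.com/2</link></item>
</channel></rss>"""

print(latest_post(sample))  # ('Newest post', 'http://example.com/3')
```

That sidesteps the whole corpus-cleanup problem: you only ever fetch one post per blog.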
On 10/21/2010 05:45 PM, Alberto wrote:
> Hi everybody,
>
> I'm using Nutch to analyze trends in the blogosphere. That's why I'm
> only interested in the last post of every blog I crawl. The problem is
> that if I want to do a good crawl, I need to crawl the URLs of the
> entire blog, not just the last post. But if I do this, then I'll have a
> corpus of blogs with N posts each, instead of a corpus of blogs with 1
> post each.
>
> How do you think I should do this?
>
> (a) Crawl normally and then remove the posts I don't want? That would
> waste a lot of time (and I don't know whether it is possible to remove
> posts; I guess it can be done with Lucene)
>
> (b) Or maybe there is some "online" filtering I can apply to Nutch for
> it to do what I want?
>
> (c) Something else?
>
> Thanks in advance,
>
> Alberto
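As for option (b): Nutch's standard hook for this is the regex URL filter, which applies +/- patterns from conf/regex-urlfilter.txt to every discovered URL before it is fetched. It can only match URL patterns, though — it has no way of knowing which post is newest — so it only helps if the blogs use predictable archive/pagination URLs. The patterns below are purely illustrative, not real blog URL schemes:

```
# conf/regex-urlfilter.txt (illustrative patterns only)
# Skip archive pages and paginated listings
-/archive/
-[?&]page=
# Accept everything else
+.
```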

