Hi,
interesting project. I was wondering though how do you make sure two
crawlers do not crawl the same URL twice if there is no global state? :)
If I read it correctly, you're going to have to spawn a lot of threads to
have at least a few busy with extraction at any point in time, as most of
them
László Török:
I was wondering though how do you make sure two
crawlers do not crawl the same URL twice if there is no global state? :)
By adding shared state. For a single app instance, that is typically an atom.
As for separating different instances, it is not uncommon to hash seed URLs (or
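
To make the two ideas above concrete, here is a rough sketch. It is in Python rather than Clojure, so the lock-guarded set is only an analogue of an atom holding a set, not Itsy's actual code. The names `SeenUrls`, `claim`, and `owner_instance` are made up for illustration, and hashing the whole seed URL with SHA-1 is an assumption about how the partitioning might be done.

```python
import hashlib
import threading


class SeenUrls:
    """Shared, thread-safe record of URLs already claimed by a worker.

    Hypothetical stand-in for a Clojure atom wrapping a set.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def claim(self, url):
        """Return True only for the first caller that claims this URL."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True


def owner_instance(seed_url, instance_count):
    """Map a seed URL to one of instance_count crawler instances.

    A stable digest is used instead of the built-in hash(), which is
    randomized per process for strings and so differs between instances.
    """
    digest = hashlib.sha1(seed_url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % instance_count
```

A worker would call `claim(url)` before fetching and skip the URL when it returns False. Each instance would only accept seeds for which `owner_instance(seed, n)` equals its own index.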
Hi,
I don't want to turn this into a lengthy discussion about crawling, but I'm
happy to continue off list. ;)
Sitemaps work surprisingly well in certain domains (web shops powered by
standard web shop software, large e-commerce sites) and can make life
easier based on our experience.
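
For what it's worth, pulling URLs out of a sitemap takes very little code. The sketch below is in Python and assumes a standard sitemaps.org 0.9 document. The function name `sitemap_urls` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Illustrative sample only, not taken from any real site.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://shop.example.com/</loc></url>
  <url><loc>http://shop.example.com/products/1</loc></url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


print(sitemap_urls(SAMPLE))
```

Sitemap index files also use `<loc>`, so with such a file this returns the child sitemap URLs, which would then need fetching in turn.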
Another point: i
Hi all,
I'm pleased to announce the initial 0.1.0 release of Itsy. Itsy is a
threaded web spider written in Clojure. A list of some of the Itsy
features:
- Multithreaded, with the ability to add and remove workers as needed
- No global state, run