Re: [ANN] Itsy 0.1.0 released, a threaded web spider written in Clojure

2012-06-01 Thread László Török
Hi, interesting project. I was wondering, though: how do you make sure two crawlers do not crawl the same URL twice if there is no global state? :) If I read it correctly, you're going to have to spawn a lot of threads to have at least a few busy with extraction at any point in time, as most of them

Re: [ANN] Itsy 0.1.0 released, a threaded web spider written in Clojure

2012-06-01 Thread Michael Klishin
László Török: I was wondering though how do you make sure two crawlers do not crawl the same URL twice if there is no global state? :) By adding shared state: for a single app instance, typically an atom. As for separating different instances, it is not uncommon to hash seed URLs (or
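
To illustrate the atom-based approach mentioned above, here is a minimal sketch of per-instance URL de-duplication. It is not taken from Itsy; the names `seen-urls` and `claim-url!` are hypothetical, and `swap-vals!` assumes Clojure 1.9 or later.

```clojure
;; Hypothetical shared state for one app instance: the set of claimed URLs.
(def seen-urls (atom #{}))

(defn claim-url!
  "Atomically adds url to the set held in the atom `seen`.
  Returns true only for the caller whose swap actually added the URL,
  so at most one worker crawls any given URL."
  [seen url]
  ;; swap-vals! returns [old-value new-value] of the successful swap.
  (let [[old _] (swap-vals! seen conj url)]
    (not (contains? old url))))

;; Usage: a worker crawls a URL only if it wins the claim.
(claim-url! seen-urls "http://example.com/") ; first claim
(claim-url! seen-urls "http://example.com/") ; repeated claim
```

The first call returns true and the second returns false, so a worker can skip any URL it fails to claim. Checking membership and adding in a single atomic swap avoids the race that a separate `contains?` followed by `swap!` would have.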

Re: [ANN] Itsy 0.1.0 released, a threaded web spider written in Clojure

2012-06-01 Thread László Török
Hi, I don't want to turn this into a lengthy discussion about crawling, but I'm happy to continue off list. ;) In our experience, sitemaps work surprisingly well in certain domains (web shops powered by standard web shop software, large e-commerce sites) and can make life easier. Another point: i

[ANN] Itsy 0.1.0 released, a threaded web spider written in Clojure

2012-05-31 Thread Lee Hinman
Hi all, I'm pleased to announce the initial 0.1.0 release of Itsy. Itsy is a threaded web spider written in Clojure. A list of some of the Itsy features:
- Multithreaded, with the ability to add and remove workers as needed
- No global state, run