Re: [ANNOUNCE] Web Crawler

findbestopensource Wed, 02 Mar 2011 01:02:49 -0800

Hello Dominique Bejean,

Good job.


We identified almost 8 open source web crawlers:
http://www.findbestopensource.com/tagged/webcrawler
I don't know how much yours differs from the rest.

Your license states that it is not open source but it is free for personal
use.

Regards
Aditya
www.findbestopensource.com


On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
<dominique.bej...@eolya.fr> wrote:

> Hi,
>
> I would like to announce Crawl Anywhere. Crawl Anywhere is a Java web
> crawler. It includes:
>
>   * a crawler
>   * a document processing pipeline
>   * a Solr indexer
>
> The crawler has a web administration interface for managing the web sites
> to be crawled. Each web site crawl can be configured with many parameters
> (not all mandatory):
>
>   * number of simultaneous items crawled by site
>   * recrawl period rules based on item type (html, PDF, …)
>   * item type inclusion / exclusion rules
>   * item path inclusion / exclusion / strategy rules
>   * max depth
>   * web site authentication
>   * language
>   * country
>   * tags
>   * collections
>   * ...
>
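
To make the per-site parameters above concrete, here is a minimal Python
sketch of how max depth, item type rules and path inclusion / exclusion
rules could combine into one "should this URL be crawled?" decision. This is
not Crawl Anywhere's code or configuration schema: SiteRules, should_crawl
and every field name are hypothetical, and the "exclusion wins over
inclusion" ordering is an assumption.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class SiteRules:
    # Hypothetical per-site settings; names are illustrative only.
    max_depth: int = 3
    include_paths: list = field(default_factory=list)  # regexes; empty = allow all
    exclude_paths: list = field(default_factory=list)  # regexes; exclusion wins
    allowed_types: set = field(default_factory=lambda: {"html", "pdf"})


def should_crawl(url: str, depth: int, item_type: str, rules: SiteRules) -> bool:
    """Return True if the item passes every configured rule."""
    if depth > rules.max_depth:
        return False
    if item_type not in rules.allowed_types:
        return False
    path = urlparse(url).path or "/"
    # Assumption: an exclusion match rejects the item even if it is also included.
    if any(re.search(p, path) for p in rules.exclude_paths):
        return False
    if rules.include_paths:
        return any(re.search(p, path) for p in rules.include_paths)
    return True


rules = SiteRules(max_depth=2,
                  include_paths=[r"^/docs/"],
                  exclude_paths=[r"/private/"])
print(should_crawl("http://example.com/docs/a.html", 1, "html", rules))
```

A real crawler would also need the recrawl period, authentication and
politeness settings from the list; they are left out to keep the sketch short.
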
> The pipeline includes various ready-to-use stages (text extraction,
> language detection, Solr-ready XML writer, ...).
>
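
For readers unfamiliar with the idea, a document processing pipeline of the
kind described above can be sketched as a chain of functions that each enrich
a document. The Python below is only an illustration, not Crawl Anywhere's
implementation: the stage names, the dict-based document, and the Solr field
names (id, text, language) are assumptions, and the language "detector" is a
deliberately naive placeholder. The only non-hypothetical part is the shape of
Solr's XML update message (add / doc / field).

```python
import re
import xml.etree.ElementTree as ET


def extract_text(doc):
    # Crude tag stripping for illustration; a real stage would use an HTML parser.
    text = re.sub(r"<[^>]+>", " ", doc["html"])
    doc["text"] = " ".join(text.split())
    return doc


def detect_language(doc):
    # Placeholder heuristic, NOT a real language detector.
    words = set(doc["text"].lower().split())
    doc["language"] = "fr" if words & {"le", "la", "les", "et"} else "en"
    return doc


def to_solr_xml(doc):
    # Build a Solr XML update message; field names depend on your Solr schema.
    add = ET.Element("add")
    solr_doc = ET.SubElement(add, "doc")
    for name in ("id", "text", "language"):
        field = ET.SubElement(solr_doc, "field", name=name)
        field.text = doc[name]
    doc["solr_xml"] = ET.tostring(add, encoding="unicode")
    return doc


def run_pipeline(doc, stages):
    # Each stage receives the document and returns the enriched document.
    for stage in stages:
        doc = stage(doc)
    return doc


page = {"id": "http://example.com/",
        "html": "<html><body><p>Hello <b>world</b></p></body></html>"}
result = run_pipeline(page, [extract_text, detect_language, to_solr_xml])
print(result["solr_xml"])
```

Keeping every stage to the same "document in, document out" shape is what
lets stages be added, removed or reordered by configuration.
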
> Everything is highly configurable and extensible, either by scripting or
> by Java coding.
>
> With scripting, you can help the crawler handle JavaScript links, or help
> the pipeline extract the relevant title and clean up the HTML pages
> (remove menus, headers, footers, ...).
>
> With Java coding, you can develop your own pipeline stage.
>
> The Crawl Anywhere web site provides good explanations and screenshots.
> Everything is documented in a wiki.
>
> The current version is 1.1.4. You can download and try it out from here:
> www.crawl-anywhere.com
>
>
> Regards
>
> Dominique
>
>
