Re: [ANNOUNCE] Web Crawler

Dominique Bejean Wed, 02 Mar 2011 06:47:03 -0800

Hi,

No, it doesn't. It looks like an Apache HttpClient 3.x limitation:
https://issues.apache.org/jira/browse/HTTPCLIENT-579


Dominique

On 02/03/11 15:04, Thumuluri, Sai wrote:

Dominique, does your crawler support NTLM2 authentication? We have content
under SiteMinder, which uses NTLM2, and that is posing challenges with Nutch.

-----Original Message-----
From: Dominique Bejean [mailto:dominique.bej...@eolya.fr]
Sent: Wednesday, March 02, 2011 6:22 AM
To: solr-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Aditya,

The crawler is not open source and won't be in the near future. However,
I have to change the license, because it can be used for any personal or
commercial project.

Sincerely,

Dominique

On 02/03/11 10:02, findbestopensource wrote:

Hello Dominique Bejean,

Good job.

We identified almost 8 open source web crawlers:
http://www.findbestopensource.com/tagged/webcrawler   I don't know how
different yours is from the rest.

Your license states that it is not open source but it is free for
personal use.

Regards
Aditya
http://www.findbestopensource.com


On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
<dominique.bej...@eolya.fr> wrote:

     Hi,

     I would like to announce Crawl Anywhere. Crawl Anywhere is a Java
     web crawler. It includes:

       * a crawler
       * a document processing pipeline
       * a Solr indexer

     The crawler has a web administration interface for managing the web
     sites to be crawled. Each web site crawl is configured with many
     possible parameters (not all mandatory):

       * number of items crawled simultaneously per site
       * recrawl period rules based on item type (HTML, PDF, ...)
       * item type inclusion / exclusion rules
       * item path inclusion / exclusion / strategy rules
       * max depth
       * web site authentication
       * language
       * country
       * tags
       * collections
       * ...
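
As an illustration of how path inclusion / exclusion rules and a max depth
might combine, here is a minimal sketch. It is not Crawl Anywhere's actual
configuration model: the class name PathRules, the method accepts, and the
choice of regular expressions with "exclusion wins" precedence are all
assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical rule set; NOT Crawl Anywhere's real configuration classes. */
class PathRules {
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();
    private final int maxDepth;

    PathRules(List<String> includeRegexes, List<String> excludeRegexes, int maxDepth) {
        for (String r : includeRegexes) includes.add(Pattern.compile(r));
        for (String r : excludeRegexes) excludes.add(Pattern.compile(r));
        this.maxDepth = maxDepth;
    }

    /** Returns true if an item at this path and link depth should be crawled. */
    boolean accepts(String path, int depth) {
        // Assumed policy: anything deeper than maxDepth is skipped.
        if (depth > maxDepth) return false;
        // Assumed policy: an exclusion match always wins over an inclusion.
        for (Pattern p : excludes) {
            if (p.matcher(path).find()) return false;
        }
        // With no inclusion rules, everything not excluded is accepted.
        if (includes.isEmpty()) return true;
        for (Pattern p : includes) {
            if (p.matcher(path).find()) return true;
        }
        return false;
    }
}
```

With include "^/docs/", exclude "\\.pdf$" and a max depth of 3, this sketch
would accept /docs/intro.html at depth 1 and reject /docs/manual.pdf, any
path outside /docs/, and anything found at depth 4 or more.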

     The pipeline includes various ready-to-use stages (text
     extraction, language detection, a writer producing Solr-ready XML
     for indexing, ...).

     Everything is highly configurable and extensible, either by
     scripting or by Java coding.

     With scripting, you can help the crawler handle JavaScript links,
     or help the pipeline extract a relevant title and clean up the
     HTML pages (remove menus, headers, footers, ...).

     With Java coding, you can develop your own pipeline stage.
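
To give a rough idea of what a custom pipeline stage could look like, here
is a minimal sketch. The Stage interface, the Pipeline runner, and the
representation of a document as a map of field names to values are all
hypothetical, invented for this example; the real Crawl Anywhere stage API
is described in its wiki and may look quite different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stage contract; NOT the actual Crawl Anywhere interface. */
interface Stage {
    Map<String, String> process(Map<String, String> doc);
}

/** Example custom stage: trims the "title" field and collapses whitespace runs. */
class TitleCleanupStage implements Stage {
    @Override
    public Map<String, String> process(Map<String, String> doc) {
        // Copy the input so the stage does not mutate the caller's document.
        Map<String, String> out = new LinkedHashMap<>(doc);
        String title = out.get("title");
        if (title != null) {
            out.put("title", title.trim().replaceAll("\\s+", " "));
        }
        return out;
    }
}

/** Runs stages in order, feeding each stage's output to the next one. */
class Pipeline {
    private final List<Stage> stages = new ArrayList<>();

    Pipeline add(Stage stage) {
        stages.add(stage);
        return this;
    }

    Map<String, String> run(Map<String, String> doc) {
        Map<String, String> current = doc;
        for (Stage stage : stages) {
            current = stage.process(current);
        }
        return current;
    }
}
```

A pipeline would then be assembled with something like
new Pipeline().add(new TitleCleanupStage()).run(doc), with further stages
(text extraction, language detection, ...) chained in the same way.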

     The Crawl Anywhere web site provides detailed explanations and
     screenshots. Everything is documented in a wiki.

     The current version is 1.1.4. You can download and try it out from
     here: http://www.crawl-anywhere.com


     Regards

     Dominique
