Dominique,

Does your crawler support NTLM2 authentication? We have content behind SiteMinder that uses NTLM2, and that is posing challenges with Nutch.

-----Original Message-----
From: Dominique Bejean [mailto:dominique.bej...@eolya.fr]
Sent: Wednesday, March 02, 2011 6:22 AM
To: solr-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Aditya,

The crawler is not open source and won't be in the near future. Anyway, I have to change the license because it can be used for any personal or commercial projects.

Sincerely,
Dominique

On 02/03/11 10:02, findbestopensource wrote:
> Hello Dominique Bejean,
>
> Good job.
>
> We identified almost 8 open source web crawlers
> http://www.findbestopensource.com/tagged/webcrawler I don't know how
> far yours would be different from the rest.
>
> Your license states that it is not open source but it is free for
> personal use.
>
> Regards
> Aditya
> www.findbestopensource.com <http://www.findbestopensource.com>
>
>
> On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
> <dominique.bej...@eolya.fr <mailto:dominique.bej...@eolya.fr>> wrote:
>
>     Hi,
>
>     I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java
>     Web Crawler. It includes:
>
>       * a crawler
>       * a document processing pipeline
>       * a Solr indexer
>
>     The crawler has a web administration interface to manage the web
>     sites to be crawled. Each web site crawl is configured with a lot
>     of possible parameters (not all mandatory):
>
>       * number of simultaneous items crawled by site
>       * recrawl period rules based on item type (html, PDF, ...)
>       * item type inclusion / exclusion rules
>       * item path inclusion / exclusion / strategy rules
>       * max depth
>       * web site authentication
>       * language
>       * country
>       * tags
>       * collections
>       * ...
>
>     The pipeline includes various ready-to-use stages (text
>     extraction, language detection, Solr ready-to-index XML writer, ...).
>
>     Everything is very configurable and extensible, either by
>     scripting or Java coding.
>
>     With scripting technology, you can help the crawler handle
>     JavaScript links, or help the pipeline extract a relevant title
>     and clean up the HTML pages (remove menus, headers, footers, ...).
>
>     With Java coding, you can develop your own pipeline stages.
>
>     The Crawl Anywhere web site provides good explanations and screen
>     shots. Everything is documented in a wiki.
>
>     The current version is 1.1.4. You can download and try it out from
>     here: www.crawl-anywhere.com <http://www.crawl-anywhere.com>
>
>
>     Regards
>
>     Dominique