Hi,
Release 3.0.3 was tested with:
* Oracle Java 6 (it should work fine with version 7)
* Tomcat 5.5, 6, and 7
* PHP 5.2.x and 5.3.x
* Apache 2.2.x
* MongoDB 2.2 64-bit (known issue with 2.4)
The new release 4.0.0-alpha-2 is available on GitHub -
https://github.com/bejean/crawl-anywhere
The prerequisites are:
* Oracle Java 6 or higher
* Tomcat 5.5 or higher
* Apache 2.2 or higher
* PHP 5.2.x, 5.3.x, or 5.4.x
* MongoDB 2.2 64-bit or higher
* Solr 3.x or higher (configuration files provided for Solr 4.3.0)
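Before following the installation instructions, it can help to confirm which of these prerequisites are already on the machine. A minimal sketch, assuming the standard binary names are on the PATH (it only reports presence, not the minimum versions listed above):

```shell
#!/bin/sh
# Hypothetical sanity check for the 4.0.0-alpha-2 prerequisites.
# Reports whether each standard binary is on the PATH; it does not
# verify that the installed version meets the minimums listed above.
check() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

# java, php, mongod and apachectl are the usual binary names;
# adjust for your distribution (e.g. apache2ctl on Debian).
for tool in java php mongod apachectl; do
    check "$tool"
done
```

Tomcat and Solr are typically unpacked rather than installed system-wide, so check their installation directories by hand.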
The up-to-date installation instructions are here:
http://www.crawl-anywhere.com/installation-v400/
Please read the GitHub project home page; all the information is provided there.
Regards.
Dominique
On 23/05/13 07:38, Rajesh Nikam wrote:
Hi,
Crawl Anywhere seems to be using old versions of Java, Tomcat, etc.
http://www.crawl-anywhere.com/installation-v300/
Will it work with newer versions of the required software?
Is there an updated installation guide available?
Thanks
Rajesh
On Wed, May 22, 2013 at 6:48 PM, Dominique Bejean
<dominique.bej...@eolya.fr> wrote:
Hi,
Crawl-Anywhere is now open-source -
https://github.com/bejean/crawl-anywhere
Best regards.
On 02/03/11 10:02, findbestopensource wrote:
Hello Dominique Bejean,
Good job.
We have identified almost 8 open source web crawlers -
http://www.findbestopensource.com/tagged/webcrawler - and I don't
know how far yours differs from the rest.
Your license states that it is not open source, but that it is free
for personal use.
Regards
Aditya
www.findbestopensource.com
On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
<dominique.bej...@eolya.fr> wrote:
Hi,
I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java
web crawler. It includes:
* a crawler
* a document processing pipeline
* a Solr indexer
The crawler has a web administration interface for managing the web
sites to be crawled. Each web site crawl is configured with many
possible parameters (not all mandatory):
* number of simultaneous items crawled by site
* recrawl period rules based on item type (html, PDF, …)
* item type inclusion / exclusion rules
* item path inclusion / exclusion / strategy rules
* max depth
* web site authentication
* language
* country
* tags
* collections
* ...
The pipeline includes various ready-to-use stages (text
extraction, language detection, a Solr-ready XML writer, ...).
Everything is highly configurable and extensible, either through
scripting or Java coding.
With scripting, you can help the crawler handle JavaScript links,
or help the pipeline extract a relevant title and clean up the
HTML pages (remove menus, headers, footers, ...).
With Java coding, you can develop your own pipeline stages.
The Crawl Anywhere web site provides good explanations and
screenshots. Everything is documented in a wiki.
The current version is 1.1.4. You can download and try it out from
here: www.crawl-anywhere.com
Regards
Dominique
--
Dominique Béjean
+33 6 08 46 12 43
skype: dbejean
www.eolya.fr
www.crawl-anywhere.com
www.mysolrserver.com
--
Dominique Béjean
+33 6 08 46 12 43
skype: dbejean
www.eolya.fr
www.crawl-anywhere.com