Hi,

The crawler comes with an extensible document processing pipeline. If you know of Java libraries or web services for 'wrapper induction' processing, it is possible to implement a dedicated stage in the pipeline.
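
For illustration, a custom stage could look roughly like the sketch below. The type names (PipelineStage-like contract, CrawledDocument, WrapperModel) are hypothetical placeholders, not the actual Crawl Anywhere API; the wiki describes the real stage contract.

    import java.util.Map;

    // Hypothetical placeholders -- the real Crawl Anywhere types differ.
    interface CrawledDocument {
        String getRawContent();
        void addField(String name, String value);
    }

    interface WrapperModel {
        Map<String, String> extract(String html); // rules induced from example pages
    }

    // Sketch of a dedicated pipeline stage applying induced extraction rules.
    public class WrapperInductionStage {
        private final WrapperModel model;

        public WrapperInductionStage(WrapperModel model) {
            this.model = model;
        }

        public void process(CrawledDocument doc) {
            // Apply the induced rules to the raw HTML and store the
            // resulting structured fields on the document.
            Map<String, String> fields = model.extract(doc.getRawContent());
            for (Map.Entry<String, String> e : fields.entrySet()) {
                doc.addField(e.getKey(), e.getValue());
            }
        }
    }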

Dominique

On 02/03/11 12:20, Geert-Jan Brits wrote:
Hi Dominique,

This looks nice.
In the past, I've been interested in (semi-)automatically inducing a schema/wrapper from a set of example web pages (often called 'wrapper induction' in the scientific field). This would allow for fast schema creation, which could be used as a basis for extraction.

Lately I've been looking for crawlers that incorporate this technology, but without success.
Any plans on incorporating this?

Cheers,
Geert-Jan

2011/3/2 Dominique Bejean <dominique.bej...@eolya.fr <mailto:dominique.bej...@eolya.fr>>

    Rosa,

    In the pipeline, there is a stage that extracts the text from the
    original document (PDF, HTML, ...).
    It is possible to plug in scripts (Java 6 compliant) in order to
    keep only the relevant parts of the document.
    See
    http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
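
    As a purely illustrative example (the actual script entry point is
    described on the DocTextExtractor wiki page above; the method name
    and the content markers below are invented), a Java 6 compliant
    script could keep only the main content block of a page like this:

        // Illustrative only -- the markers are site-specific.
        public String keepMainContent(String html) {
            int start = html.indexOf("<!-- begin content -->");
            int end = html.indexOf("<!-- end content -->");
            if (start >= 0 && end > start) {
                return html.substring(start, end);
            }
            return html; // fall back to the full page
        }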

    Dominique

    On 02/03/11 09:36, Rosa (Anuncios) wrote:

        Nice job!

        It would be good to be able to extract specific data from a
        given page via XPath, though.
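
        For example, with the plain JDK XPath API (assuming the fetched
        HTML has first been converted to well-formed XHTML, which raw
        HTML usually is not), something along these lines would do:

            import javax.xml.parsers.DocumentBuilderFactory;
            import javax.xml.xpath.XPath;
            import javax.xml.xpath.XPathFactory;
            import org.w3c.dom.Document;

            public class XPathExtractor {
                // Returns the string value of the first node matching expr.
                public static String extract(java.io.InputStream page, String expr)
                        throws Exception {
                    Document doc = DocumentBuilderFactory.newInstance()
                            .newDocumentBuilder().parse(page);
                    XPath xpath = XPathFactory.newInstance().newXPath();
                    return xpath.evaluate(expr, doc);
                }
            }

            // e.g. XPathExtractor.extract(stream, "//div[@id='price']/text()")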

        Regards,


        On 02/03/2011 01:25, Dominique Bejean wrote:

            Hi,

            I would like to announce Crawl Anywhere. Crawl Anywhere is
            a Java web crawler. It includes:

              * a crawler
              * a document processing pipeline
              * a Solr indexer

            The crawler has a web administration interface for managing
            the web sites to be crawled. Each web site crawl is configured
            with many possible parameters (not all mandatory):

              * number of items crawled simultaneously per site
              * recrawl period rules based on item type (HTML, PDF, …)
              * item type inclusion / exclusion rules
              * item path inclusion / exclusion / strategy rules
              * max depth
              * web site authentication
              * language
              * country
              * tags
              * collections
              * ...

            The pipeline includes various ready-to-use stages (text
            extraction, language detection, a writer producing
            Solr-ready XML for indexing, ...).

            Everything is highly configurable and extensible, either by
            scripting or by Java coding.

            With scripting, you can help the crawler handle JavaScript
            links, or help the pipeline extract the relevant title and
            clean up the HTML pages (remove menus, headers, footers, ...).
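
            As a rough, purely hypothetical illustration (the real
            script hooks are documented in the wiki; the markers and
            URL pattern below are invented), such scripts could look
            like:

                // Illustrative only, Java 6 style.
                public String cleanupHtml(String html) {
                    // Drop site-specific navigation blocks before text extraction.
                    return html.replaceAll("(?s)<div id=\"menu\">.*?</div>", "")
                               .replaceAll("(?s)<div id=\"footer\">.*?</div>", "");
                }

                public String resolveJavascriptLink(String href) {
                    // e.g. turn javascript:openPage('123') into a crawlable URL.
                    java.util.regex.Matcher m = java.util.regex.Pattern
                            .compile("openPage\\('(\\d+)'\\)").matcher(href);
                    return m.find() ? "http://www.example.com/page/" + m.group(1) : href;
                }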

            With Java coding, you can develop your own pipeline stages.

            The Crawl Anywhere web site provides good explanations and
            screenshots. Everything is documented in a wiki.

            The current version is 1.1.4. You can download and try it
            out from here: www.crawl-anywhere.com
            <http://www.crawl-anywhere.com>


            Regards

            Dominique




