Viewing the indexing result, which I think is part of what you are describing, would be a nice job for such an indexing framework.
Do you guys know whether such a feature is already out there?

paul

On 2 March 2011 at 12:20, Geert-Jan Brits wrote:
> Hi Dominique,
>
> This looks nice.
> In the past, I've been interested in (semi-)automatically inducing a
> schema/wrapper from a set of example web pages (often called 'wrapper
> induction' in the scientific field).
> This would allow for fast schema creation, which could be used as a basis for
> extraction.
>
> Lately I've been looking for crawlers that incorporate this technology, but
> without success.
> Any plans on incorporating this?
>
> Cheers,
> Geert-Jan
>
> 2011/3/2 Dominique Bejean <dominique.bej...@eolya.fr>
>
>> Rosa,
>>
>> In the pipeline, there is a stage that extracts the text from the original
>> document (PDF, HTML, ...).
>> It is possible to plug in scripts (Java 6 compliant) in order to keep only
>> the relevant parts of the document.
>> See
>> http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
>>
>> Dominique
>>
>> On 02/03/11 09:36, Rosa (Anuncios) wrote:
>>
>>> Nice job!
>>>
>>> It would be good to be able to extract specific data from a given page via
>>> XPath, though.
>>>
>>> Regards,
>>>
>>> On 02/03/2011 01:25, Dominique Bejean wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java web
>>>> crawler. It includes:
>>>>
>>>> * a crawler
>>>> * a document processing pipeline
>>>> * a Solr indexer
>>>>
>>>> The crawler has a web administration interface in order to manage the web
>>>> sites to be crawled. Each web site crawl is configured with a lot of
>>>> possible parameters (not all mandatory):
>>>>
>>>> * number of simultaneous items crawled per site
>>>> * recrawl period rules based on item type (HTML, PDF, …)
>>>> * item type inclusion / exclusion rules
>>>> * item path inclusion / exclusion / strategy rules
>>>> * max depth
>>>> * web site authentication
>>>> * language
>>>> * country
>>>> * tags
>>>> * collections
>>>> * ...
>>>>
>>>> The pipeline includes various ready-to-use stages (text extraction,
>>>> language detection, Solr ready-to-index XML writer, ...).
>>>>
>>>> Everything is very configurable and extensible, either by scripting or
>>>> Java coding.
>>>>
>>>> With scripting, you can help the crawler handle JavaScript links, or help
>>>> the pipeline extract a relevant title and clean up the HTML pages (remove
>>>> menus, headers, footers, ...).
>>>>
>>>> With Java coding, you can develop your own pipeline stages.
>>>>
>>>> The Crawl Anywhere web site provides good explanations and screenshots.
>>>> Everything is documented in a wiki.
>>>>
>>>> The current version is 1.1.4. You can download and try it out from here:
>>>> www.crawl-anywhere.com
>>>>
>>>> Regards,
>>>>
>>>> Dominique
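
As a side note on Rosa's XPath request: independently of whatever hook Crawl Anywhere exposes for this, the basic idea is easy to sketch in plain Java with the JDK's built-in javax.xml.xpath. The class name, the sample XHTML, and the XPath expressions below are my own illustration, not part of the Crawl Anywhere pipeline API:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathExtractSketch {
    public static void main(String[] args) throws Exception {
        // A well-formed XHTML fragment standing in for a fetched page.
        String xhtml =
            "<html><body>"
          + "<h1>Product title</h1>"
          + "<div id=\"price\">42.00</div>"
          + "<div id=\"menu\">navigation we do not want</div>"
          + "</body></html>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();

        // Select only the parts we care about; menus, headers, footers, etc.
        // are simply never matched by the expressions.
        String title = (String) xpath.evaluate("//h1/text()", doc, XPathConstants.STRING);
        String price = (String) xpath.evaluate("//div[@id='price']/text()", doc, XPathConstants.STRING);

        System.out.println("title = " + title);
        System.out.println("price = " + price);
    }
}

Of course, real-world HTML is rarely well-formed XML, so in practice you would put an HTML/tag-soup parser in front of the XPath step (or do the selection inside one of the pipeline's scriptable stages), but the extraction logic itself stays this small.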