Viewing the indexing result, which I think is part of what you are describing, would be a nice job for such an indexing framework.
Do you guys know whether such a feature is already out there?

paul

On 2 March 2011 at 12:20, Geert-Jan Brits wrote:
> Hi Dominique,
>
> This looks nice.
> In the past, I've been interested in (semi-)automatically inducing a
> schema/wrapper from a set of example web pages (often called 'wrapper
> induction' in the scientific field).
> This would allow for fast schema creation, which could be used as a basis for
> extraction.
>
> Lately I've been looking for crawlers that incorporate this technology, but
> without success.
> Any plans on incorporating this?
>
> Cheers,
> Geert-Jan
>
> 2011/3/2 Dominique Bejean <dominique.bej...@eolya.fr>
>
>> Rosa,
>>
>> In the pipeline, there is a stage that extracts the text from the original
>> document (PDF, HTML, ...).
>> It is possible to plug in scripts (Java 6 compliant) in order to keep only
>> the relevant parts of the document.
>> See
>> http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
>>
>> Dominique
>>
>> On 02/03/11 09:36, Rosa (Anuncios) wrote:
>>
>>> Nice job!
>>>
>>> It would be good to be able to extract specific data from a given page via
>>> XPath, though.
>>>
>>> Regards,
>>>
>>> On 02/03/2011 01:25, Dominique Bejean wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java web
>>>> crawler. It includes:
>>>>
>>>> * a crawler
>>>> * a document processing pipeline
>>>> * a Solr indexer
>>>>
>>>> The crawler has a web administration interface in order to manage the web
>>>> sites to be crawled. Each web site crawl is configured with a lot of
>>>> possible parameters (not all mandatory):
>>>>
>>>> * number of simultaneous items crawled per site
>>>> * recrawl period rules based on item type (HTML, PDF, …)
>>>> * item type inclusion / exclusion rules
>>>> * item path inclusion / exclusion / strategy rules
>>>> * max depth
>>>> * web site authentication
>>>> * language
>>>> * country
>>>> * tags
>>>> * collections
>>>> * ...
>>>>
>>>> The pipeline includes various ready-to-use stages (text extraction,
>>>> language detection, Solr ready-to-index XML writer, ...).
>>>>
>>>> Everything is very configurable and extensible, either by scripting or
>>>> Java coding.
>>>>
>>>> With scripting, you can help the crawler handle JavaScript links, or help
>>>> the pipeline extract a relevant title and clean up the HTML pages (remove
>>>> menus, headers, footers, ...).
>>>>
>>>> With Java coding, you can develop your own pipeline stages.
>>>>
>>>> The Crawl Anywhere web site provides good explanations and screenshots.
>>>> Everything is documented in a wiki.
>>>>
>>>> The current version is 1.1.4. You can download and try it out from here:
>>>> www.crawl-anywhere.com
>>>>
>>>> Regards,
>>>>
>>>> Dominique
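
As a side note on Rosa's XPath request: independently of whatever hook Crawl Anywhere exposes for this, the basic idea is easy to sketch in plain Java with the JDK's built-in javax.xml.xpath. The class name, the sample XHTML, and the XPath expressions below are my own illustration, not part of the Crawl Anywhere pipeline API:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathExtractSketch {
    public static void main(String[] args) throws Exception {
        // A well-formed XHTML fragment standing in for a fetched page.
        String xhtml =
            "<html><body>"
          + "<h1>Product title</h1>"
          + "<div id=\"price\">42.00</div>"
          + "<div id=\"menu\">navigation we do not want</div>"
          + "</body></html>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();

        // Select only the parts we care about; menus, headers, footers, etc.
        // are simply never matched by the expressions.
        String title = (String) xpath.evaluate("//h1/text()", doc, XPathConstants.STRING);
        String price = (String) xpath.evaluate("//div[@id='price']/text()", doc, XPathConstants.STRING);

        System.out.println("title = " + title);
        System.out.println("price = " + price);
    }
}

Of course, real-world HTML is rarely well-formed XML, so in practice you would put an HTML/tag-soup parser in front of the XPath step (or do the selection inside one of the pipeline's scriptable stages), but the extraction logic itself stays this small.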