Interesting! That triple extractor and wdc parser could indeed be useful. It 
already uses Any23. I wonder how easily we could integrate it into Apache Tika 
and then use it in Nutch. But since it uses Any23, I wonder whether it relies 
on SAX events or on the HTML body as a whole, which would be bad.

I am also curious whether the scoring filter supports incremental crawls, as 
opposed to OPIC. If not, it might not be that interesting.

Does anyone know? :)

M.

 
 
-----Original message-----
> From:Christian Kunz <[email protected]>
> Sent: Thursday 17th December 2015 7:30
> To: [email protected]
> Subject: RE: Anthelion from Yahoo
> 
> Hi Otis,
> 
> I haven't tried it yet. I wrote a short article that roughly explains how it 
> works: 
> http://www.seo-suedwest.de/1398-yahoo-crawler-strukturierte-daten-open-source.html
> 
> If anyone has practical experience with it please let me know.
> 
> Regards,
> Christian
> 
> 
> 
> -----Original message-----
> From: Otis Gospodnetić [mailto:[email protected]] 
> Sent: Thursday, 17 December 2015 03:55
> To: [email protected]
> Subject: Anthelion from Yahoo
> 
> Hi,
> 
> FYI: https://github.com/yahoo/anthelion
> 
> Anyone tried using it yet?
> 
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> Elasticsearch Consulting Support Training - http://sematext.com/
> 
