Interesting! That triple extractor and WDC parser could indeed be useful. It already uses any23. I wonder how easily we could integrate it into Apache Tika and then use it in Nutch. But since it uses any23, I wonder whether it relies on SAX events or on the HTML body as a whole, which would be bad.
I am also curious whether the scoring filter supports incremental crawls, as opposed to OPIC. If not, it might not be that interesting. Does anyone know? :)

M.

-----Original message-----
> From: Christian Kunz <[email protected]>
> Sent: Thursday 17th December 2015 7:30
> To: [email protected]
> Subject: AW: Anthelion from Yahoo
>
> Hi Otis,
>
> haven't tried it yet. I wrote a little article that explains roughly how it
> works:
> http://www.seo-suedwest.de/1398-yahoo-crawler-strukturierte-daten-open-source.html
>
> If anyone has practical experience with it please let me know.
>
> Regards,
> Christian
>
> -----Original Message-----
> From: Otis Gospodnetić [mailto:[email protected]]
> Sent: Thursday, 17 December 2015 03:55
> To: [email protected]
> Subject: Anthelion from Yahoo
>
> Hi,
>
> FYI: https://github.com/yahoo/anthelion
>
> Anyone tried using it yet?
>
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr &
> Elasticsearch Consulting Support Training - http://sematext.com/

