Got it. It seems there is a lot of overlap here with the work that Sujen, Asitang, and our team at JPL have already done directly in Nutch to allow focused crawling based on Naive Bayes, and also to score similarity using cosine similarity. A cool project would be to compare the approaches (at least that’s what we’re working on), and now it looks like we have Anthelion to look at too.
The Any23 part is nice - I’ve always thought we should more actively integrate that into Nutch and/or Tika. Lewis has done some work on that as well.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: BlackIce <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, December 17, 2015 at 4:16 AM
To: "[email protected]" <[email protected]>
Subject: Re: Anthelion from Yahoo

>Interesting indeed, in more than one way... This is just a plug-in, right?
>So it can be compiled with Nutch 1.11?
>
>On Thu, Dec 17, 2015 at 10:25 AM, Markus Jelsma <[email protected]> wrote:
>
>> Interesting! That triple extractor and WDC parser could be useful indeed!
>> It already uses Any23. I wonder how easily we could integrate it into
>> Apache Tika, and then use it in Nutch! But since it does use Any23, I
>> wonder if it relies on SAX events, or the HTML body as a whole, which is
>> bad.
>>
>> I am also curious whether the scoring filter supports incremental crawls,
>> as opposed to OPIC. If not, it might not be that interesting.
>>
>> Does anyone know? :)
>>
>> M.
>>
>> -----Original message-----
>> > From: Christian Kunz <[email protected]>
>> > Sent: Thursday 17th December 2015 7:30
>> > To: [email protected]
>> > Subject: Re: Anthelion from Yahoo
>> >
>> > Hi Otis,
>> >
>> > I haven't tried it yet. I wrote a little article that explains roughly
>> > how it works:
>> > http://www.seo-suedwest.de/1398-yahoo-crawler-strukturierte-daten-open-source.html
>> >
>> > If anyone has practical experience with it, please let me know.
>> >
>> > Regards,
>> > Christian
>> >
>> > -----Original Message-----
>> > From: Otis Gospodnetić [mailto:[email protected]]
>> > Sent: Thursday, December 17, 2015 03:55
>> > To: [email protected]
>> > Subject: Anthelion from Yahoo
>> >
>> > Hi,
>> >
>> > FYI: https://github.com/yahoo/anthelion
>> >
>> > Anyone tried using it yet?
>> >
>> > Otis
>> > --
>> > Monitoring - Log Management - Alerting - Anomaly Detection
>> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/

