Hi,

It seems they calculate features based on the link graph within the website and domain:
https://github.com/yahoo/anthelion/blob/42deeadc38f99af6fc053ecddc19ce1bfe0c1495/anthelion/src/main/java/com/yahoo/research/robme/anthelion/framework/AnthOnlineClassifier.java#L149
and there are not many of them.

Then they use online learning algorithms from MOA:
http://moa.cms.waikato.ac.nz/details/classification/classifiers-2/
So they collect a batch of training samples and then re-train on the whole batch. However, it’s hard to guess at which stage this happens: fetch, parse, mergedb, or some other?

Overall, I think this can be easily replicated in any other configuration.

A.

> On 17 Dec 2015, at 18:57, Mattmann, Chris A (3980) <[email protected]> wrote:
>
> Got it.
>
> Seems like there is great overlap here with the work that Sujen
> and Asitang and our team at JPL already did directly in Nutch
> to allow focused crawling based on Naive Bayes and also scoring
> similarity using cosine similarity. A cool project would be to
> compare the approaches (at least that’s what we’re working on)
> and now it looks like we have Anthelion too to look at.
>
> The Any23 part is nice - I’ve always thought we should more
> actively integrate that into Nutch and/or Tika. Lewis had
> done some work on that as well.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: BlackIce <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, December 17, 2015 at 4:16 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Anthelion from Yahoo
>
>> Interesting indeed, in more than one way... This is just a plug-in, right?
>> So it can be compiled with Nutch 1.11?
>>
>> On Thu, Dec 17, 2015 at 10:25 AM, Markus Jelsma <[email protected]> wrote:
>>
>>> Interesting! That triple extractor and wdc parser could be useful indeed!
>>> It already uses any23. I wonder how easily we could integrate it into
>>> Apache Tika, and then use it in Nutch! But since it does use any23, I
>>> wonder if it relies on SAX events, or on the HTML body as a whole,
>>> which is bad.
>>>
>>> I am also curious whether the scoring filter supports incremental crawls
>>> as opposed to OPIC. If not, it might not be that interesting.
>>>
>>> Does anyone know? :)
>>>
>>> M.
>>>
>>> -----Original message-----
>>>> From: Christian Kunz <[email protected]>
>>>> Sent: Thursday 17th December 2015 7:30
>>>> To: [email protected]
>>>> Subject: AW: Anthelion from Yahoo
>>>>
>>>> Hi Otis,
>>>>
>>>> I haven't tried it yet. I wrote a little article that explains roughly
>>>> how it works:
>>>> http://www.seo-suedwest.de/1398-yahoo-crawler-strukturierte-daten-open-source.html
>>>>
>>>> If anyone has practical experience with it, please let me know.
>>>>
>>>> Regards,
>>>> Christian
>>>>
>>>> -----Original Message-----
>>>> From: Otis Gospodnetić [mailto:[email protected]]
>>>> Sent: Thursday, 17 December 2015 03:55
>>>> To: [email protected]
>>>> Subject: Anthelion from Yahoo
>>>>
>>>> Hi,
>>>>
>>>> FYI: https://github.com/yahoo/anthelion
>>>>
>>>> Anyone tried using it yet?
>>>>
>>>> Otis
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
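
P.S. To make the "collect a batch, then train on it" idea from the top of this message concrete, here is a rough sketch in Python (chosen only for brevity). It is not Anthelion's or MOA's code: the class name, the feature names such as "parent_has_data", and the batch size are all hypothetical, and I have swapped in a tiny Naive Bayes with add-one smoothing as a stand-in for whichever MOA classifier they actually use. It only shows the pattern: labelled samples are buffered and the model is updated once per full batch.

```python
import math
from collections import Counter, defaultdict


class BatchedNaiveBayes:
    """Naive Bayes over set-valued features, updated one batch at a time."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []                             # samples waiting for the next update
        self.label_counts = Counter()                # label -> number of trained samples
        self.feature_counts = defaultdict(Counter)   # label -> feature -> count
        self.vocabulary = set()                      # every feature seen in training

    def observe(self, features, label):
        """Queue one labelled sample; update the model when the batch is full."""
        self.buffer.append((frozenset(features), label))
        if len(self.buffer) >= self.batch_size:
            self._train_on_buffer()

    def _train_on_buffer(self):
        """Fold every buffered sample into the counts, then empty the buffer."""
        for features, label in self.buffer:
            self.label_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocabulary.update(features)
        self.buffer.clear()

    def predict(self, features):
        """Return the most likely label, or None if no batch has been trained yet."""
        if not self.label_counts:
            return None
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            seen = sum(self.feature_counts[label].values())
            # The extra +1 in the denominator reserves mass for unseen features.
            denominator = seen + len(self.vocabulary) + 1
            score = math.log(count / total)
            for feature in features:
                score += math.log((self.feature_counts[label][feature] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical link-graph features: did the parent page or the domain
# already yield structured data?
clf = BatchedNaiveBayes(batch_size=2)
clf.observe({"parent_has_data"}, True)
clf.observe({"parent_no_data"}, False)
clf.observe({"parent_has_data", "domain_has_data"}, True)
clf.observe({"parent_no_data"}, False)
print(clf.predict({"parent_has_data"}))
```

In a Nutch setting the open question from above remains where observe() and predict() would be called (fetch, parse, mergedb or elsewhere); the sketch says nothing about that.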

