Hi,

It seems they calculate features based on the link graph within a website and domain:
https://github.com/yahoo/anthelion/blob/42deeadc38f99af6fc053ecddc19ce1bfe0c1495/anthelion/src/main/java/com/yahoo/research/robme/anthelion/framework/AnthOnlineClassifier.java#L149
 

and there aren’t many of them. They then use online learning algorithms
from MOA:
http://moa.cms.waikato.ac.nz/details/classification/classifiers-2/ 
So they collect a batch of training samples and then do re-training on the 
whole batch.
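
To make the buffering idea concrete, here is a minimal sketch in plain Python. It is not Anthelion's code and does not use MOA: the class name, the feature names and the perceptron-style update are all my own assumptions, chosen only to show "collect a batch, then train on the whole batch".

```python
class BatchedOnlineClassifier:
    """Hypothetical sketch: buffer labelled samples, train when the buffer is full.

    Not Anthelion's or MOA's API. A simple perceptron stands in for
    whichever online learner is actually configured.
    """

    def __init__(self, batch_size=2, learning_rate=1.0):
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.weights = {}   # feature name -> weight
        self.buffer = []    # pending (features, label) pairs
        self.updates = 0    # number of batch training passes so far

    def score(self, features):
        # Linear score over sparse features given as {name: value}.
        return sum(self.weights.get(name, 0.0) * value
                   for name, value in features.items())

    def predict(self, features):
        # +1 = page expected to be relevant, -1 = not relevant.
        return 1 if self.score(features) > 0 else -1

    def add_sample(self, features, label):
        # Collect feedback; train only once a full batch is available.
        self.buffer.append((features, label))
        if len(self.buffer) >= self.batch_size:
            self._train_on_buffer()

    def _train_on_buffer(self):
        # One perceptron pass over the whole buffered batch.
        for features, label in self.buffer:
            if self.predict(features) != label:
                for name, value in features.items():
                    old = self.weights.get(name, 0.0)
                    self.weights[name] = old + self.learning_rate * label * value
        self.buffer.clear()
        self.updates += 1


clf = BatchedOnlineClassifier(batch_size=2)
# Feature names below are made up for illustration.
clf.add_sample({"semantic_siblings_in_domain": 1.0}, 1)
clf.add_sample({"plain_siblings_in_domain": 1.0}, -1)
print(clf.predict({"semantic_siblings_in_domain": 1.0}))
```

Where add_sample() would be called from in a Nutch job is exactly the open question below.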

However, it’s hard to tell at which stage this happens: fetch, parse, mergedb or
some other?

Overall, I think this can be easily replicated in any other configuration.

A.



> On 17 Dec 2015, at 18:57, Mattmann, Chris A (3980) 
> <[email protected]> wrote:
> 
> Got it.
> 
> Seems like there is great overlap here with the work that Sujen
> and Asitang and our team at JPL already did directly in Nutch
> to allow focused crawling based on Naive Bayes and also scoring
> similarity using cosine similarity. A cool project would be to
> compare the approaches (at least that’s what we’re working on)
> and now it looks like we have Anthelion too to look at.
> 
> The Any23 part is nice - I’ve always thought we should more
> actively integrate that into Nutch and/or Tika. Lewis had
> done some work on that as well.
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> -----Original Message-----
> From: BlackIce <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, December 17, 2015 at 4:16 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Anthelion from Yahoo
> 
>> Interesting indeed, in more than one way...  This is just a plug-in, right?
>> So can it be compiled with Nutch 1.11?
>> 
>> On Thu, Dec 17, 2015 at 10:25 AM, Markus Jelsma
>> <[email protected]>
>> wrote:
>> 
>>> Interesting! That triple extractor and wdc parser could be useful
>>> indeed!
>>> It already uses any23. I wonder how easily we could integrate it into
>>> Apache
>>> Tika, and then use it in Nutch! But since it does use any23, I wonder
>>> if it
>>> relies on SAX events, or the HTML body as a whole, which is bad.
>>> 
>>> I am also curious whether the scoring filter supports incremental crawls
>>> as opposed to OPIC. If not, it might not be that interesting.
>>> 
>>> Does anyone know? :)
>>> 
>>> M.
>>> 
>>> 
>>> 
>>> -----Original message-----
>>>> From:Christian Kunz <[email protected]>
>>>> Sent: Thursday 17th December 2015 7:30
>>>> To: [email protected]
>>>> Subject: Re: Anthelion from Yahoo
>>>> 
>>>> Hi Otis,
>>>> 
>>>> haven't tried it yet. I wrote a little article that explains roughly
>>>> how it works:
>>>> 
>>>> http://www.seo-suedwest.de/1398-yahoo-crawler-strukturierte-daten-open-source.html
>>>> 
>>>> If anyone has practical experience with it please let me know.
>>>> 
>>>> Regards,
>>>> Christian
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Otis Gospodnetić [mailto:[email protected]]
>>>> Sent: Thursday, 17 December 2015 03:55
>>>> To: [email protected]
>>>> Subject: Anthelion from Yahoo
>>>> 
>>>> Hi,
>>>> 
>>>> FYI: https://github.com/yahoo/anthelion
>>>> 
>>>> Anyone tried using it yet?
>>>> 
>>>> Otis
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>> 
>>> 
> 
