Hi,
Maybe some people will find this posting interesting.
From my point of view, web spam is one of the biggest issues for Nutch in whole-web crawls.

Greetings,
Stefan



During AIRWeb'06 we announced the availability of the collection.

We are currently planning a Web Spam challenge based on the dataset we
have built. I assume most of you will be interested in this, so I have
moved the "webspam-volunteers" list to "webspam-announces". If you do
not want to be in this new "webspam-announces" list, please send me an
e-mail.

This was shown during AIRWeb in Seattle:

.............................................................

Web Spam Collection Available
August 10th, 2006

We are pleased to announce the availability of a public collection for
research on Web spam. This collection is the result of efforts by a
team of volunteers:

Thiago Alves    Antonio Gulli            Tamas Sarlos
Luca Becchetti  Zoltan Gyongyi           Mike Thelwall
Paolo Boldi     Thomas Lavergn           Belle Tseng
Paul Chirita    Alex Ntoulas             Tanguy Urvoy
Mirel Cosulschi Josiane-Xavier Parreira  Wenzhong Zhao
Brian Davison   Xiaoguang Qi
Pascal Filoche  Massimo Santini

The corpus is a large set of Web pages in 11,000 .uk hosts
downloaded in May 2006 by the Laboratory of Web Algorithmics,
Università degli Studi di Milano. The labelling process was
coordinated by Carlos Castillo, working at the Algorithmic Engineering
group at Università di Roma "La Sapienza". The project was funded
by the DELIS project (Dynamically Evolving, Large Scale Information
Systems).

Volunteers were provided with a set of guidelines and were asked to
mark a set of hosts as either normal, spam, or borderline. The
collection includes about 6,700 judgments made by the volunteers and
can be used for testing link-based and content-based Web spam
detection and demotion techniques.

More information is available on our Web page, including the
guidelines given to the human judges, the instructions for obtaining
the links and contents of the pages in this collection, and the
contact information for questions and comments.

http://aeserver.dis.uniroma1.it/webspam/

If you use this data set, please subscribe to our mailing list by
sending an e-mail to [EMAIL PROTECTED]

--
Carlos Castillo
Università di Roma "La Sapienza"
Rome, ITALY




