Software learns new words from Wikipedia

    * 17:12 04 September 2006
    * NewScientist.com news service
    * Tom Simonite


A program that works out the meaning of newly coined words using the online encyclopaedia Wikipedia could help machines understand the slang used in blogs and other informal texts, say researchers.

The program – called Zeitgeist – hunts through Wikipedia looking for entries about new words that do not appear in an online resource called WordNet, an official linguistics tool that is both a dictionary and a thesaurus. WordNet is used by researchers to help computers understand human language. New words, or neologisms, that do not appear in WordNet inevitably leave computers stumped.
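The filtering step described above amounts to a set difference between encyclopaedia titles and dictionary entries. The sketch below illustrates the idea only. The function name `find_neologism_candidates` and the toy data are hypothetical, and the article does not say how Zeitgeist actually reads Wikipedia or WordNet.

```python
def find_neologism_candidates(article_titles, lexicon):
    """Return the lower-cased titles that have no entry in the lexicon."""
    known = {word.lower() for word in lexicon}
    return sorted({title.lower() for title in article_titles} - known)


# Toy data for illustration only; not real Wikipedia or WordNet contents.
titles = ["Pub", "Gastropub", "Gastronomy"]
lexicon = ["pub", "gastronomy"]

# Titles missing from the lexicon are treated as candidate neologisms.
candidates = find_neologism_candidates(titles, lexicon)
print(candidates)
```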

When Zeitgeist finds a Wikipedia entry about a new word, it looks at the links to and from the page, explains lead researcher Tony Veale from University College Dublin, Ireland. "Is there a pattern amongst those linkages that allows us to understand what the new word means?" he asks.

For example, having found an entry for the word "gastropub" – a bar that specialises in food – Zeitgeist can work out the definition for itself thanks to the links to the entries for "pub" and "gastronomy". The program does not read the linked-to pages but relates their titles to entries in the WordNet database.

"Link diarrhoea"

"The link structure reflects linkages between ideas," says Veale, "but people have a tendency to link everything – they get link diarrhoea." To prevent this from confusing the program, Zeitgeist ignores links that are not reciprocated. If the page a link points to does not link back to the neologism, it is discounted.
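A minimal sketch of the reciprocal-link rule, assuming the link graph is available as a dictionary that maps each page title to the set of titles it links to. The names `reciprocal_links` and `known_components`, the toy graph and the toy lexicon are all hypothetical; this is not Zeitgeist's actual implementation.

```python
def reciprocal_links(page, outlinks):
    """Keep only the links from `page` whose target pages link back to it."""
    return sorted(
        target
        for target in outlinks.get(page, set())
        if page in outlinks.get(target, set())
    )


def known_components(page, outlinks, lexicon):
    """Reciprocal link titles that also have an entry in the lexicon."""
    return [title for title in reciprocal_links(page, outlinks) if title in lexicon]


# Toy link graph for illustration only; not real Wikipedia data.
outlinks = {
    "gastropub": {"pub", "gastronomy", "london"},
    "pub": {"gastropub", "beer"},
    "gastronomy": {"gastropub", "food"},
    "london": {"england"},  # does not link back, so it is discounted
}
lexicon = {"pub", "gastronomy", "london", "beer", "food"}

components = known_components("gastropub", outlinks, lexicon)
print(components)
```

In this toy graph the link to "london" is dropped because that page does not link back to "gastropub", leaving only the two titles that hint at the word's meaning.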

One of Zeitgeist's limitations is that links sometimes point to an article that is not part of a neologism's definition. For example, it understands "feminazi" – a word used to characterise a woman as man-hating – as being a combination of the words "feminist" and "Nazi" because of the links on the Wikipedia entry.

But feminazi is actually a term of abuse that has nothing to do with the Nazi doctrine of National Socialism. For that reason, Zeitgeist cannot be relied on to create a dictionary-style definition.

This need not be a problem, however, says Veale. He thinks Zeitgeist's approach is good enough to work out the sentiment of human writing. A link to the term Nazi should make it clear that a neologism carries a negative connotation, he says.
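The article does not describe how sentiment would be computed. As a rough sketch of the idea only, one could flag a neologism as negative when any of its linked titles appears in a hand-made list of negatively loaded terms. The list `NEGATIVE_TERMS` and the function `has_negative_connotation` are assumptions made for this illustration.

```python
# Hypothetical hand-made list of negatively loaded terms (an assumption).
NEGATIVE_TERMS = {"nazi"}


def has_negative_connotation(linked_titles, negative_terms=NEGATIVE_TERMS):
    """True if any linked title is in the list of negative terms."""
    return any(title.lower() in negative_terms for title in linked_titles)


# Link titles taken from the examples mentioned in the article.
feminazi_links = ["Feminist", "Nazi"]
gastropub_links = ["Pub", "Gastronomy"]

print(has_negative_connotation(feminazi_links))
print(has_negative_connotation(gastropub_links))
```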

"We're interested in a computer processing a text and having a way to understand the meaning and intention of words that are new to it," Veale explains. "That's useful for applications from understanding emails to summarising news reports."

Fast-changing lingo

Many companies are interested in such technology to get a feel for what people are saying about their products on blogs and message boards. "They're likely to have a lot of slang and neologisms," explains Veale. "These words emerge too fast to appear in dictionaries or resources like WordNet."

John Carroll, who develops systems that can understand human language at Sussex University, UK, agrees that Wikipedia is a good place to look for new words: "It's such a large and up-to-date resource, I think we'll see it used more for projects like this in the future," he says.

"Zeitgeist is a neat tool," adds Carroll. But he points out that its limitations mean it can handle only 75% of the neologisms it finds in Wikipedia. Another technique is to use the context of a new word to guess at its meaning, he says. Adding that ability to Zeitgeist could make it much more powerful.

Veale presented his work on Zeitgeist at the European Conference on Artificial Intelligence in Riva del Garda, Italy, last week.




