[WSG] Wild metadata

2005-11-14 Thread Jonathan O'Donnell

Hi DC-General and the Web Standards Group

Here's another half-baked idea that I am trying to straighten out.  I  
would appreciate your feedback and suggestions.  This will be my last  
one for a while, I promise.


** The problem **
On the Web, DC.description and DC.subject are not very effective  
finding aids when the full text is indexed.


** The solution **
Wild metadata, such as anchor text, blog descriptions and folksonomies,
may provide better description and subject (or keyword) metadata.


** Example **
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<link rel="schema.dc" href="http://purl.org/dc/elements/1.1/" />
<link rel="schema.terms" href="http://purl.org/dc/terms/" />
<link rel="DC.subject"
href="http://api.search.yahoo.com/WebSearchService/rss/webSearch.xml?appid=yahoosearchwebrss&amp;query=link:http://jod.id.au/tutorial/naked-metadata.html" />
<link rel="DC.subject"
href="http://del.icio.us/rss/url/e43f0f84e421ed5de166b285eca30468" />
<link rel="DC.description"
href="http://www.blogdigger.com/rssLinkSearch.jsp?link=http://jod.id.au/tutorial/wild-metadata.html" />
</head>
<body>
</body>
</html>


** Background **

At DC-ANZ 2005, David Hawking (Panoptic, CSIRO) convinced me that
DC.Description and DC.Subject metadata aren't very useful finding aids  
when the full text of a Web page is indexed.  He showed a comparison of  
searches based on subject and description metadata versus searches  
based on anchor text alone, and the anchor text search was just as  
effective. [1, 2]


Aside from Web page authors, lots of people spend time indexing and  
categorising Web pages. They build links, write blog entries and tag  
pages in folksonomies. This metadata is wild - it is not crafted or  
controlled by the agency who created the page.  It hasn't been  
commissioned and it represents a variety of world views.  Individually,  
these pieces of metadata may not be very useful.  In numbers, however,  
the irregularities begin to smooth out and the information may be as
good as or better than metadata written by a Web page author.
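
The "smoothing out in numbers" idea can be illustrated with a small sketch. It assumes the tags from several independent taggers have already been collected for one page; the function name, the sample tags and the threshold of two sources are all hypothetical and are not taken from the post.

```python
from collections import Counter


def consensus_terms(tag_sets, min_sources=2):
    """Return terms used by at least min_sources taggers, most used first."""
    counts = Counter()
    for tags in tag_sets:
        # Normalise case and count each term at most once per tagger.
        counts.update({tag.strip().lower() for tag in tags})
    # Sort by descending count, then alphabetically, so the order is stable.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, n in ranked if n >= min_sources]


# Hypothetical tags applied to one page by four different people.
taggers = [
    ["metadata", "DublinCore", "todo"],
    ["Metadata", "folksonomy"],
    ["metadata", "dublincore", "rant"],
    ["folksonomy", "metadata"],
]

print(consensus_terms(taggers))
```

One-off terms such as "todo" and "rant" drop out, while terms that several people chose independently remain. Raising min_sources trades coverage for agreement.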


The quality will not be as good as that of trained librarians applying
metadata via a standardised system and controlled vocabularies.  It will,
however, be as good as or better than that of untrained people applying
metadata to their own pages.  It will also be better than no metadata at all.


** Method **
A rough and ready method consists of finding pages that display anchor  
text, weblog summaries and folksonomy tags for a given page.   
Preference is given to pages that provide results in a well-formed XML  
format, as these assist the harvesting process.


* Anchor text *
Yahoo! provides a good listing of anchor text terms via their ability  
to find pages that link to a specified URL.

The syntax for Yahoo! is:
http://api.search.yahoo.com/WebSearchService/rss/webSearch.xml?appid=yahoosearchwebrss&query=link:http://jod.id.au/tutorial/naked-metadata.html
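
A minimal sketch of building that query address for any page. It assumes the two parameters are "appid" and "query" (the ampersand between them appears to have been lost in the archived text) and that the endpoint is exactly as quoted in this post; the function name is hypothetical, and the service may no longer respond.

```python
from urllib.parse import urlencode

# Endpoint and default appid are copied from the post, not verified.
YAHOO_RSS_ENDPOINT = "http://api.search.yahoo.com/WebSearchService/rss/webSearch.xml"


def yahoo_link_query_url(page_url, appid="yahoosearchwebrss"):
    """Build the RSS search address that lists pages linking to page_url."""
    params = {"appid": appid, "query": "link:" + page_url}
    return YAHOO_RSS_ENDPOINT + "?" + urlencode(params)


url = yahoo_link_query_url("http://jod.id.au/tutorial/naked-metadata.html")
print(url)
```

urlencode percent-encodes the colon and slashes inside the query value, so the printed address looks different from the hand-written one above but carries the same parameters.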


* Weblogs *
Weblog search engines like Blogdigger will show blog entries for a  
given URL. These can be used as descriptions of the page.

The format for Blogdigger is:
http://www.blogdigger.com/rssLinkSearch.jsp?link=http://jod.id.au/tutorial/wild-metadata.html
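
As a rough illustration of using blog entries as descriptions, the sketch below pulls the description text out of each item in a feed. The sample feed is invented and assumes a plain RSS 2.0 layout; the post does not show what Blogdigger actually returns, so the element names may need adjusting for a real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed standing in for a real link-search result.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Link search</title>
    <item>
      <title>Notes on metadata</title>
      <link>http://example.org/blog/1</link>
      <description>A short tutorial on harvesting metadata from the wild.</description>
    </item>
    <item>
      <title>Reading list</title>
      <link>http://example.org/blog/2</link>
      <description>Argues that anchor text can stand in for subject keywords.</description>
    </item>
  </channel>
</rss>"""


def item_descriptions(rss_text):
    """Return the non-empty <description> text of each <item> in the feed."""
    root = ET.fromstring(rss_text)
    descriptions = []
    for item in root.iter("item"):
        text = (item.findtext("description") or "").strip()
        if text:
            descriptions.append(text)
    return descriptions


for description in item_descriptions(SAMPLE_RSS):
    print(description)
```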


* Folksonomies *
I could only find one folksonomy (del.icio.us) that had a syntax for
searching by URL.  Unfortunately, I could not find a simple way to use
this syntax.  Del.icio.us allocates a unique number to each URL.
Therefore, before you can construct the feed address for a page, you
need to discover what that number is.


+   For example, I created the page:
http://jod.id.au/tutorial/wild-metadata.html
+   I then tagged it in Del.icio.us.
+   I then searched for it in Del.icio.us.
+	Del.icio.us told me that this URL could be referenced in RDF format  
at:

http://del.icio.us/rss/url/e43f0f84e421ed5de166b285eca30468
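
The post does not say how that identifier is derived. One possibility, not confirmed anywhere in this thread and not checked against the identifier above, is that it is the MD5 hex digest of the page address. Under that assumption the lookup steps could be skipped, as in this sketch (the function name is hypothetical):

```python
import hashlib


def delicious_url_feed(page_url):
    """Build a per-URL feed address, ASSUMING the identifier is md5(page_url)."""
    digest = hashlib.md5(page_url.encode("utf-8")).hexdigest()
    return "http://del.icio.us/rss/url/" + digest


feed = delicious_url_feed("http://jod.id.au/tutorial/wild-metadata.html")
print(feed)
```

If the assumption is wrong, the printed address will simply not match the one Del.icio.us reported, and the manual steps above remain the only method described here.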


** Harvesting **
It is all well and good to put metadata into a document. You have to be
able to get it out again for it to be of any use.


Both Yahoo! and Del.icio.us provide their results in RSS or Atom  
format.  While this makes the results machine-readable (and machine  
harvestable), it doesn't make it easy for a mere mortal to read it.


I'm not sure if there are DC.metadata harvesters that can parse RSS or  
Atom feeds as metadata.  The possibility exists - I just can't point to  
an example.
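
A minimal sketch of the first harvesting step, under stated assumptions: it only collects the feed addresses from link elements whose rel value starts with "DC.", using Python's standard html.parser module. The class name and the sample head are hypothetical, and fetching and interpreting the feeds themselves is left out.

```python
from html.parser import HTMLParser


class DCLinkHarvester(HTMLParser):
    """Collect (rel, href) pairs from <link> elements whose rel starts with "DC."."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <link ... /> tags, because the default
        # handle_startendtag calls handle_starttag.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = attributes.get("rel") or ""
        href = attributes.get("href")
        if rel.lower().startswith("dc.") and href:
            self.links.append((rel, href))


# Sample head modelled on the example in this post.
SAMPLE_HEAD = """<head>
<link rel="schema.dc" href="http://purl.org/dc/elements/1.1/" />
<link rel="DC.subject" href="http://del.icio.us/rss/url/e43f0f84e421ed5de166b285eca30468" />
<link rel="DC.description" href="http://www.blogdigger.com/rssLinkSearch.jsp?link=http://jod.id.au/tutorial/wild-metadata.html" />
</head>"""

harvester = DCLinkHarvester()
harvester.feed(SAMPLE_HEAD)
for rel, href in harvester.links:
    print(rel, href)
```

The schema.dc declaration is skipped because its rel value does not start with "DC."; only the subject and description links are kept.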


** Advantages **
+	Wild metadata adds multiple voices to a metadata record.  For  
example, wild metadata might exist in different languages.
+	Wild metadata does not cost the Web page author anything, either in  
terms of time or money.


** Disadvantages **
+	New pages will not have any wild metadata.  Wild metadata builds over  
time.
+	Unpopular pages will not gather much wild metadata.  Wild metadata  
clusters to popular pages.


** References **
[1]	David 

Re: [WSG] Wild metadata

2005-11-14 Thread Steven C. Perkins
You might be interested in MKSearch, it searches for DC metadata in the 
head section of web pages.


http://www.mksearch.mkdoc.org/

Regards,

Steven C. Perkins

Re: [WSG] Wild metadata

2005-11-14 Thread Terrence Wood
Jonathan O'Donnell said:

> ** The solution **
> Wild metadata, such as anchor text, blog descriptions and folksonomies
> may provide better description and subject (or keyword) metadata.
> The quality will not be as good as trained librarians applying metadata
> via a standardised system and controlled vocabularies.  It will,
> however, be as good or better than untrained people applying metadata
> to their own pages.  It will also be better than no metadata at all.

You're such a pirate, Jonathan =) I love it. The Web 2.0 fans should too.

Now all you need to do is package this up as a web service and you've got
yourself a web 2.0 company =)

kind regards
Terrence Wood.

**
The discussion list for  http://webstandardsgroup.org/

 See http://webstandardsgroup.org/mail/guidelines.cfm
 for some hints on posting to the list & getting help
**



Re: [WSG] Wild metadata

2005-11-14 Thread Andy Kirkwood, Motive
Hi Jonathan,

> ** The problem **
> On the Web, DC.description and DC.subject are not very effective finding aids
> when the full text is indexed.

I'm unclear as to the purpose of your enquiry. My take on what you have 
outlined is that you're seeking a method of generating metadata records without 
requiring the author to be involved. If this were the approach taken by the 
White House, then George W Bush's biography would be assigned the metadata 
record 'miserable failure'.  http://news.bbc.co.uk/2/hi/americas/3298443.stm 

The benefit of classification by an authority (someone who knows their field) 
is that the classification differentiates content. The more specific the 
classification, the more useful that classification is to a knowledgeable 
searcher.

On the web, broad, common language classification systems are of most value 
when the subject is unknown. For example, as a new web designer I might search 
for 'web design'. As my knowledge increases, my search is likely to become
more sophisticated, for example 'CSS floats' or 'IE box-model hack'.

It would be helpful to define both 'effective' and 'finding aid'. As search is 
such a broad topic it would also be productive to establish context. For 
example, is this a public search or a site specific search? Would metadata 
records be displayed to the user or factored into the page ranking? Etc.

> ** The solution **
> Wild metadata, such as anchor text, blog descriptions and folksonomies may
> provide better description and subject (or keyword) metadata.

If the author-generated metadata records are displayed as part of search
result records, then they provide a succinct description of the content. As to
whether an individual finds that metadata records help in locating content,
the method of display, the relevance of the metadata records to the search
conducted, personal preference, etc. also come into play.

Link text (i.e. the text used to link to one webpage from another) is already
factored into public search engine ranking algorithms, as is the number of
incoming links.

Trackbacks  http://www.motive.co.nz/glossary/trackback.php  are an existing
method of capitalising on blog comments, as the link text and blurb from the
referring webpage are embedded on the source webpage.

With regard to folksonomies, looking through Technorati's tags
http://www.technorati.com/tags/ , content is often classified according to
subjective qualities such as 'rant', 'rambling' and 'random'. It is more likely
that folksonomies constitute a snapshot of the evolution of language. As a
'fringe' term becomes socialised it emerges as part of a formal classification
system. For example the term 'hack', as it pertains to CSS, has been socialised
to the point where it has become a meaningful search term.

I would go so far as to suggest that public search engines have already 
implemented a 'wild metadata' approach to generating search results. Perhaps 
the issue with the value of metadata records lies less with how they are 
generated and more with how people phrase search queries and use the web.

You might find it useful to browse our glossary as it provides further info on
search engines, folksonomies, metadata, etc.:  http://www.motive.co.nz/glossary

Best regards,

-- 
Andy Kirkwood | Creative Director

Motive | web.design.integrity
http://www.motive.co.nz
ph: (04) 3 800 800  fx: (04) 970 9693
mob: 021 369 693
93 Rintoul St, Newtown
PO Box 7150, Wellington South, New Zealand