[WSG] Wild metadata
Hi DC-General and the Web Standards Group

Here's another half-baked idea that I am trying to straighten out. I would appreciate your feedback and suggestions. This will be my last one for a while, I promise.

** The problem **

On the Web, DC.description and DC.subject are not very effective finding aids when the full text is indexed.

** The solution **

Wild metadata, such as anchor text, blog descriptions and folksonomies may provide better description and subject (or keyword) metadata.

** Example **

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
      <link rel="schema.dc" href="http://purl.org/dc/elements/1.1/" />
      <link rel="schema.terms" href="http://purl.org/dc/terms/" />
      <link rel="DC.subject" href="http://api.search.yahoo.com/WebSearchService/rss/webSearch.xml?appid=yahoosearchwebrss&amp;query=link:http://jod.id.au/tutorial/naked-metadata.html" />
      <link rel="DC.subject" href="http://del.icio.us/rss/url/e43f0f84e421ed5de166b285eca30468" />
      <link rel="DC.description" href="http://www.blogdigger.com/rssLinkSearch.jsp?link=http://jod.id.au/tutorial/wild-metadata.html" />
    </head>
    <body>
    </body>
    </html>

** Background **

At the DC-ANZ 2005, David Hawking (Panoptic, CSIRO) convinced me that DC.Description and DC.Subject metadata aren't very useful finding aids when the full text of a Web page is indexed. He showed a comparison of searches based on subject and description metadata versus searches based on anchor text alone, and the anchor text search was just as effective. [1, 2]

Aside from Web page authors, lots of people spend time indexing and categorising Web pages. They build links, write blog entries and tag pages in folksonomies. This metadata is wild - it is not crafted or controlled by the agency who created the page. It hasn't been commissioned and it represents a variety of world views. Individually, these pieces of metadata may not be very useful.
In numbers, however, the irregularities begin to smooth out and the information may be as good as or better than metadata written by a Web page author. The quality will not be as good as trained librarians applying metadata via a standardised system and controlled vocabularies. It will, however, be as good as or better than untrained people applying metadata to their own pages. It will also be better than no metadata at all.

** Method **

A rough and ready method consists of finding pages that display anchor text, weblog summaries and folksonomy tags for a given page. Preference is given to pages that provide results in a well-formed XML format, as these assist the harvesting process.

* Anchor text *

Yahoo! provides a good listing of anchor text terms via their ability to find pages that link to a specified URL. The syntax for Yahoo! is:

    http://api.search.yahoo.com/WebSearchService/rss/webSearch.xml?appid=yahoosearchwebrss&query=link:http://jod.id.au/tutorial/naked-metadata.html

* Weblogs *

Weblog search engines like Blogdigger will show blog entries for a given URL. These can be used as descriptions of the page. The format for Blogdigger is:

    http://www.blogdigger.com/rssLinkSearch.jsp?link=http://jod.id.au/tutorial/wild-metadata.html

* Folksonomies *

I could only find one folksonomy (del.icio.us) that had a syntax for searching by URL. Unfortunately, I could not find a simple way to use this syntax. Del.icio.us allocates a unique number to each URL. Therefore, before you can construct the feed URL, you need to discover what that number is.

+ For example, I created the page: http://jod.id.au/tutorial/wild-metadata.html
+ I then tagged it in Del.icio.us.
+ I then searched for it in Del.icio.us.
+ Del.icio.us told me that this URL could be referenced in RDF format at: http://del.icio.us/rss/url/e43f0f84e421ed5de166b285eca30468

** Harvesting **

It is all well and good to put metadata into a document. You have to be able to get it out again for it to be any use. Both Yahoo!
and Del.icio.us provide their results in RSS or Atom format. While this makes the results machine-readable (and machine harvestable), it doesn't make it easy for a mere mortal to read it. I'm not sure if there are DC.metadata harvesters that can parse RSS or Atom feeds as metadata. The possibility exists - I just can't point to an example.

** Advantages **

+ Wild metadata adds multiple voices to a metadata record. For example, wild metadata might exist in different languages.
+ Wild metadata does not cost the Web page author anything, either in terms of time or money.

** Disadvantages **

+ New pages will not have any wild metadata. Wild metadata builds over time.
+ Unpopular pages will not gather much wild metadata. Wild metadata clusters to popular pages.

** References **

[1] David
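
As a rough illustration of the Method and Harvesting sections in the post above, here is a minimal Python sketch, not the author's implementation. It builds the three feed URLs for a page and pulls item titles out of a feed. The function names (wild_metadata_feeds, item_titles) and the sample feed are hypothetical. The endpoints are copied from the post and are not checked for availability. The post only says Del.icio.us "allocates a unique number"; treating that number as the MD5 hex digest of the page URL is an assumption the post does not confirm. Unlike the URLs in the post, the query values here are percent-encoded.

```python
import hashlib
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoints as quoted in the post (2005); availability is not checked here.
YAHOO_RSS = "http://api.search.yahoo.com/WebSearchService/rss/webSearch.xml"
BLOGDIGGER_RSS = "http://www.blogdigger.com/rssLinkSearch.jsp"
DELICIOUS_RSS = "http://del.icio.us/rss/url/"

# Hypothetical sample data, invented for illustration only.
SAMPLE_FEED = (
    '<rss version="2.0"><channel><title>links</title>'
    "<item><title>Wild metadata tutorial</title></item>"
    "<item><title>Folksonomy notes</title></item>"
    "</channel></rss>"
)


def wild_metadata_feeds(page_url):
    """Return feed URLs for one page, grouped by the DC element they would fill."""
    yahoo = YAHOO_RSS + "?" + urlencode(
        {"appid": "yahoosearchwebrss", "query": "link:" + page_url}
    )
    blogdigger = BLOGDIGGER_RSS + "?" + urlencode({"link": page_url})
    # ASSUMPTION: the Del.icio.us identifier is the MD5 hex digest of the URL.
    digest = hashlib.md5(page_url.encode("utf-8")).hexdigest()
    return {
        "DC.subject": [yahoo, DELICIOUS_RSS + digest],
        "DC.description": [blogdigger],
    }


def _local_name(tag):
    """Strip an XML namespace such as '{http://purl.org/rss/1.0/}item'."""
    return tag.rsplit("}", 1)[-1]


def item_titles(feed_text):
    """Collect the title of every item in an RSS 2.0 or RSS 1.0 (RDF) feed."""
    root = ET.fromstring(feed_text)
    titles = []
    for element in root.iter():
        if _local_name(element.tag) != "item":
            continue
        for child in element:
            if _local_name(child.tag) == "title" and child.text:
                titles.append(child.text.strip())
    return titles


if __name__ == "__main__":
    feeds = wild_metadata_feeds("http://jod.id.au/tutorial/wild-metadata.html")
    for element_name, urls in feeds.items():
        for url in urls:
            print(element_name, url)
    print(item_titles(SAMPLE_FEED))
```

Matching on the local tag name rather than a fixed namespace is a design choice so that the same function can read both plain RSS 2.0 and the namespaced RDF flavour mentioned in the post.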
Re: [WSG] Wild metadata
You might be interested in MKSearch, it searches for DC metadata in the head section of web pages.

http://www.mksearch.mkdoc.org/

Regards,
Steven C. Perkins

At 03:16 AM 11/14/2005, you wrote:
> Hi DC-General and the Web Standards Group
> Here's another half-baked idea that I am trying to straighten out. I would appreciate your feedback and suggestions.
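
To make the idea of harvesting DC metadata from the head section concrete, here is a minimal Python sketch using only the standard library. It is not how MKSearch works (the thread does not describe its internals); the class and function names (DCLinkCollector, dc_links) are hypothetical, and the sample page is a cut-down copy of the example in the original post. It collects every link element whose rel value starts with "DC.", which is where the wild metadata feeds would sit.

```python
from html.parser import HTMLParser

# Cut-down copy of the example page from the original post.
SAMPLE_PAGE = """<html><head>
<link rel="schema.dc" href="http://purl.org/dc/elements/1.1/" />
<link rel="DC.subject" href="http://del.icio.us/rss/url/e43f0f84e421ed5de166b285eca30468" />
<link rel="DC.description" href="http://www.blogdigger.com/rssLinkSearch.jsp?link=http://jod.id.au/tutorial/wild-metadata.html" />
</head><body></body></html>"""


class DCLinkCollector(HTMLParser):
    """Hypothetical helper: gather (rel, href) pairs from DC.* link elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> are also routed here.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = attributes.get("rel") or ""
        href = attributes.get("href")
        # "schema.dc" declarations are skipped; only DC.* relations are kept.
        if rel.lower().startswith("dc.") and href:
            self.links.append((rel, href))


def dc_links(html_text):
    """Return the DC.* link relations found in a page, in document order."""
    collector = DCLinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


if __name__ == "__main__":
    for rel, href in dc_links(SAMPLE_PAGE):
        print(rel, href)
```

A harvester built along these lines would still need to fetch each href and read the feed behind it before the wild metadata could be indexed.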
Re: [WSG] Wild metadata
Jonathan O'Donnell said:
> ** The solution **
> Wild metadata, such as anchor text, blog descriptions and folksonomies may provide better description and subject (or keyword) metadata.
>
> The quality will not be as good as trained librarians applying metadata via a standardised system and controlled vocabularies. It will, however, be as good or better than untrained people applying metadata to their own pages. It will also be better than no metadata at all.

You're such a pirate Jonathan =) I love it. The Web 2.0 fans should too. Now all you need to do is package this up as a web service and you've got yourself a web 2.0 company =)

kind regards
Terrence Wood.

**
The discussion list for http://webstandardsgroup.org/
See http://webstandardsgroup.org/mail/guidelines.cfm for some hints on posting to the list & getting help
**
Re: [WSG] Wild metadata
Hi Jonathan,

> ** The problem **
> On the Web, DC.description and DC.subject are not very effective finding aids when the full text is indexed.

I'm unclear as to the purpose of your enquiry. My take on what you have outlined is that you're seeking a method of generating metadata records without requiring the author to be involved. If this were the approach taken by the White House, then George W Bush's biography would be assigned the metadata record 'miserable failure'.
http://news.bbc.co.uk/2/hi/americas/3298443.stm

The benefit of classification by an authority (someone who knows their field) is that the classification differentiates content. The more specific the classification, the more useful that classification is to a knowledgeable searcher.

On the web, broad, common-language classification systems are of most value when the subject is unknown. For example, as a new web designer I might search for 'web design'. As my knowledge increases, my search is likely to become more sophisticated, for example 'CSS floats' or 'IE box-model hack'.

It would be helpful to define both 'effective' and 'finding aid'. As search is such a broad topic it would also be productive to establish context. For example, is this a public search or a site-specific search? Would metadata records be displayed to the user or factored into the page ranking? Etc.

> ** The solution **
> Wild metadata, such as anchor text, blog descriptions and folksonomies may provide better description and subject (or keyword) metadata.

If the author-generated metadata records are displayed as part of a search result, then they provide a succinct description of the content. As to whether an individual finds that metadata records support locating content, the method of display, the relevance of the metadata records to the search conducted, personal preference, etc. also come into play.

Link text (i.e. the text used to link to one webpage from another) is already factored into public search engine ranking algorithms, as is the number of incoming links.

Trackbacks <http://www.motive.co.nz/glossary/trackback.php> are an existing method of capitalising on blog comments, as the link text and blurb from the referring webpage are embedded on the source webpage.

With regard to folksonomies, looking through Technorati's tags <http://www.technorati.com/tags/>, content is often classified according to subjective qualities such as 'rant', 'rambling' and 'random'. It is more likely that folksonomies constitute a snapshot of the evolution of language. As a 'fringe' term becomes socialised it emerges as part of a formal classification system. For example the term 'hack', as it pertains to CSS, has been socialised to the point where it has become a meaningful search term.

I would go so far as to suggest that public search engines have already implemented a 'wild metadata' approach to generating search results. Perhaps the issue with the value of metadata records lies less with how they are generated and more with how people phrase search queries and use the web.

You might find it useful to browse our glossary as it provides further info on search engines, folksonomies, metadata, etc. <http://www.motive.co.nz/glossary>

Best regards,
--
Andy Kirkwood | Creative Director
Motive | web.design.integrity
http://www.motive.co.nz
ph: (04) 3 800 800 fx: (04) 970 9693 mob: 021 369 693 93 Rintoul St, Newtown PO Box 7150, Wellington South, New Zealand