On Mon, 2008-01-28 at 12:36 +0100, Alexander Larsson wrote:
> Attached is an update to the shared mime specification that adds two
> things:

Here is an updated version based on some feedback from David Faure.
Apart from typo fixes and the like, the differences are:

* No special rule for magic priority 80, as this is no longer needed with
the new glob conflict resolution method.

* Renamed the glob priorities to "weight" to avoid the misunderstanding
that the magic and glob priorities are in the same global priority
scheme.

* Added description of new file formats

* Emphasized that the icon and generic-icon data is available in the
per-type XML file (as well as in the icons and generic-icons files)
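
To make the new conflict resolution easier to follow, here is a minimal Python sketch of how a consumer might read the globs2 format and apply the weights, as I read the steps in the patch below. It is only an illustration, not reference code: the names parse_globs2, guess_type, sniffed and is_subclass are made up for this example, the *.doc lines in the usage note are hypothetical data, and falling back to the highest-weight glob when magic finds nothing is my interpretation of the text.

```python
from fnmatch import fnmatchcase


def parse_globs2(text):
    """Parse globs2-style lines of the form weight:mime/type:pattern."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the generated-file comments
        weight, mime, pattern = line.split(":", 2)
        entries.append((int(weight), mime, pattern))
    return entries


def guess_type(filename, entries, sniffed=None, is_subclass=lambda a, b: False):
    """Pick a MIME type from glob matches plus an optional magic result.

    sniffed is the type found by magic sniffing, or None if nothing matched.
    is_subclass(a, b) is a hypothetical hook that says whether type a
    inherits from type b; by default no type inherits from another.
    """
    matches = [(w, m) for w, m, p in entries if fnmatchcase(filename, p)]
    types = {m for _, m in matches}
    if len(types) == 1:
        # All matching globs agree, so no sniffing is needed.
        return next(iter(types))
    if not matches:
        # No glob match: use the magic result, else the binary default.
        return sniffed or "application/octet-stream"
    if sniffed is not None:
        # Conflicting globs: prefer one that equals or inherits the magic type.
        for _, mime in matches:
            if mime == sniffed or is_subclass(mime, sniffed):
                return mime
    # Otherwise the glob with the highest weight wins.
    return max(matches, key=lambda wm: wm[0])[1]
```

For instance, with the two text/x-diff lines from the patch plus two hypothetical conflicting *.doc entries (weight 60 for application/msword, weight 50 for text/plain), guess_type("foo.doc", entries) picks the weight-60 type, while passing sniffed="text/plain" picks text/plain instead.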


--- shared-mime-info-spec.xml	1 Dec 2005 18:53:26 -0000	1.56
+++ shared-mime-info-spec.xml	29 Jan 2008 09:06:36 -0000
@@ -1,8 +1,8 @@
 <?xml version="1.0" standalone="no"?>
 <!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
-"/usr/share/sgml/docbook/dtd/xml/4.1.2/docbookx.dtd" [
-  <!ENTITY updated "1 December 2005">
-  <!ENTITY version "0.15">
+"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
+  <!ENTITY updated "25 January 2008">
+  <!ENTITY version "0.16">
 ]>
 <article id="index">
 
@@ -159,7 +159,10 @@ changes take effect.
 The files created by <command>update-mime-database</command> are:
 			<itemizedlist>
 				<listitem><para>
-<filename>&lt;MIME&gt;/globs</filename> (contains a mapping from names to MIME types)
+<filename>&lt;MIME&gt;/globs</filename> (contains a mapping from names to MIME types) [deprecated in favour of globs2]
+				</para></listitem>
+				<listitem><para>
+<filename>&lt;MIME&gt;/globs2</filename> (contains a mapping from names to MIME types and glob weight)
 				</para></listitem>
 				<listitem><para>
 <filename>&lt;MIME&gt;/magic</filename> (contains a mapping from file contents to MIME types)
@@ -171,17 +174,24 @@ The files created by <command>update-mim
 <filename>&lt;MIME&gt;/aliases</filename> (contains a mapping from aliases to MIME types)
 				</para></listitem>
 				<listitem><para>
+<filename>&lt;MIME&gt;/icons</filename> (contains a mapping from MIME types to icons)
+				</para></listitem>
+				<listitem><para>
+<filename>&lt;MIME&gt;/generic-icons</filename> (contains a mapping from MIME types to generic icons)
+				</para></listitem>
+				<listitem><para>
 <filename>&lt;MIME&gt;/XMLnamespaces</filename> (contains a mapping from XML
 (namespaceURI, localName) pairs to MIME types)
 				</para></listitem>
  				<listitem><para>
 <filename>&lt;MIME&gt;/MEDIA/SUBTYPE.xml</filename> (one file for each MIME
-type, giving details about the type)
+type, giving details about the type, including comment, icon and generic-icon)
 				</para></listitem>
 				<listitem><para>
-<filename>&lt;MIME&gt;/mime.cache</filename> (contains the same information as the <filename>globs</filename>,
-<filename>magic</filename>, <filename>subclasses</filename>, <filename>aliases</filename> and 
-<filename>XMLnamespaces</filename> files, in a binary, mmappable format)
+<filename>&lt;MIME&gt;/mime.cache</filename> (contains the same information as the <filename>globs2</filename>,
+<filename>magic</filename>, <filename>subclasses</filename>, <filename>aliases</filename>,
+<filename>icons</filename>, <filename>generic-icons</filename> and <filename>XMLnamespaces</filename> files,
+in a binary, mmappable format)
 				</para></listitem>
 			</itemizedlist>
 The format of these generated files and the source files in <filename>packages</filename>
@@ -213,7 +223,9 @@ and in any order:
 				<listitem><para>
 <userinput>glob</userinput> elements have a <userinput>pattern</userinput> attribute. Any file
 whose name matches this pattern will be given this MIME type (subject to conflicting rules in
-other files, of course).
+other files, of course). There is also an optional <userinput>weight</userinput> attribute which
+is used when resolving conflicts with other glob matches. The default weight value is 50, and
+the maximum is 100.
 		</para>
 		<para>
 KDE's glob system replaces GNOME's and ROX's ext/regex fields, since it
@@ -305,6 +317,24 @@ There may be many of these elements with
 to provide the text in multiple languages, although these should only be used if absolutely neccessary.
 				</para></listitem>
 				<listitem><para>
+<userinput>icon</userinput> elements specify the icon to be used for this particular mime-type, given
+by the <userinput>name</userinput> attribute. Generally the icon used for a mimetype is created
+based on the mime-type by mapping "/" characters to "-", but users can override this by using
+the <userinput>icon</userinput> element to customize the icon for a particular mimetype.
+This element is not used in the system database, only in the user-overridden database.
+Only one <userinput>icon</userinput> element is allowed.
+				</para></listitem>
+				<listitem><para>
+<userinput>generic-icon</userinput> elements specify the icon to use as a generic icon for this
+particular mime-type, given by the <userinput>name</userinput> attribute. This is used if there
+is no specific icon (see <userinput>icon</userinput> for how these are found). These are
+used for categories of similar types (like spreadsheets or archives) that can use a common icon.
+The Icon Naming Specification lists a set of such icon names. If this element is not specified
+then the mimetype is used to generate the generic icon by using the top-level media type (e.g.
+"video" in "video/ogg") and appending "-x-generic" (i.e. "video-x-generic" in the previous example). 
+Only one <userinput>generic-icon</userinput> element is allowed.
+				</para></listitem>
+				<listitem><para>
 <userinput>root-XML</userinput> elements have <userinput>namespaceURI</userinput> 
 and <userinput>localName</userinput> attributes. If a file is identified as being an XML file,
 these rules allow a more specific MIME type to be chosen based on the namespace and localname
@@ -374,14 +404,29 @@ The example source file given above woul
 	<sect2>
 		<title>The glob files</title>
 		<para>
-This is a simple list of lines containing a MIME type and pattern, separated by a colon.
+The globs2 file is a simple list of lines containing weight, MIME type and pattern, separated by colons.
+The lines are ordered by glob weight.
+For example:
+			<programlisting><![CDATA[
+# This file was automatically generated by the
+# update-mime-database command. DO NOT EDIT!
+...
+55:text/x-diff:*.patch
+50:text/x-diff:*.diff
+...
+]]></programlisting>
+		</para>
+		<para>
+The globs file is a simple list of lines containing a MIME type and pattern, separated by a colon. It is
+deprecated in favour of the globs2 file, which also lists the weight of the glob rule.
+The lines are ordered by glob weight.
 For example:
 			<programlisting><![CDATA[
 # This file was automatically generated by the
 # update-mime-database command. DO NOT EDIT!
 ...
-text/x-diff:*.diff
 text/x-diff:*.patch
+text/x-diff:*.diff
 ...
 ]]></programlisting>
 		</para>
@@ -512,30 +557,43 @@ For example:
 <screen>
 http://www.w3.org/1999/xhtml html application/xhtml+xml
 </screen>
-The lines are sorted (using strcmp) and there are no lines with the same namespaceURI and
+The lines are sorted (using strcmp in the C locale) and there are no lines with the same namespaceURI and
 localName in one file. If the localName was empty then there will be two spaces following
 the namespaceURI.
 		</para>
 	</sect2>
 	<sect2>
+		<title>The icon files</title>
+		<para>
+The <filename>icons</filename> and <filename>generic-icons</filename> files are lists of lines in the form:
+<screen>MIME-Type ":" icon-name "\n"</screen>
+For example:
+<screen>
+application/msword:x-office-document
+</screen>
+		</para>
+	</sect2>
+	<sect2>
 		<title>The mime.cache files</title>
 		<para>
 The <filename>mime.cache</filename> files contain the same information as the 
-<filename>globs</filename>, <filename>magic</filename>, <filename>subclasses</filename>, 
+<filename>globs2</filename>, <filename>magic</filename>, <filename>subclasses</filename>, 
 <filename>aliases</filename> and <filename>XMLnamespaces</filename> files, in a binary, 
 mmappable format:
 </para>
 <programlisting>
 Header:
 2			CARD16		MAJOR_VERSION	1	
-2			CARD16		MINOR_VERSION	0	
+2			CARD16		MINOR_VERSION	1	
 4			CARD32		ALIAS_LIST_OFFSET
 4			CARD32		PARENT_LIST_OFFSET
 4			CARD32		LITERAL_LIST_OFFSET
-4			CARD32		SUFFIX_LIST_OFFSET
+4			CARD32		REVERSE_SUFFIX_TREE_OFFSET
 4			CARD32		GLOB_LIST_OFFSET
 4			CARD32		MAGIC_LIST_OFFSET
 4			CARD32		NAMESPACE_LIST_OFFSET
+4			CARD32		ICONS_LIST_OFFSET
+4			CARD32		GENERIC_ICONS_LIST_OFFSET
 
 AliasList:
 4			CARD32		N_ALIASES
@@ -564,6 +622,7 @@ LiteralList:
 LiteralEntry:
 4			CARD32		LITERAL_OFFSET
 4			CARD32		MIME_TYPE_OFFSET
+4			CARD32		WEIGHT
 
 GlobList:
 4			CARD32		N_GLOBS
@@ -572,16 +631,18 @@ GlobList:
 GlobEntry:
 4			CARD32		GLOB_OFFSET
 4			CARD32		MIME_TYPE_OFFSET
+4			CARD32		WEIGHT
 
-SuffixTree:
+ReverseSuffixTree:
 4			CARD32		N_ROOTS
 4	 		CARD32		FIRST_ROOT_OFFSET
 
-SuffixTreeNode:
+ReverseSuffixTreeNode:
 4			CARD32		CHARACTER
 4			CARD32		MIME_TYPE_OFFSET
 4			CARD32		N_CHILDREN			
 4			CARD32		FIRST_CHILD_OFFSET
+4			CARD32		WEIGHT
 
 MagicList:
 4			CARD32		N_MATCHES
@@ -612,12 +673,22 @@ NamespaceEntry:
 4			CARD32		NAMESPACE_URI_OFFSET
 4			CARD32		LOCAL_NAME_OFFSET
 4			CARD32		MIME_TYPE_OFFSET
+
+GenericIconsList:
+IconsList:
+4			CARD32		N_ICONS
+8*N_ICONS		IconListEntry
+
+IconListEntry:
+4			CARD32		MIME_TYPE_OFFSET
+4			CARD32		ICON_NAME_OFFSET
 </programlisting>
 <para>
 Lists in the file are sorted, to enable binary searching. The list of 
 aliases is sorted by alias, the list of literal globs is sorted by the 
 literal. The SuffixTreeNode siblings are sorted by character. 
-The list of namespaces is sorted by namespace uri.
+The list of namespaces is sorted by namespace uri. The list of icons
+is sorted by mimetype.
 </para>
 <para>
 Identical globs are stored in the suffix tree by appending suffix
@@ -698,31 +769,47 @@ If a MIME type is provided explicitly (e
 email attachment, an extended attribute or some other means) then that should
 be used instead of guessing.
 				</para></listitem>
+
 				<listitem><para>
-If no explicit type is present, magic rules with a priority of 80 or more
-should be tried next. These rules have a very low false-positive rate.
-				</para></listitem>
-				<listitem><para>
-If there is still no match, the glob rules should be applied to the name to
-get the type.
+Otherwise, start by doing a glob match of the filename. If one or more globs match, and all the
+matching globs result in the same mimetype, use that mimetype as the result.
 				</para></listitem>
+				
 				<listitem><para>
-If no glob rules match, the remaining magic rules should be tried next.
+If the glob matching fails or results in multiple conflicting mimetypes, read the
+contents of the file and do magic sniffing on it. If no magic rule matches the data (or if
+the content is not available), use the default type of application/octet-stream for
+binary data, or text/plain for textual data. If there was no glob match, use the magic match
+as the result.
+				</para><para>
+Note: Checking the first 32 bytes of the file for ASCII control characters is
+a good way to guess whether a file is binary or text, but note that files with high-bit-set
+characters should still be treated as text since these can appear in UTF-8 text,
+unlike control characters.
+				</para></listitem>
+				
+				<listitem><para>
+If any of the mimetypes resulting from a glob match is equal to or a subclass of
+the result from the magic sniffing, use this as the result. This allows us for example to
+distinguish text files called "foo.doc" from MS-Word files with the same name, as the
+magic match for the MS-Word file would be application/x-ole-storage which the MS-Word type
+inherits.
 				</para></listitem>
+				
 				<listitem><para>
-If nothing matches, the default type of application/octet-stream should be used
-for binary data, or text/plain for textual data. Checking the first 32
-bytes of the file for ASCII control characters is a good way to guess
-whether a file is binary or text, but note that files with high-bit-set
-characters should still be treated as text since these can appear in UTF-8
-text, unlike control characters.
+Otherwise use the result of the glob match that has the highest weight.
 				</para></listitem>
 			</itemizedlist>
 		</para>
 		<para>
-There are several reasons for checking most of the glob patterns before the magic.
-Some applications don't check the magic at all, and this makes it more likely
-that both will get the same type. Users can easily understand why calling their
+There are several reasons for checking the glob patterns before the magic.
+First of all, magic sniffing is very expensive, as reading the contents of the files
+causes a lot of seeks. Secondly, some applications don't check
+the magic at all (sometimes the content is not available or too slow to read), and this
+makes it more likely that both will get the same type.
+		</para>
+		<para>
+Also, users can easily understand why calling their
 text file <filename>README.mp3</filename> makes the system think it's an MP3,
 whereas they have trouble understanding why their computer thinks
 <filename>README.txt</filename> is a PostScript file. If the system guesses wrongly,
_______________________________________________
xdg mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/xdg
