formats.apt

jukka Tue, 16 Sep 2008 16:20:55 -0700

Author: jukka
Date: Tue Sep 16 16:19:53 2008
New Revision: 696097

URL: http://svn.apache.org/viewvc?rev=696097&view=rev
Log:
TIKA-157: List all the document formats supported by Tika


Section on all the OLE 2 Compound Document formats.

Modified:
    incubator/tika/trunk/src/site/apt/formats.apt

Modified: incubator/tika/trunk/src/site/apt/formats.apt
URL: http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/apt/formats.apt?rev=696097&r1=696096&r2=696097&view=diff
==============================================================================
--- incubator/tika/trunk/src/site/apt/formats.apt (original)
+++ incubator/tika/trunk/src/site/apt/formats.apt Tue Sep 16 16:19:53 2008
@@ -21,6 +21,115 @@
 
    This page lists all the document formats supported by Apache Tika.
 
+* Microsoft's OLE 2 Compound Document format
+
+   A number of Microsoft applications, most notably the Microsoft Office
+   suite, use the generic OLE 2 Compound Document format as the basis of
+   their document formats. Tika uses {{{http://poi.apache.org/}Apache POI}}
+   to support a number of these formats.
+
+   The OLE2 Compound Document format is designed for use with random access
+   files, and so the input stream passed to a Tika parser needs to be spooled
+   in memory or in a temporary file depending on the size of the document.
+   See {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for an
+   effort to avoid this extra temporary file if the input document already
+   comes from a file.
+
+   In addition to the shared base format, there is also a shared set of
+   metadata in typical OLE2 documents. Tika uses the
+   {{{http://poi.apache.org/hpsf/}HPSF library}} from POI to parse these
+   property sets and exposes them as the following document metadata:
+
+      * <<<TITLE>>> Title
+      * <<<SUBJECT>>> Subject
+      * <<<AUTHOR>>> Author
+      * <<<KEYWORDS>>> Keywords
+      * <<<COMMENTS>>> Comments
+      * <<<TEMPLATE>>> Template
+      * <<<LAST_AUTHOR>>> Last Saved By
+      * <<<REVISION_NUMBER>>> Revision Number
+      * <<<LAST_PRINTED>>> Last Printed
+      * <<<LAST_SAVED>>> Last Saved Time/Date
+      * <<<PAGE_COUNT>>> Number of Pages
+      * <<<WORD_COUNT>>> Number of Words
+      * <<<CHARACTER_COUNT>>> Number of Characters
+      * <<<APPLICATION_NAME>>> Name of Creating Application
+
+   Note that in practice the metadata in many documents is either missing,
+   incomplete or even incorrect, so a client application should not rely
+   too much on this information.
+
+   Support for the new Office Open XML format used by Microsoft Office
+   version 2007 is pending a POI upgrade. The current status is recorded in
+   {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}.
+
+   The generic OLE2 Compound Document format is automatically detected using
+   a magic number, and further parsing can automatically determine the more
+   specific document format. Tika also knows a number of common glob patterns
+   like <<<*.doc>>> and <<<*.ppt>>> for these formats.
+
+   The supported OLE 2 Compound Document formats are:
+
+   [Microsoft Excel (application/vnd.ms-excel)]
+    Excel spreadsheet support is available in all versions of Tika and is
+    based on the {{{http://poi.apache.org/hssf/}HSSF library}} from POI.
+
+    The Excel parser in Tika uses the
+    {{{http://poi.apache.org/hssf/how-to.html#event_api}HSSF event API}} and
+    is able to extract much of the document structure, including all
+    (non-empty) worksheets and their table structures. Formula results are
+    extracted as stored in the Excel file, and cell links are exposed as
+    XHTML links. These features were added in Tika version 0.2.
+
+    Cell comments and formatting are currently not supported. See
+    {{{https://issues.apache.org/jira/browse/TIKA-148}TIKA-148}} and
+    {{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}} for the
+    respective issues.
+
+    See the {{{xref-test/org/apache/tika/parser/microsoft/ExcelParserTest.html}ExcelParserTest}}
+    test case for an example of parsing Microsoft Excel files.
+
+   [Microsoft Word (application/msword)]
+    Word document support is available in all versions of Tika and is based
+    on the {{{http://poi.apache.org/hwpf/}HWPF library}} from POI.
+
+    The Word parser uses the
+    {{{http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html}WordExtractor}}
+    class from HWPF to extract document content as a sequence of paragraphs.
+
+    See the {{{xref-test/org/apache/tika/parser/microsoft/WordParserTest.html}WordParserTest}}
+    test case for an example of parsing Microsoft Word files.
+
+   [Microsoft PowerPoint (application/vnd.ms-powerpoint)]
+    PowerPoint presentation support is available in all versions of Tika and
+    is based on the {{{http://poi.apache.org/hslf/}HSLF library}} from POI.
+
+    The PowerPoint parser uses the
+    {{{http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html}PowerPointExtractor}}
+    class from HSLF to extract presentation content as a single paragraph.
+
+    See the {{{xref-test/org/apache/tika/parser/microsoft/PowerPointParserTest.html}PowerPointParserTest}}
+    test case for an example of parsing Microsoft PowerPoint files.
+
+   [Microsoft Visio (application/vnd.visio)]
+    Visio diagram support was added in Tika version 0.2 and is based on the
+    {{{http://poi.apache.org/hdgf/}HDGF library}} from POI.
+
+    The Visio parser uses the
+    {{{http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html}VisioTextExtractor}}
+    class from HDGF to extract diagram content as a sequence of paragraphs.
+
+   [Microsoft Outlook (application/vnd.ms-outlook)]
+    Outlook message support was added in Tika version 0.2 and is based on the
+    {{{http://poi.apache.org/hsmf/}HSMF library}} from POI.
+
+    The Outlook parser extracts the subject of the message and the From,
+    To, Cc, and Bcc addresses (formatted for display) along with the body
+    text of text/plain messages. The <<<AUTHOR>>>, <<<TITLE>>> and
+    <<<SUBJECT>>> metadata properties are set explicitly, overriding
+    potential generic document metadata retrieved from OLE2 property sets.
+
 * Compression formats
 
    General purpose compression formats are used to reduce the size of
@@ -44,6 +153,8 @@
    stage. Only the text content extracted by the second stage parser is
    returned to the client application.
 
+   The supported compression formats are:
+
    [gzip compression (application/x-gzip)]
     {{{http://en.wikipedia.org/wiki/Gzip}Gzip}} support was added in
     Tika version 0.2 and is based on the
@@ -81,123 +192,6 @@
    [Java jar archives]
     TODO
 
-   [Microsoft Word (application/msword)]
-    Tika uses the {{{http://poi.apache.org/hwpf/}HWPF}} API in
-    {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
-    Word documents. Support for Microsoft Word was added in Tika 0.1.
-
-    The Word parser in Tika simply uses the POI
-    {{{http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html}WordExtractor}}
-    class to extract text paragraphs from Word documents. Support for more
-    complex content structures is not yet implemented; see
-    {{{https://issues.apache.org/jira/browse/TIKA-123}TIKA-123}} for this
-    issue.
-
-    Generic Microsoft Office document properties like title, author, and
-    keywords are returned as metadata properties.
-
-    Support for the new XML-based Word 2007 format is pending for a POI
-    upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
-    for the current status of this issue.
-
-    Microsoft Word documents are automatically detected based on a magic
-    header or a glob pattern.
-
-    For an example of parsing Microsoft Word files, see the
-    {{{xref-test/org/apache/tika/parser/microsoft/WordParserTest.html}WordParserTest}}
-    test case.
-
-   [Microsoft Excel (application/vnd.ms-excel)]
-    Tika uses the {{{http://poi.apache.org/hssf/}HSSF}} API in
-    {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
-    Excel spreadsheets. Support for Microsoft Excel was added in Tika 0.1.
-
-    The Excel parser in Tika uses the HSSF event model and is able to recreate
-    much of the document structure, including all (non-empty) worksheets and
-    their table structures. Formula results are extracted as stored in the
-    Excel file, and cell links are exposed as XHTML links. These features
-    were added in Tika 0.2.
-
-    Cell comments and formatting are currently not supported. See
-    {{{https://issues.apache.org/jira/browse/TIKA-148}TIKA-148}} and
-    {{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}} for the
-    respective issues.
-
-    Generic Microsoft Office document properties like title, author, and
-    keywords are returned as metadata properties.
-
-    Support for the new XML-based Excel 2007 format is pending for a POI
-    upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
-    for the current status of this issue.
-
-    Microsoft Excel spreadsheets are automatically detected based on a magic
-    header or a glob pattern.
-
-    For an example of parsing Microsoft Excel files, see the
-    {{{xref-test/org/apache/tika/parser/microsoft/ExcelParserTest.html}ExcelParserTest}}
-    test case.
-
-   [Microsoft PowerPoint (application/vnd.ms-powerpoint)]
-    Tika uses the {{{http://poi.apache.org/hslf/}HSLF}} API in
-    {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
-    PowerPoint presentations. Support for Microsoft PowerPoint was added
-    in Tika 0.1.
-
-    The PowerPoint parser in Tika simply uses the POI
-    {{{http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html}PowerPointExtractor}}
-    class to extract all text as a single paragraph from a PowerPoint document.
-    Support for more complex content structures is not yet implemented; see
-    {{{https://issues.apache.org/jira/browse/TIKA-123}TIKA-123}} for this
-    issue.
-
-    Generic Microsoft Office document properties like title, author, and
-    keywords are returned as metadata properties.
-
-    Support for the new XML-based PowerPoint 2007 format is pending for a POI
-    upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
-    for the current status of this issue.
-
-    Microsoft PowerPoint presentations are automatically detected based on
-    a magic header or a glob pattern.
-
-    For an example of parsing Microsoft PowerPoint files, see the
-    {{{xref-test/org/apache/tika/parser/microsoft/PowerPointParserTest.html}PowerPointParserTest}}
-    test case.
-
-   [Microsoft Visio (application/vnd.visio)]
-    Tika uses the {{{http://poi.apache.org/hdgf/}HDGF}} API in
-    {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
-    Visio diagrams. Support for Microsoft Visio was added in Tika 0.2.
-
-    The Visio parser in Tika simply uses the POI
-    {{{http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html}VisioExtractor}}
-    class to extract all text entries from Visio documents.
-    Support for more complex content structures is not yet implemented; see
-    {{{https://issues.apache.org/jira/browse/TIKA-123}TIKA-123}} for this
-    issue.
-
-    Generic Microsoft Office document properties like title, author, and
-    keywords are returned as metadata properties.
-
-    Microsoft Visio diagrams are automatically detected based on a magic
-    header or a glob pattern.
-
-   [Microsoft Outlook (application/vnd.ms-outlook)]
-    Tika uses the {{{http://poi.apache.org/hsmf/}HSMF}} API in
-    {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
-    Outlook messages. Support for Microsoft Outlook was added in Tika 0.2.
-
-    The Outlook parser in Tika extracts the subject of the message and
-    the From, To, Cc, and Bcc addresses (formatted for display) along
-    with the body text of text/plain messages.
-
-    Microsoft Outlook messages are automatically detected based on a magic
-    header or a glob pattern.
-
-    For an example of parsing Microsoft Outlook files, see the
-    {{{xref-test/org/apache/tika/parser/microsoft/OutlookParserTest.html}OutlookParserTest}}
-    test case.
-
    [MP3 Audio (audio/mp3)]
     TODO
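
The new OLE 2 section in the diff above says the generic container format is
detected using a magic number, with glob patterns like *.doc and *.ppt as a
second hint. The following is a minimal Java sketch of that idea only; it is
not Tika's detector. The class name Ole2Sniffer and its methods are
hypothetical and appear nowhere in this commit. The 8-byte signature is the
standard OLE2 compound document header, and the glob check covers only the
two patterns the text names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper (not part of Tika): sniffs the OLE2 container signature. */
final class Ole2Sniffer {

    /** The 8-byte signature found at the start of an OLE2 compound document. */
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    private Ole2Sniffer() {
    }

    /** Returns true if the given bytes start with the OLE2 signature. */
    static boolean hasOle2Magic(byte[] header) {
        return header.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    /**
     * Reads up to 8 bytes from the stream and compares them with the signature.
     * The bytes are consumed; a caller that wants to parse the stream afterwards
     * would need to mark and reset it (or otherwise buffer it) first.
     */
    static boolean hasOle2Magic(InputStream in) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                return false; // stream ended before 8 bytes were available
            }
            read += n;
        }
        return hasOle2Magic(header);
    }

    /** File name hint; assumption: only the *.doc and *.ppt globs from the text. */
    static boolean matchesKnownGlob(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".doc") || lower.endsWith(".ppt");
    }
}
```

A match on the signature only identifies the container. As the section notes,
telling Word, Excel, PowerPoint, Visio and Outlook files apart requires
further parsing of the document itself.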

svn commit: r696097 - /incubator/tika/trunk/src/site/apt/formats.apt