Author: jukka
Date: Sun Sep 14 09:53:55 2008
New Revision: 695253
URL: http://svn.apache.org/viewvc?rev=695253&view=rev
Log:
TIKA-157: List all the document formats supported by Tika
First draft of the list of supported formats.
Added:
incubator/tika/trunk/src/site/apt/formats.apt
Modified:
incubator/tika/trunk/src/site/site.xml
Added: incubator/tika/trunk/src/site/apt/formats.apt
URL:
http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/apt/formats.apt?rev=695253&view=auto
==============================================================================
--- incubator/tika/trunk/src/site/apt/formats.apt (added)
+++ incubator/tika/trunk/src/site/apt/formats.apt Sun Sep 14 09:53:55 2008
@@ -0,0 +1,192 @@
+ --------------------------
+ Supported Document Formats
+ --------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Supported Document Formats
+
+ This page lists all the document formats supported by Apache Tika.
+
+ [bzip2 compression (application/x-bzip)]
+ TODO
+
+ [Extensible Markup Language (application/xml)]
+ TODO
+
+ [gzip compression (application/x-gzip)]
+ TODO
+
+ [HyperText Markup Language (text/html)]
+ TODO
+
+ [Images (image/*)]
+ TODO
+
+ [Java class files]
+ TODO
+
+ [Java jar archives]
+ TODO
+
+ [Microsoft Word (application/msword)]
+ Tika uses the {{{http://poi.apache.org/hwpf/}HWPF}} API in
+ {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
+ Word documents. Support for Microsoft Word was added in Tika 0.1.
+
+ The Word parser in Tika simply uses the POI
+ {{{http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html}WordExtractor}}
+ class to extract text paragraphs from Word documents. Support for more
+ complex content structures is not yet implemented; see
+ {{{https://issues.apache.org/jira/browse/TIKA-123}TIKA-123}} for this
+ issue.
+
+ Generic Microsoft Office document properties like title, author, and
+ keywords are returned as metadata properties.
+
+ Support for the new XML-based Word 2007 format is pending a POI
+ upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
+ for the current status of this issue.
+
+ For an example of parsing Microsoft Word files, see the
+ {{{xref-test/org/apache/tika/parser/microsoft/WordParserTest.html}WordParserTest}}
+ test case.
+
+ [Microsoft Excel (application/vnd.ms-excel)]
+ Tika uses the {{{http://poi.apache.org/hssf/}HSSF}} API in
+ {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
+ Excel spreadsheets. Support for Microsoft Excel was added in Tika 0.1.
+
+ The Excel parser in Tika uses the HSSF event model and is able to recreate
+ much of the document structure, including all (non-empty) worksheets and
+ their table structures. Formula results are extracted as stored in the
+ Excel file, and cell links are exposed as XHTML links. These features
+ were added in Tika 0.2.
+
+ Cell comments and formatting are currently not supported. See
+ {{{https://issues.apache.org/jira/browse/TIKA-148}TIKA-148}} and
+ {{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}} for the
+ respective issues.
+
+ Generic Microsoft Office document properties like title, author, and
+ keywords are returned as metadata properties.
+
+ Support for the new XML-based Excel 2007 format is pending a POI
+ upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
+ for the current status of this issue.
+
+ For an example of parsing Microsoft Excel files, see the
+ {{{xref-test/org/apache/tika/parser/microsoft/ExcelParserTest.html}ExcelParserTest}}
+ test case.
+
+ [Microsoft PowerPoint (application/vnd.ms-powerpoint)]
+ Tika uses the {{{http://poi.apache.org/hslf/}HSLF}} API in
+ {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
+ PowerPoint presentations. Support for Microsoft PowerPoint was added
+ in Tika 0.1.
+
+ The PowerPoint parser in Tika simply uses the POI
+ {{{http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html}PowerPointExtractor}}
+ class to extract all text as a single paragraph from a PowerPoint document.
+ Support for more complex content structures is not yet implemented; see
+ {{{https://issues.apache.org/jira/browse/TIKA-123}TIKA-123}} for this
+ issue.
+
+ Generic Microsoft Office document properties like title, author, and
+ keywords are returned as metadata properties.
+
+ Support for the new XML-based PowerPoint 2007 format is pending a POI
+ upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
+ for the current status of this issue.
+
+ For an example of parsing Microsoft PowerPoint files, see the
+ {{{xref-test/org/apache/tika/parser/microsoft/PowerPointParserTest.html}PowerPointParserTest}}
+ test case.
+
+ [Microsoft Visio (application/vnd.visio)]
+ Tika uses the {{{http://poi.apache.org/hdgf/}HDGF}} API in
+ {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
+ Visio diagrams. Support for Microsoft Visio was added in Tika 0.2.
+
+ The Visio parser in Tika simply uses the POI
+ {{{http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html}VisioTextExtractor}}
+ class to extract all text entries from Visio documents.
+ Support for more complex content structures is not yet implemented; see
+ {{{https://issues.apache.org/jira/browse/TIKA-123}TIKA-123}} for this
+ issue.
+
+ Generic Microsoft Office document properties like title, author, and
+ keywords are returned as metadata properties.
+
+ [Microsoft Outlook (application/vnd.ms-outlook)]
+ Tika uses the {{{http://poi.apache.org/hsmf/}HSMF}} API in
+ {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
+ Outlook messages. Support for Microsoft Outlook was added in Tika 0.2.
+
+ The Outlook parser in Tika extracts the subject of the message and
+ the From, To, Cc, and Bcc addresses (formatted for display) along
+ with the body text of text/plain messages.
+
+ For an example of parsing Microsoft Outlook files, see the
+ {{{xref-test/org/apache/tika/parser/microsoft/OutlookParserTest.html}OutlookParserTest}}
+ test case.
+
+ [MP3 Audio (audio/mp3)]
+ TODO
+
+ [OpenDocument (application/vnd.oasis.opendocument.*)]
+ TODO
+
+ [Plain text (text/plain)]
+ Tika uses the
+ {{{http://www.icu-project.org/}International Components for Unicode}}
+ Java library (ICU4J) to parse plain text. Support for plain text was added
+ in Tika 0.1.
+
+ Extracting text content from plain text files is a relatively complex
+ task because the character encoding of the text file is often unknown
+ to the parser.
+
+ The text parser in Tika uses the ICU4J
+ {{{http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html}CharsetDetector}}
+ class to automatically detect the character encoding of any text input.
+ As an added benefit, the ICU4J library is in some cases also able to
+ detect the language in which the text is written.
+
+ The character encoding and language of the plain text document are
+ returned as the <<<Metadata.CONTENT_ENCODING>>> and <<<Metadata.LANGUAGE>>>
+ metadata properties. If the (declared) content encoding of a text document
+ is already known to the client application, then it can be supplied as the
+ <<<Metadata.CONTENT_ENCODING>>> metadata property to the parser to
+ simplify encoding detection.
+
+ [Portable Document Format (application/pdf)]
+ TODO
+
+ [Rich Text Format (application/rtf)]
+ Tika uses Java's built-in Swing library to parse Rich Text Format (RTF)
+ documents. Support for RTF was added in Tika 0.1.
+
+ The RTF parser in Tika uses the Swing
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/javax/swing/text/rtf/RTFEditorKit.html}RTFEditorKit}}
+ class to extract all text from an RTF document as a single paragraph.
+ Document metadata extraction is currently not supported.
+
+ [tar archive (application/x-tar)]
+ TODO
+
+ [ZIP archive (application/zip)]
+ TODO
Modified: incubator/tika/trunk/src/site/site.xml
URL:
http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/site.xml?rev=695253&r1=695252&r2=695253&view=diff
==============================================================================
--- incubator/tika/trunk/src/site/site.xml (original)
+++ incubator/tika/trunk/src/site/site.xml Sun Sep 14 09:53:55 2008
@@ -39,7 +39,8 @@
<item name="Introduction" href="index.html"/>
<item name="Download" href="download.html"/>
<item name="Documentation" href="documentation.html"/>
+ <item name="Supported Formats" href="formats.html"/>
</menu>
<menu ref="reports"/>
</body>
-</project>
\ No newline at end of file
+</project>