Author: jukka
Date: Tue Dec 2 13:51:53 2008
New Revision: 722625
URL: http://svn.apache.org/viewvc?rev=722625&view=rev
Log:
TIKA-176: Getting Started guide
Reverted the parts that don't work with Tika 0.1.
Modified:
lucene/tika/trunk/src/site/apt/gettingstarted.apt
Modified: lucene/tika/trunk/src/site/apt/gettingstarted.apt
URL:
http://svn.apache.org/viewvc/lucene/tika/trunk/src/site/apt/gettingstarted.apt?rev=722625&r1=722624&r2=722625&view=diff
==============================================================================
--- lucene/tika/trunk/src/site/apt/gettingstarted.apt (original)
+++ lucene/tika/trunk/src/site/apt/gettingstarted.apt Tue Dec 2 13:51:53 2008
@@ -50,27 +50,10 @@
* tika-x.y.jar
- * tika-x.y-standalone.jar (available since 0.3)
-
- * tika-x.y-jdk14.jar (available since 0.3)
-
The main build artifact (tika-x.y.jar) contains the compiled Java
classes and interfaces in the <<<org.apache.tika>>> packages and
the default Tika configuration settings.
- The standalone jar (tika-x.y-standalone.jar, available since version 0.3)
- includes also the classes and resources from all Tika dependencies. You
- can just drop this jar file in your application to access the full
- functionality of all Tika parsers. This is a runnable jar that runs the
- Tika command line and graphical user interfaces without needing any other
- libraries (except of course the standard Java 5 class libraries) in the
- classpath.
-
- The final build artifact (tika-x.y-jdk14.jar, available since version 0.3)
- is a {{{http://retrotranslator.sourceforge.net/}retrotranslated}} version
- of the main Tika build artifact. Normally Tika only works with Java 5 or
- higher, but you can use this version of Tika also with Java 1.4.
-
Using Tika as a Maven dependency
Using Tika in a Maven project is very straightforward. Just select the
@@ -84,72 +67,48 @@
</dependency>
---
- The first version of the org.apache.tika:tika artifact available in the
- central Maven repository is 0.2. For the 0.1 version or for SNAPSHOT
- dependencies you need to build and install Tika locally.
-
- If your application uses Java 1.4, you need to use the retrotranslated
- version of Tika. This version is identified by the classifier "jdk14".
-
----
-<dependency>
- <groupId>org.apache.tika</groupId>
- <artifactId>tika</artifactId>
- <version>x.y</version>
- <classifier>jdk14</classifier>
-</dependency>
----
-
- The retrotranslated version will be available in the central Maven
- repository starting with Tika version 0.3.
+ Note that the incubating 0.1 release of Tika is not available in the
+ central Maven repository. You need to build and install Tika locally
+ to use it as a Maven dependency.
Note that adding the Tika dependency will introduce a number of
transitive dependencies to your project. You need to make sure that
these dependencies won't conflict with your existing project dependencies.
- The listing below shows all the compile-scope dependencies of the
- current Tika trunk (0.3-SNAPSHOT, November 2008). You can use the
- command "mvn dependency:tree" to check the latest tree of dependencies.
+ The listing below shows all the compile-scope dependencies of Tika 0.1.
+ You can use the command "mvn dependency:tree" to check the latest tree
+ of dependencies.
---
-org.apache.tika:tika:jar:0.3-SNAPSHOT
+org.apache.tika:tika:jar:0.1-incubating
+- commons-lang:commons-lang:jar:2.1:compile
+- commons-logging:commons-logging:jar:1.0.4:compile
+- commons-codec:commons-codec:jar:1.3:compile
-+- commons-io:commons-io:jar:1.4:compile
+- pdfbox:pdfbox:jar:0.7.3:compile
| +- org.fontbox:fontbox:jar:0.1.0:compile
| +- org.jempbox:jempbox:jar:0.2.0:compile
| +- bouncycastle:bcmail-jdk14:jar:136:compile
| \- bouncycastle:bcprov-jdk14:jar:136:compile
-+- org.apache.poi:poi:jar:3.1-FINAL:compile
-+- org.apache.poi:poi-scratchpad:jar:3.1-FINAL:compile
-+- net.sourceforge.nekohtml:nekohtml:jar:1.9.7:compile
-| \- xerces:xercesImpl:jar:2.8.1:compile
-| \- xml-apis:xml-apis:jar:1.3.03:compile
++- org.apache.poi:poi:jar:3.0-FINAL:compile
++- jdom:jdom:jar:1.0:compile
++- jaxen:jaxen:jar:1.1.1:compile
+| +- dom4j:dom4j:jar:1.6.1:compile
+| +- xml-apis:xml-apis:jar:1.3.02:compile
+| +- xerces:xercesImpl:jar:2.6.2:compile
+| \- xom:xom:jar:1.0:compile
+| +- xerces:xmlParserAPIs:jar:2.6.2:compile
+| \- xalan:xalan:jar:2.6.0:compile
++- nekohtml:nekohtml:jar:0.9.5:compile
+- com.ibm.icu:icu4j:jar:3.4.4:compile
-+- asm:asm:jar:3.1:compile
\- log4j:log4j:jar:1.2.14:compile
---
Using Tika in an Ant project
Unless you use a dependency manager tool like
- {{{http://ant.apache.org/ivy/}Apache Ivy}}, the easiest way to include
- Tika in your {{{http://ant.apache.org/}Ant}} build is to include the
- standalone jar in your classpath settings. The standalone jar contains
- everything you need, Tika and all the required dependencies, in a single
- package.
-
----
-<classpath>
- ... <!-- your other classpath entries -->
- <pathelement location="path/to/tika-x.y-standalone.jar"/>
-</classpath>
----
-
- If you want more control over which specific parser libraries you want
- to include in your application, you can include main Tika jar file and
- all the dependencies individually.
+ {{{http://ant.apache.org/ivy/}Apache Ivy}} you need to add both the
+ Tika jar and all dependency jars individually in your
+ {{{http://ant.apache.org/}Ant}} build. You can leave out some parser
+ libraries if you don't need support for certain file formats.
---
<classpath>
@@ -164,69 +123,22 @@
<pathelement location="path/to/jempbox-0.2.0.jar"/>
<pathelement location="path/to/bcmail-jdk14-136.jar"/>
<pathelement location="path/to/bcprov-jdk14-136.jar"/>
- <pathelement location="path/to/poi-3.1-FINAL.jar"/>
- <pathelement location="path/to/poi-scratchpad-3.1-FINAL.jar"/>
- <pathelement location="path/to/nekohtml-1.9.7.jar"/>
- <pathelement location="path/to/xercesImpl-2.8.1.jar"/>
- <pathelement location="path/to/xml-apis-1.3.03.jar"/>
+ <pathelement location="path/to/poi-3.0-FINAL.jar"/>
+ <pathelement location="path/to/jdom-1.0.jar"/>
+ <pathelement location="path/to/jaxen-1.1.1.jar"/>
+ <pathelement location="path/to/dom4j-1.6.1.jar"/>
+ <pathelement location="path/to/xml-apis-1.3.02.jar"/>
+ <pathelement location="path/to/xercesImpl-2.6.2.jar"/>
+ <pathelement location="path/to/xom-1.0.jar"/>
+ <pathelement location="path/to/xmlParserAPIs-2.6.2.jar"/>
+ <pathelement location="path/to/xalan-2.6.0.jar"/>
+ <pathelement location="path/to/nekohtml-0.9.5.jar"/>
<pathelement location="path/to/icu4j-3.4.4.jar"/>
- <pathelement location="path/to/asm-3.1.jar"/>
<pathelement location="path/to/log4j-1.2.14.jar"/>
</classpath>
---
- If you're using Java 1.4 as the base platform of your project,
- use the tika-x.y-jdk14.jar instead.
-
An easy way to gather all these libraries is to run
"mvn dependency:copy-dependencies" in the Tika source directory.
This will copy all Tika dependencies to the <<<target/dependencies>>>
directory.
-
-Using Tika as a command line utility
-
- The standalone jar (tika-x.y-standalone.jar) can be used as a command
- line utility for extracting text content and metadata from all sorts of
- files. The usage instructions are shown below.
-
----
-usage: java -jar tika-x.y-standalone.jar [option] file
-
-Options:
- -? or --help Print this usage message
- -v or --verbose Print debug level messages
- -g or --gui Start the Apache Tika GUI
- -x or --xml Output XHTML content (default)
- -h or --html Output HTML content
- -t or --text Output plain text content
- -m or --metadata Output only metadata
-
-Description:
- Apache Tika will parse the file(s) specified on the
- command line and output the extracted text content
- or metadata to standard output.
-
- Instead of a file name you can also specify the URL
- of a document to be parsed.
-
- Use "-" as the file name to parse the standard
- input stream.
-
- Use the "--gui" (or "-g") option to start
- the Apache Tika GUI. You can drag and drop files
- from a normal file explorer to the GUI window to
- extract text content and metadata from the files.
----
-
- The standalone jar is fully self-contained and should work wherever
- a Java 5 (or higher) runtime environment is available.
-
- You can also use the jar as a component in a Unix pipeline or
- as an external tool in many scripting languages.
-
----
-# Check if an Internet resource contains a specific keyword
-curl http://.../document.doc \
- | java -jar tika-x.y-standalone.jar --text \
- | grep -q keyword
----