Author: nick Date: Sat Jun 27 16:15:19 2015 New Revision: 1687946 URL: http://svn.apache.org/r1687946 Log: Tika javadocs are in /api/ not /apidocs/, correct links
Modified: tika/site/publish/0.10/parser.html tika/site/publish/1.10/examples.html tika/site/publish/1.3/parser.html tika/site/publish/1.4/parser.html tika/site/publish/1.5/parser.html tika/site/publish/1.6/parser.html tika/site/publish/1.7/examples.html tika/site/publish/1.7/parser.html tika/site/publish/1.8/examples.html tika/site/publish/1.8/parser.html tika/site/publish/1.9/examples.html tika/site/publish/1.9/parser.html Modified: tika/site/publish/0.10/parser.html URL: http://svn.apache.org/viewvc/tika/site/publish/0.10/parser.html?rev=1687946&r1=1687945&r2=1687946&view=diff ============================================================================== --- tika/site/publish/0.10/parser.html (original) +++ tika/site/publish/0.10/parser.html Sat Jun 27 16:15:19 2015 @@ -131,7 +131,7 @@ try { ... </body> </html></pre></div> -<p>Parser implementations typically use the <a href="./apidocs/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p> +<p>Parser implementations typically use the <a href="./api/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p> <p>Dealing with the raw SAX events can be a bit complex, so Apache Tika comes with a number of utility classes that can be used to process and convert the event stream to other representations.</p> <p>For example, the <a href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a> class can be used to extract just the body part of the XHTML output and feed it either as SAX events to another content handler or as characters to an output stream, a writer, or simply a string. 
The following code snippet parses a document from the standard input stream and outputs the extracted text content to standard output:</p> <div> @@ -173,7 +173,7 @@ try { <h3>Parser implementations<a name="Parser_implementations"></a></h3> <p>Apache Tika comes with a number of parser classes for parsing <a href="./formats.html">various document formats</a>. You can also extend Tika with your own parsers, and of course any contributions to Tika are warmly welcome.</p> <p>The goal of Tika is to reuse existing parser libraries like <a class="externalLink" href="http://pdfbox.apache.org/">PDFBox</a> or <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> as much as possible, and so most of the parser classes in Tika are adapters to such external libraries.</p> -<p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the <a href="./apidocs/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any types of documents. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div></div> +<p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the <a href="./api/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any types of documents. 
This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div></div> </div> <div id="sidebar"> <div id="navigation"> Modified: tika/site/publish/1.10/examples.html URL: http://svn.apache.org/viewvc/tika/site/publish/1.10/examples.html?rev=1687946&r1=1687945&r2=1687946&view=diff ============================================================================== --- tika/site/publish/1.10/examples.html (original) +++ tika/site/publish/1.10/examples.html Sat Jun 27 16:15:19 2015 @@ -113,23 +113,23 @@ <p>Tika provides a number of different ways to parse a file. These provide different levels of control, flexibility, and complexity.</p> <div class="section"> <h4><a name="Parsing_using_the_Tika_Facade">Parsing using the Tika Facade</a></h4> -<p>The <a href="./apidocs/org/apache/tika/Tika.html">Tika facade</a>, provides a number of very quick and easy ways to have your content parsed by Tika, and return the resulting plain text</p><style type="text/css"> +<p>The <a href="./api/org/apache/tika/Tika.html">Tika facade</a>, provides a number of very quick and easy ways to have your content parsed by Tika, and return the resulting plain text</p><style type="text/css"> @import url('attached-includes/css/shCoreDefault.css'); </style> <div id="highlighter_177280" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number49 index0 alt2"><code class="java keyword">public</code> <code class="java plain">String parseToStringExample() </code><code class="java keyword">throws</code> <code class="java plain">IOException, SAXException, TikaException {</code></div><div class="line number50 index1 alt1"><code class="java spaces"> </code><code class="java plain">InputStream stream = ParsingExample.</code><code class="java keyword">class</code><code class="java 
plain">.getResourceAsStream(</code><code class="java string">"test.doc"</code><code class="java plain">);</code></div><div class="line number51 index2 alt2"><code class="java spaces"> </code><code class="java plain">Tika tika = </code><code class="java keyword">new</code> <code class="java plain">Tika();</code></div><div class="line number52 index3 alt1"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number53 index4 alt2"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">tika.parseToString(stream);</code></div><div class="line number54 index5 alt1"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">finally</code> <code class="java plain">{</code></div><div class="line number55 index6 alt2"><code class="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number56 index7 alt1"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number57 index8 alt2"><code class="java plain">}</code></div></div></td></tr></tbody></table></div></div> <div class="section"> <h4><a name="Parsing_using_the_Auto-Detect_Parser">Parsing using the Auto-Detect Parser</a></h4> -<p>For more control, you can call the <a href="./apidocs/org/apache/tika/parser/Parser.html">Tika Parsers</a> directly. 
Most likely, you'll want to start out using the <a href="./apidocs/org/apache/tika/parser/AutoDetectParser.html">Auto-Detect Parser</a>, which automatically figures out what kind of content you have, then calls the appropriate parser for you.</p><div id="highlighter_163376" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number83 index0 alt2"><code class="java keyword">public</code> <code class="java plain">String parseExample() </code><code class="java keyword">throws</code> <code class="java plain">IOException, SAXException, TikaException {</code></div><div class="line number84 index1 alt1"><code class="java spaces"> </code><code class="java plain">InputStream stream = ParsingExample.</code><code class="java keyword">class</code><code class="java plain">.getResourceAsStream(</code><code class="java string">"test.doc"</code><code class="java plain">);</code></div><div class="line number85 index2 alt2"><code class="java spaces"> </code><code class="java plain">AutoDetectParser parser = </code><code class="java keyword">new</code> <code class="java plain">AutoDetectParser();</code></div><div class="line number86 index3 alt1"><code class="java spaces"> </code><code class="java plain">BodyContentHandler handler = </code><code class="java keyword">new</code> <code class="java plain">BodyContentHandler();</code></div><div class="line number87 index4 alt2"><code class="java spaces"> </code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line number88 index5 alt1"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number89 index6 alt2"><code class="java spaces"> </code><code class="java plain">parser.parse(stream, handler, metadata);</code></div><div class="line number90 index7
alt1"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">handler.toString();</code></div><div class="line number91 index8 alt2"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">finally</code> <code class="java plain">{</code></div><div class="line number92 index9 alt1"><code class="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number93 index10 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="l ine number94 index11 alt1"><code class="java plain">}</code></div></div></td></tr></tbody></table></div></div></div> +<p>For more control, you can call the <a href="./api/org/apache/tika/parser/Parser.html">Tika Parsers</a> directly. Most likely, you'll want to start out using the <a href="./api/org/apache/tika/parser/AutoDetectParser.html">Auto-Detect Parser</a>, which automatically figures out what kind of content you have, then calls the appropriate parser for you.</p><div id="highlighter_163376" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number83 index0 alt2"><code class="java keyword">public</code> <code class="java plain">String parseExample() </code><code class="java keyword">throws</code> <code class="java plain">IOException, SAXException, TikaException {</code></div><div class="line number84 index1 alt1"><code class="java spaces"> </code><code class="java plain">InputStream stream = ParsingExample.</code><code class="java keyword">class</code><code clas s="java plain">.getResourceAsStream(</code><code class="java string">"test.doc"</code><code class="java plain">);</code></div><div class="line number85 index2 alt2"><code class="java spaces"> </code><code class="java plain">AutoDetectParser parser = </code><code class="java keyword">new</code> <code class="java 
plain">AutoDetectParser();</code></div><div class="line number86 index3 alt1"><code class="java spaces"> </code><code class="java plain">BodyContentHandler handler = </code><code class="java keyword">new</code> <code class="java plain">BodyContentHandler();</code></div><div class="line number87 index4 alt2"><code class="java spaces"> </code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line number88 index5 alt1"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number89 index6 alt2"><code class="java spaces"> </code><code class="java plain">parser.parse(stream, handler, metadata);</code></div><div class="line number90 index7 alt1"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">handler.toString();</code></div><div class="line number91 index8 alt2"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">finally</code> <code class="java plain">{</code></div><div class="line number92 index9 alt1"><code class="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number93 index10 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line numb er94 index11 alt1"><code class="java plain">}</code></div></div></td></tr></tbody></table></div></div></div> <div class="section"> <h3><a name="Picking_different_output_formats">Picking different output formats</a></h3> <p>With Tika, you can get the textual content of your files returned in a number of different formats. These can be plain text, html, xhtml, xhtml of one part of the file etc. 
This is controlled based on the <a class="externalLink" href="http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html">ContentHandler</a> you supply to the Parser.</p> <div class="section"> <h4><a name="Parsing_to_Plain_Text">Parsing to Plain Text</a></h4> -<p>By using the <a href="./apidocs/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a>, you can request that Tika return only the content of the document's body as a plain-text string.</p><div id="highlighter_64041" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number46 index0 alt1"><code class="java keyword">public</code> <code class="java plain">String parseToPlainText() </code><code class="java keyword">throws</code> <code class="java plain">IOException, SAXException, TikaException {</code></div><div class="line number47 index1 alt2"><code class="java spaces"> </code><code class="java plain">BodyContentHandler handler = </code><code class="java keyword">new</code> <code class="java plain">BodyContentHandler();</code></div><div class="line number48 index2 alt1"><code class="java spaces"> </code> </div><d iv class="line number49 index3 alt2"><code class="java spaces"> </code><code class="java plain">InputStream stream = ContentHandlerExample.</code><code class="java keyword">class</code><code class="java plain">.getResourceAsStream(</code><code class="java string">"test.doc"</code><code class="java plain">);</code></div><div class="line number50 index4 alt1"><code class="java spaces"> </code><code class="java plain">AutoDetectParser parser = </code><code class="java keyword">new</code> <code class="java plain">AutoDetectParser();</code></div><div class="line number51 index5 alt2"><code class="java spaces"> </code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line 
number52 index6 alt1"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number53 index7 alt2"><code class="java spaces"> </code><code class="java plain">parser.parse(stream, handler, metadata);</code></div><div class="line number54 index8 alt1"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">handler.toString();</code></div><div class="line number55 index9 alt2"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">finally</code> <code class="java plain">{</code></div><div class="line number56 index10 alt1"><code class="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number57 index11 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number58 index12 alt1"><code class="java plain">}</code></div></div></td></tr></tbody></table></div></div> +<p>By using the <a href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a>, you can request that Tika return only the content of the document's body as a plain-text string.</p><div id="highlighter_64041" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number46 index0 alt1"><code class="java keyword">public</code> <code class="java plain">String parseToPlainText() </code><code class="java keyword">throws</code> <code class="java plain">IOException, SAXException, TikaException {</code></div><div class="line number47 index1 alt2"><code class="java spaces"> </code><code class="java plain">BodyContentHandler handler = </code><code class="java keyword">new</code> <code class="java plain">BodyContentHandler();</code></div><div class="line number48 index2 alt1"><code class="java spaces"> </code> </div><div class="line number49 index3 alt2"><code
class="java spaces"> </code><code class="java plain">InputStream stream = ContentHandlerExample.</code><code class="java keyword">class</code><code class="java plain">.getResourceAsStream(</code><code class="java string">"test.doc"</code><code class="java plain">);</code></div><div class="line number50 index4 alt1"><code class="java spaces"> </code><code class="java plain">AutoDetectParser parser = </code><code class="java keyword">new</code> <code class="java plain">AutoDetectParser();</code></div><div class="line number51 index5 alt2"><code class="java spaces"> </code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line number52 index6 alt1"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div> <div class="line number53 index7 alt2"><code class="java spaces"> </code><code class="java plain">parser.parse(stream, handler, metadata);</code></div><div class="line number54 index8 alt1"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">handler.toString();</code></div><div class="line number55 index9 alt2"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">finally</code> <code class="java plain">{</code></div><div class="line number56 index10 alt1"><code class="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number57 index11 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number58 index12 alt1"><code class="java pl ain">}</code></div></div></td></tr></tbody></table></div></div> <div class="section"> <h4><a name="Parsing_to_XHTML">Parsing to XHTML</a></h4> -<p>By using the <a href="./apidocs/org/apache/tika/sax/ToXMLContentHandler.html">ToXMLContentHandler</a>, you can get the XHTML content of the whole document as a 
string.</p><div id="highlighter_82511" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number63 index0 alt2"><code class="java keyword">public</code> <code class="java plain">String parseToHTML() </code><code class="java keyword">throws</code> <code class="java plain">IOException, SAXException, TikaException {</code></div><div class="line number64 index1 alt1"><code class="java spaces"> </code><code class="java plain">ContentHandler handler = </code><code class="java keyword">new</code> <code class="java plain">ToXMLContentHandler();</code></div><div class="line number65 index2 alt2"><code class="java spaces"> </code> </div><div class="line number66 index3 alt1">< code class="java spaces"> </code><code class="java plain">InputStream stream = ContentHandlerExample.</code><code class="java keyword">class</code><code class="java plain">.getResourceAsStream(</code><code class="java string">"test.doc"</code><code class="java plain">);</code></div><div class="line number67 index4 alt2"><code class="java spaces"> </code><code class="java plain">AutoDetectParser parser = </code><code class="java keyword">new</code> <code class="java plain">AutoDetectParser();</code></div><div class="line number68 index5 alt1"><code class="java spaces"> </code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line number69 index6 alt2"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number70 index7 a lt1"><code class="java spaces"> </code><code class="java plain">parser.parse(stream, handler, metadata);</code></div><div class="line number71 index8 alt2"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">handler.toString();</code></div><div class="line 
number72 index9 alt1"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">finally</code> <code class="java plain">{</code></div><div class="line number73 index10 alt2"><code class="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number74 index11 alt1"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number75 index12 alt2"><code class="java plain">}</code></div></div></td></tr ></tbody></table></div> -<p>If you just want the body of the xhtml document, without the header, you can chain together a <a href="./apidocs/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a> and a <a href="./apidocs/org/apache/tika/sax/ToXMLContentHandler.html">ToXMLContentHandler</a> as shown:</p><div id="highlighter_588168" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number81 index0 alt2"><code class="java keyword">public</code> <code class="java plain">String parseBodyToHTML() </code><code class="java keyword">throws</code> <code class="java plain">IOException, SAXException, TikaException {</code></div><div class="line number82 index1 alt1"><code class="java spaces"> </code><code class="java plain">ContentHandler handler = </code><code class="java keyword">new</code> <code class="java plain">BodyContentHandler(</code></div><div class="line number83 in dex2 alt2"><code class="java spaces"> </code><code class="java keyword">new</code> <code class="java plain">ToXMLContentHandler());</code></div><div class="line number84 index3 alt1"><code class="java spaces"> </code> </div><div class="line number85 index4 alt2"><code class="java spaces"> </code><code class="java plain">InputStream stream = ContentHandlerExample.</code><code class="java keyword">class</code><code class="java plain">.getResourceAsStream(</code><code class="java 
string">"test.doc"</code><code class="java plain">);</code></div><div class="line number86 index5 alt1"><code class="java spaces"> </code><code class="java plain">AutoDetectParser parser = </code><code class="java keyword">new</code> <code class="java plain">AutoDetectParser();</code></div><div class="line number87 index6 alt2"><code class="java spaces"> &n bsp; </code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line number88 index7 alt1"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number89 index8 alt2"><code class="java spaces"> </code><code class="java plain">parser.parse(stream, handler, metadata);</code></div><div class="line number90 index9 alt1"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">handler.toString();</code></div><div class="line number91 index10 alt2"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">finally</code> <code class="java plain">{</code></div><div class="line number92 index11 alt1"> <code class="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number93 index12 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number94 index13 alt1"><code class="java plain">}</code></div></div></td></tr></tbody></table></div></div> +<p>By using the <a href="./api/org/apache/tika/sax/ToXMLContentHandler.html">ToXMLContentHandler</a>, you can get the XHTML content of the whole document as a string.</p><div id="highlighter_82511" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number63 index0 alt2"><code class="java keyword">public</code> <code class="java plain">String 
parseToHTML() </code><code class="java keyword">throws</code> <code class="java plain">IOException, SAXException, TikaException {</code></div><div class="line number64 index1 alt1"><code class="java spaces"> </code><code class="java plain">ContentHandler handler = </code><code class="java keyword">new</code> <code class="java plain">ToXMLContentHandler();</code></div><div class="line number65 index2 alt2"><code class="java spaces"> </code> </div><div class="line number66 index3 alt1"><code class="java spaces"> </code><code class="java plain">InputStream stream = ContentHandlerExample.</code><code class="java keyword">class</code><code class="java plain">.getResourceAsStream(</code><code class="java string">"test.doc"</code><code class="java plain">);</code></div><div class="line number67 index4 alt2"><code class="java spaces"> </code><code class="java plain">AutoDetectParser parser = </code><code class="java keyword">new</code> <code class="java plain">AutoDetectParser();</code></div><div class="line number68 index5 alt1"><code class="java spaces"> </code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line number69 index6 alt2"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number70 index7 alt1" ><code class="java >spaces"> </code><code >class="java plain">parser.parse(stream, handler, metadata);</code></div><div >class="line number71 index8 alt2"><code class="java >spaces"> </code><code >class="java keyword">return</code> <code class="java >plain">handler.toString();</code></div><div class="line number72 index9 >alt1"><code class="java spaces"> </code><code >class="java plain">} </code><code class="java keyword">finally</code> <code >class="java plain">{</code></div><div class="line number73 index10 >alt2"><code class="java >spaces"> </code><code >class="java 
plain">stream.close();</code></div><div class="line number74 >index11 alt1"><code class="java spaces"> </code><code >class="java plain">}</code></div><div class="line number75 index12 >alt2"><code class="java plain">}</code></div></div></td></tr></t body></table></div> +<p>If you just want the body of the xhtml document, without the header, you can chain together a <a href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a> and a <a href="./api/org/apache/tika/sax/ToXMLContentHandler.html">ToXMLContentHandler</a> as shown:</p><div id="highlighter_588168" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number81 index0 alt2"><code class="java keyword">public</code> <code class="java plain">String parseBodyToHTML() </code><code class="java keyword">throws</code> <code class="java plain">IOException, SAXException, TikaException {</code></div><div class="line number82 index1 alt1"><code class="java spaces"> </code><code class="java plain">ContentHandler handler = </code><code class="java keyword">new</code> <code class="java plain">BodyContentHandler(</code></div><div class="line number83 index2 alt 2"><code class="java spaces"> </code><code class="java keyword">new</code> <code class="java plain">ToXMLContentHandler());</code></div><div class="line number84 index3 alt1"><code class="java spaces"> </code> </div><div class="line number85 index4 alt2"><code class="java spaces"> </code><code class="java plain">InputStream stream = ContentHandlerExample.</code><code class="java keyword">class</code><code class="java plain">.getResourceAsStream(</code><code class="java string">"test.doc"</code><code class="java plain">);</code></div><div class="line number86 index5 alt1"><code class="java spaces"> </code><code class="java plain">AutoDetectParser parser = </code><code class="java keyword">new</code> <code class="java 
plain">AutoDetectParser();</code></div><div class="line number87 index6 alt2"><code class="java spaces"> &nbs p;</code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line number88 index7 alt1"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number89 index8 alt2"><code class="java spaces"> </code><code class="java plain">parser.parse(stream, handler, metadata);</code></div><div class="line number90 index9 alt1"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">handler.toString();</code></div><div class="line number91 index10 alt2"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">finally</code> <code class="java plain">{</code></div><div class="line number92 index11 alt1"><code cl ass="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number93 index12 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number94 index13 alt1"><code class="java plain">}</code></div></div></td></tr></tbody></table></div></div> <div class="section"> <h4><a name="Fetching_just_certain_bits_of_the_XHTML">Fetching just certain bits of the XHTML</a></h4> <p>It possible to execute XPath queries on the parse results, to fetch only certain bits of the XHTML. 
</p><div id="highlighter_928522" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number100 index0 alt1"><code class="java keyword">public</code> <code class="java plain">String parseOnePartToHTML() </code><code class="java keyword">throws</code> <code class="java plain">IOException, SAXException, TikaException {</code></div><div class="line number101 index1 alt2"><code class="java spaces"> </code><code class="java comments">// Only get things under html -> body -> div (class=header)</code></div><div class="line number102 index2 alt1"><code class="java spaces"> </code><code class="java plain">XPathParser xhtmlParser = </code><code class="java keyword">new</code> <code class="java plain">XPathParser(</code><code class="java strin g">"xhtml"</code><code class="java plain">, XHTMLContentHandler.XHTML);</code></div><div class="line number103 index3 alt2"><code class="java spaces"> </code><code class="java plain">Matcher divContentMatcher = xhtmlParser.parse(</code></div><div class="line number104 index4 alt1"><code class="java spaces"> </code><code class="java string">"/xhtml:html/xhtml:body/xhtml:div/descendant::node()"</code><code class="java plain">); </code></div><div class="line number105 index5 alt2"><code class="java spaces"> </code><code class="java plain">ContentHandler handler = </code><code class="java keyword">new</code> <code class="java plain">MatchingContentHandler(</code></div><div class="line number106 index6 alt1"><code class="java spaces"> </code><code class="java ke yword">new</code> <code class="java plain">ToXMLContentHandler(), divContentMatcher);</code></div><div class="line number107 index7 alt2"><code class="java spaces"> </code> </div><div class="line number108 index8 alt1"><code class="java spaces"> </code><code class="java plain">InputStream stream = ContentHandlerExample.</code><code class="java keyword">class</code><code 
class="java plain">.getResourceAsStream(</code><code class="java string">"test2.doc"</code><code class="java plain">);</code></div><div class="line number109 index9 alt2"><code class="java spaces"> </code><code class="java plain">AutoDetectParser parser = </code><code class="java keyword">new</code> <code class="java plain">AutoDetectParser();</code></div><div class="line number110 index10 alt1"><code class="java spaces"> </code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line number111 index11 alt2"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number112 index12 alt1"><code class="java spaces"> </code><code class="java plain">parser.parse(stream, handler, metadata);</code></div><div class="line number113 index13 alt2"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">handler.toString();</code></div><div class="line number114 index14 alt1"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">finally</code> <code class="java plain">{</code></div><div class="line number115 index15 alt2"><code class="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number116 index16 alt1"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number117 index17 alt2"><code class="java plain">}</code></div></div></td></tr></tbody></table></div></div></div> @@ -138,7 +138,7 @@ <p>The textual output of parsing a file with Tika is returned via the SAX <a class="externalLink" href="http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html">ContentHandler</a> you pass to the parse method. 
It is possible to customise your parsing by supplying your own ContentHandler which does special things.</p> <div class="section"> <h4><a name="Extract_Phone_Numbers_from_Content_into_the_Metadata">Extract Phone Numbers from Content into the Metadata</a></h4> -<p>By using the <a href="./apidocs/org/apache/tika/sax/PhoneExtractingContentHandler.html">PhoneExtractingContentHandler</a>, you can have any phone numbers found in the textual content of the document extracted and placed into the Metadata object for you.</p><div id="highlighter_339689" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number69 index0 alt2"><code class="java keyword">public</code> <code class="java keyword">static</code> <code class="java keyword">void</code> <code class="java plain">process(File file) </code><code class="java keyword">throws</code> <code class="java plain">Exception {</code></div><div class="line number70 index1 alt1"><code class="java spaces"> </code><code class="java plain">Parser parser = </code><code class="java keyword">new</code> <code class="java plain">AutoDetectParser();</code></div><div class="line number71 index2 alt2"><code class="java spaces"> </code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line number72 index3 alt1"><code class="java spaces"> </code><code class="java comments">// The PhoneExtractingContentHandler will examine any characters for phone numbers before passing them</code></div><div class="line number73 index4 alt2"><code class="java spaces"> </code><code class="java comments">// to the underlying Handler.</code></div><div class="line number74 index5 alt1"><code class="java spaces"> </code><code class="java plain">PhoneExtractingContentHandler handler = </code><code class="java keyword">new</code> <code class="java 
plain">PhoneExtractingContentHandler(</code><code class="java keyword">new</code> <code class="java plain">BodyContentHandler(), metadata);</code></div><div class="line number75 index6 alt2"><code class="java spaces"> </code><code class="java plain">InputStream stream = </code><code class="java keyword">new</code> <code class="java plain">FileInputStream(file);</code></div><div class="line number76 index7 alt1"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number77 index8 alt2"><code class="java spaces"> </code><code class="java plain">parser.parse(stream, handler, metadata, </code><code class="java keyword">new</code> <code class="java plain">ParseContext());</code></div><div class="line number78 index9 alt1"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number79 index10 alt2"><code class="java spaces"> </code><code class="java keyword">finally</code> <code class="java plain">{ </code></div><div class="line number80 index11 alt1"><code class="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number81 index12 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number82 index13 alt1"><code class="java spaces"> </code><code class="java plain">String[] numbers = metadata.getValues(</code><code class="java string">"phonenumbers"</code><code class="java plain">);</code></div><div class="line number83 index14 alt2"><code class="java spaces"> </code><code class="java keyword">for</code> <code class="java plain">(String number : numbers) {</code></div><div class="line number84 index15 alt1"><code class="java spaces"> </code><code class="java plain">phoneNumbers.add(number);</code></div><div class="line number85 index16 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number86 index17 alt1"><code class="java 
plain">}</code></div></div></td></tr></tbody></table></div></div> +<p>By using the <a href="./api/org/apache/tika/sax/PhoneExtractingContentHandler.html">PhoneExtractingContentHandler</a>, you can have any phone numbers found in the textual content of the document extracted and placed into the Metadata object for you.</p><div id="highlighter_339689" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number69 index0 alt2"><code class="java keyword">public</code> <code class="java keyword">static</code> <code class="java keyword">void</code> <code class="java plain">process(File file) </code><code class="java keyword">throws</code> <code class="java plain">Exception {</code></div><div class="line number70 index1 alt1"><code class="java spaces"> </code><code class="java plain">Parser parser = </code><code class="java keyword">new</code> <code class="java plain">AutoDetectParser();</code></div><div class="line number71 index2 alt2"><code class="java spaces"> </code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line number72 index3 alt1"><code class="java spaces"> </code><code class="java comments">// The PhoneExtractingContentHandler will examine any characters for phone numbers before passing them</code></div><div class="line number73 index4 alt2"><code class="java spaces"> </code><code class="java comments">// to the underlying Handler.</code></div><div class="line number74 index5 alt1"><code class="java spaces"> </code><code class="java plain">PhoneExtractingContentHandler handler = </code><code class="java keyword">new</code> <code class="java plain">PhoneExtractingContentHandler(</code><code class="java keyword">new</code> <code class="java plain">BodyContentHandler(), metadata);</code></div><div class="line number75 index6 alt2"><code class="java 
spaces"> </code><code class="java plain">InputStream stream = </code><code class="java keyword">new</code> <code class="java plain">FileInputStream(file);</code></div><div class="line number76 index7 alt1"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number77 index8 alt2"><code class="java spaces"> </code><code class="java plain">parser.parse(stream, handler, metadata, </code><code class="java keyword">new</code> <code class="java plain">ParseContext());</code></div><div class="line number78 index9 alt1"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number79 index10 alt2"><code class="java spaces"> </code><code class="java keyword">finally</code> <code class="java plain">{</code></div><div class="line number80 index11 alt1"><code class="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number81 index12 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number82 index13 alt1"><code class="java spaces"> </code><code class="java plain">String[] numbers = metadata.getValues(</code><code class="java string">"phonenumbers"</code><code class="java plain">);</code></div><div class="line number83 index14 alt2"><code class="java spaces"> </code><code class="java keyword">for</code> <code class="java plain">(String number : numbers) {</code></div><div class="line number84 index15 alt1"><code class="java spaces"> </code><code class="java plain">phoneNumbers.add(number);</code></div><div class="line number85 index16 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number86 index17 alt1"><code class="java plain">}</code></div></div></td></tr></tbody></table></div></div> <div class="section"> <h4><a name="Streaming_the_plain_text_in_chunks">Streaming the plain text in chunks</a></h4> <p>Sometimes, you want to chunk the 
resulting text up, perhaps to output as you go minimising memory use, perhaps to output to HDFS files, or any other reason! With a small custom content handler, you can do that.</p><div id="highlighter_682391" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number124 index0 alt1"><code class="java keyword">public</code> <code class="java plain">List<String> parseToPlainTextChunks() </code><code class="java keyword">throws</code> <code class="java plain">IOException, SAXException, TikaException {</code></div><div class="line number125 index1 alt2"><code class="java spaces"> </code><code class="java keyword">final</code> <code class="java plain">List<String> chunks = </code><code class="java keyword">new</code> <code class="java plain">ArrayList<String>();</code></div><div class="line number126 index2 alt1"><code class="java spaces"> </code><code class="java plain">chunks.add(</code><code class="java string">""</code><code class="java plain">);</code></div><div class="line number127 index3 alt2"><code class="java spaces"> </code><code class="java plain">ContentHandlerDecorator handler = </code><code class="java keyword">new</code> <code class="java plain">ContentHandlerDecorator() {</code></div><div class="line number128 index4 alt1"><code class="java spaces"> </code><code class="java color1">@Override</code></div><div class="line number129 index5 alt2"><code class="java spaces"> </code><code class="java keyword">public</code> <code class="java keyword">void</code> <code class="java plain">characters(</code><code class="java keyword">char</code><code class="java plain">[] ch, </code><code class="java keyword">int</code> <code class="java plain">start, </code><code class="java keyword">int</code> <code class="java plain">length) {</code></div><div class="line number130 index6 alt1"><code class="java spaces"> </code><code class="java plain">String lastChunk = 
chunks.get(chunks.size()-</code><code class="java value">1</code><code class="java plain">);</code></div><div class="line number131 index7 alt2"><code class="java spaces"> </code><code class="java plain">String thisStr = </code><code class="java keyword">new</code> <code class="java plain">String(ch, start, length);</code></div><div class="line number132 index8 alt1"><code class="java spaces"> </code> </div><div class="line number133 index9 alt2"><code class="java spaces"> </code><code class="java keyword">if</code> <code class="java plain">(lastChunk.length()+length > MAXIMUM_TEXT_CHUNK_SIZE) {</code></div><div class="line number134 index10 alt1"><code class="java spaces"> </code><code class="java plain">chunks.add(thisStr);</code></div><div class="line number135 index11 alt2"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">else</code> <code class="java plain">{</code></div><div class="line number136 index12 alt1"><code class="java spaces"> </code><code class="java plain">chunks.set(chunks.size()-</code><code class="java value">1</code><code class="java plain">, lastChunk+thisStr);</code></div><div class="line number137 index13 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number138 index14 alt1"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number139 index15 alt2"><code class="java spaces"> </code><code class="java plain">};</code></div><div class="line number140 index16 alt1"><code class="java spaces"> </code> </div><div class="line number141 index17 alt2"><code class="java spaces"> </code><code class="java plain">InputStream stream = ContentHandlerExample.</code><code class="java keyword">class</code><code class="java plain">.getResourceAsStream(</code><code class="java string">"test2.doc"</code><code class="java plain">);</code></div><div class="line number142 index18 alt1"><code class="java spaces"> </code><code 
class="java plain">AutoDetectParser parser = </code><code class="java keyword">new</code> <code class="java plain">AutoDetectParser();</code></div><div class="line number143 index19 alt2"><code class="java spaces"> </code><code class="java plain">Metadata metadata = </code><code class="java keyword">new</code> <code class="java plain">Metadata();</code></div><div class="line number144 index20 alt1"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number145 index21 alt2"><code class="java spaces"> </code><code class="java plain">parser.parse(stream, handler, metadata);</code></div><div class="line number146 index22 alt1"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">chunks;</code></div><div class="line number147 index23 alt2"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">finally</code> <code class="java plain">{</code></div><div class="line number148 index24 alt1"><code class="java spaces"> </code><code class="java plain">stream.close();</code></div><div class="line number149 index25 alt2"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number150 index26 alt1"><code class="java plain">}</code></div></div></td></tr></tbody></table></div></div></div> @@ -150,7 +150,7 @@ <p>In order to use the Microsoft Translation API, you need to sign up for a Microsoft account, get an API key, then pass the key to Tika before translating.</p><div id="highlighter_457659" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number23 index0 alt2"><code class="java keyword">public</code> <code class="java plain">String microsoftTranslateToFrench(String text) {</code></div><div class="line number24 index1 alt1"><code class="java spaces"> </code><code class="java 
plain">MicrosoftTranslator translator = </code><code class="java keyword">new</code> <code class="java plain">MicrosoftTranslator();</code></div><div class="line number25 index2 alt2"><code class="java spaces"> </code><code class="java comments">// Change the id and secret! See <a href="http://msdn.microsoft.com/en-us/library/hh454950.aspx.">http://msdn.microsoft.com/en-us/library/hh454950.aspx.</a></code></div><div class="line number26 index3 alt1"><code class="java spaces"> </code><code class="java plain">translator.setId(</code><code class="java string">"dummy-id"</code><code class="java plain">);</code></div><div class="line number27 index4 alt2"><code class="java spaces"> </code><code class="java plain">translator.setSecret(</code><code class="java string">"dummy-secret"</code><code class="java plain">);</code></div><div class="line number28 index5 alt1"><code class="java spaces"> </code><code class="java keyword">try</code> <code class="java plain">{</code></div><div class="line number29 index6 alt2"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">translator.translate(text, </code><code class="java string">"fr"</code><code class="java plain">);</code></div><div class="line number30 index7 alt1"><code class="java spaces"> </code><code class="java plain">} </code><code class="java keyword">catch</code> <code class="java plain">(Exception e) {</code></div><div class="line number31 index8 alt2"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java string">"Error while translating."</code><code class="java plain">;</code></div><div class="line number32 index9 alt1"><code class="java spaces"> </code><code class="java plain">}</code></div><div class="line number33 index10 alt2"><code class="java plain">}</code></div></div></td></tr></tbody></table></div></div></div> <div class="section"> <h3><a name="Language_Identification">Language Identification</a></h3> -<p>Tika provides 
support for identifying the language of text, through the <a href="./apidocs/org/apache/tika/language/LanguageIdentifier.html">LanguageIdentifier</a> class.</p><div id="highlighter_164393" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number23 index0 alt2"><code class="java keyword">public</code> <code class="java plain">String identifyLanguage(String text) {</code></div><div class="line number24 index1 alt1"><code class="java spaces"> </code><code class="java plain">LanguageIdentifier identifier = </code><code class="java keyword">new</code> <code class="java plain">LanguageIdentifier(text);</code></div><div class="line number25 index2 alt2"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">identifier.getLanguage();</code></div><div class="line number26 index3 alt1"><code class="java plain">}</code></div></div></td></tr></tbody></table></div></div> +<p>Tika provides support for identifying the language of text, through the <a href="./api/org/apache/tika/language/LanguageIdentifier.html">LanguageIdentifier</a> class.</p><div id="highlighter_164393" class="syntaxhighlighter nogutter java"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td class="code"><div class="container"><div class="line number23 index0 alt2"><code class="java keyword">public</code> <code class="java plain">String identifyLanguage(String text) {</code></div><div class="line number24 index1 alt1"><code class="java spaces"> </code><code class="java plain">LanguageIdentifier identifier = </code><code class="java keyword">new</code> <code class="java plain">LanguageIdentifier(text);</code></div><div class="line number25 index2 alt2"><code class="java spaces"> </code><code class="java keyword">return</code> <code class="java plain">identifier.getLanguage();</code></div><div class="line number26 index3 alt1"><code class="java 
plain">}</code></div></div></td></tr></tbody></table></div></div> <div class="section"> <h3><a name="Additional_Examples">Additional Examples</a></h3> <p>A number of other examples are also available, including all of the examples from the <a class="externalLink" href="http://manning.com/mattmann/">Tika In Action book</a>. These can all be found in the <a class="externalLink" href="https://svn.apache.org/repos/asf/tika/trunk/tika-example">Tika Example module</a> in SVN.</p></div></div> Modified: tika/site/publish/1.3/parser.html URL: http://svn.apache.org/viewvc/tika/site/publish/1.3/parser.html?rev=1687946&r1=1687945&r2=1687946&view=diff ============================================================================== --- tika/site/publish/1.3/parser.html (original) +++ tika/site/publish/1.3/parser.html Sat Jun 27 16:15:19 2015 @@ -131,7 +131,7 @@ try { ... </body> </html></pre></div> -<p>Parser implementations typically use the <a href="./apidocs/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p> +<p>Parser implementations typically use the <a href="./api/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p> <p>Dealing with the raw SAX events can be a bit complex, so Apache Tika comes with a number of utility classes that can be used to process and convert the event stream to other representations.</p> <p>For example, the <a href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a> class can be used to extract just the body part of the XHTML output and feed it either as SAX events to another content handler or as characters to an output stream, a writer, or simply a string. 
The following code snippet parses a document from the standard input stream and outputs the extracted text content to standard output:</p> <div> @@ -173,7 +173,7 @@ try { <h3>Parser implementations<a name="Parser_implementations"></a></h3> <p>Apache Tika comes with a number of parser classes for parsing <a href="./formats.html">various document formats</a>. You can also extend Tika with your own parsers, and of course any contributions to Tika are warmly welcome.</p> <p>The goal of Tika is to reuse existing parser libraries like <a class="externalLink" href="http://pdfbox.apache.org/">PDFBox</a> or <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> as much as possible, and so most of the parser classes in Tika are adapters to such external libraries.</p> -<p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the <a href="./apidocs/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any types of documents. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div></div> +<p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the <a href="./api/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any types of documents. 
This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div></div> </div> <div id="sidebar"> <div id="navigation"> Modified: tika/site/publish/1.4/parser.html URL: http://svn.apache.org/viewvc/tika/site/publish/1.4/parser.html?rev=1687946&r1=1687945&r2=1687946&view=diff ============================================================================== --- tika/site/publish/1.4/parser.html (original) +++ tika/site/publish/1.4/parser.html Sat Jun 27 16:15:19 2015 @@ -131,7 +131,7 @@ try { ... </body> </html></pre></div> -<p>Parser implementations typically use the <a href="./apidocs/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p> +<p>Parser implementations typically use the <a href="./api/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p> <p>Dealing with the raw SAX events can be a bit complex, so Apache Tika comes with a number of utility classes that can be used to process and convert the event stream to other representations.</p> <p>For example, the <a href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a> class can be used to extract just the body part of the XHTML output and feed it either as SAX events to another content handler or as characters to an output stream, a writer, or simply a string. The following code snippet parses a document from the standard input stream and outputs the extracted text content to standard output:</p> <div> @@ -173,7 +173,7 @@ try { <h3>Parser implementations<a name="Parser_implementations"></a></h3> <p>Apache Tika comes with a number of parser classes for parsing <a href="./formats.html">various document formats</a>. 
You can also extend Tika with your own parsers, and of course any contributions to Tika are warmly welcome.</p> <p>The goal of Tika is to reuse existing parser libraries like <a class="externalLink" href="http://pdfbox.apache.org/">PDFBox</a> or <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> as much as possible, and so most of the parser classes in Tika are adapters to such external libraries.</p> -<p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the <a href="./apidocs/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any types of documents. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div></div> +<p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the <a href="./api/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any types of documents. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div></div> </div> <div id="sidebar"> <div id="navigation"> Modified: tika/site/publish/1.5/parser.html URL: http://svn.apache.org/viewvc/tika/site/publish/1.5/parser.html?rev=1687946&r1=1687945&r2=1687946&view=diff ============================================================================== --- tika/site/publish/1.5/parser.html (original) +++ tika/site/publish/1.5/parser.html Sat Jun 27 16:15:19 2015 @@ -131,7 +131,7 @@ try { ... 
</body> </html></pre></div> -<p>Parser implementations typically use the <a href="./apidocs/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p> +<p>Parser implementations typically use the <a href="./api/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p> <p>Dealing with the raw SAX events can be a bit complex, so Apache Tika comes with a number of utility classes that can be used to process and convert the event stream to other representations.</p> <p>For example, the <a href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a> class can be used to extract just the body part of the XHTML output and feed it either as SAX events to another content handler or as characters to an output stream, a writer, or simply a string. The following code snippet parses a document from the standard input stream and outputs the extracted text content to standard output:</p> <div> @@ -173,7 +173,7 @@ try { <h3>Parser implementations<a name="Parser_implementations"></a></h3> <p>Apache Tika comes with a number of parser classes for parsing <a href="./formats.html">various document formats</a>. You can also extend Tika with your own parsers, and of course any contributions to Tika are warmly welcome.</p> <p>The goal of Tika is to reuse existing parser libraries like <a class="externalLink" href="http://pdfbox.apache.org/">PDFBox</a> or <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> as much as possible, and so most of the parser classes in Tika are adapters to such external libraries.</p> -<p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. 
The most notable of these is the <a href="./apidocs/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any types of documents. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div></div> +<p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the <a href="./api/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any types of documents. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div></div> </div> <div id="sidebar"> <div id="navigation"> Modified: tika/site/publish/1.6/parser.html URL: http://svn.apache.org/viewvc/tika/site/publish/1.6/parser.html?rev=1687946&r1=1687945&r2=1687946&view=diff ============================================================================== --- tika/site/publish/1.6/parser.html (original) +++ tika/site/publish/1.6/parser.html Sat Jun 27 16:15:19 2015 @@ -131,7 +131,7 @@ try { ... 
</body> </html></pre></div> -<p>Parser implementations typically use the <a href="./apidocs/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p> +<p>Parser implementations typically use the <a href="./api/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p> <p>Dealing with the raw SAX events can be a bit complex, so Apache Tika comes with a number of utility classes that can be used to process and convert the event stream to other representations.</p> <p>For example, the <a href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a> class can be used to extract just the body part of the XHTML output and feed it either as SAX events to another content handler or as characters to an output stream, a writer, or simply a string. The following code snippet parses a document from the standard input stream and outputs the extracted text content to standard output:</p> <div> @@ -173,7 +173,7 @@ try { <h3>Parser implementations<a name="Parser_implementations"></a></h3> <p>Apache Tika comes with a number of parser classes for parsing <a href="./formats.html">various document formats</a>. You can also extend Tika with your own parsers, and of course any contributions to Tika are warmly welcome.</p> <p>The goal of Tika is to reuse existing parser libraries like <a class="externalLink" href="http://pdfbox.apache.org/">PDFBox</a> or <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> as much as possible, and so most of the parser classes in Tika are adapters to such external libraries.</p> -<p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. 
The most notable of these is the <a href="./apidocs/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any types of documents. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div></div> +<p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the <a href="./api/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any types of documents. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div></div> </div> <div id="sidebar"> <div id="navigation">