svn commit: r68146 [2/3] - in /dev/tika: 2.9.1/ 2.9.2/

tallison Tue, 26 Mar 2024 07:23:58 -0700


Added: dev/tika/2.9.2/CHANGES-2.9.2.txt
==============================================================================
--- dev/tika/2.9.2/CHANGES-2.9.2.txt (added)
+++ dev/tika/2.9.2/CHANGES-2.9.2.txt Tue Mar 26 14:23:51 2024
@@ -0,0 +1,3205 @@
+Release 2.9.2 - 3/26/2024
+
+   * Dependency upgrades including temporary workarounds for regressions in commons-compress.
+
+   * Add detection for OpenSCAD, 3MF, AMF, STL file formats via Robin Schimpf (TIKA-4222, TIKA-4223,
+     TIKA-4224, TIKA-4225).
+
+Release 2.9.1 - 10/17/2023
+
+   * Dependency upgrades including commons-compress to fix CVE-2023-42503.
+
+   * Improve RFC822 detection (TIKA-4153).
+
+   * Enable configuration of "maxJsonStringFieldLength" in TikaConfig to allow users to
+     avoid DEFAULT_MAX_STRING_LEN exceptions from Jackson (TIKA-4154).
+
+   * Fix bug in DateUtils that stripped timezone information from
+     incoming Calendar objects (TIKA-4126).
+
+   * The InputStreamDigester now calculates stream length (TIKA-4016).
+
+Release 2.9.0 - 8/23/2023
+
+   * With user configuration, the PDFParser can now throw an EncryptedDocumentException
+     for Microsoft IRM PDF containers with encrypted payloads. Separately,
+     the PDFParser now throws an EncryptedDocumentException instead of an IOException
+     if the security handler cannot be found (TIKA-4082).
+
+   * Fix bug that led to duplicate extraction of macros from some OLE2 containers (TIKA-4116).
+
+   * Parse iframe's srcdoc as an embedded file (TIKA-3109).
+
+   * Add detection of warc.gz as a specialization of gz and parse as if a standard WARC (TIKA-4048).
+
+   * Allow users to modify the attachment limit size in the /unpack resource (TIKA-4039)
+   
+   * Fixed write limit bug in RecursiveParserWrapper (TIKA-4055).
+
+   * Add mime detection for many files with thanks to Gregory Lepore (TIKA-3992).
+   
+   * Fixed iWork 13 keynote detection on files with wrong extension (TIKA-4111).
+
+Release 2.8.0 - 5/11/2023
+
+   * Enable counting and/or parsing of incremental updates in PDFs.  This
+     is an experimental feature and may change in later releases (TIKA-4017).
+
+   * Fixed bug that prevented the loading of CompositeExternalParser in tika-app and
+     tika-server-standard. This parser will call exiftool and ffmpeg if those are installed, as was
+     the behavior in Tika 1.x. Exclude org.apache.tika.parser.external.CompositeExternalParser
+     if you do not want this behavior (TIKA-4022).
+
+   * Removed the shading of tika-parsers-standard-module (TIKA-4038).
+
+   * Enable optional extraction of file system metadata in FileSystemFetcher (TIKA-4035).
+
+   * Allow pretty printing in FileSystemEmitter (TIKA-4034).
+
+   * Add detection for and a new mime type for older postscript-based
+     Adobe Illustrator "application/illustrator+ps" files (TIKA-3971).
+
+   * Add magic detection for canon raw file types: crw, cr2 and cr3 (TIKA-3991).
+
+   * Add detection for ONIX message files (TIKA-4011).
+
+   * Add detection and a parser for ActiveMime files (TIKA-3987).
+
+   * Add extraction of rendition layout value and version from Epub (TIKA-4013).
+
+   * Improve embedded file extraction from PDFs (TIKA-4012).
+
+   * Improve metadata extraction from WARCs (TIKA-4018).
+
+   * Update to PDFBox 2.0.28 (TIKA-4016).
+
+   * Users may now avoid the ZeroByteFileException via a
+     setting on the AutoDetectParserConfig (TIKA-3976).
+
+   * Fix bug in closing <a> elements in the presence of <b> elements
+     in RTF files (TIKA-3972).
+
+   * Improve extraction of embedded file names in .docx (TIKA-3968).
+
+   * Normalize author, title, subject and description to their Dublin Core
+     properties in the HTMLParser (TIKA-3963).
+
+
+Release 2.7.0 - 1/31/2023
+
+   * Add SVG detection for svg files that lack the xml header (TIKA-3308).
+
+   * Migrate to a live fork of Universal Charset Detector (TIKA-3213).
+
+   * Improve handling of text-based attachments inside .eml files (TIKA-3959).
+
+   * Add tika-parser-nlp-package to release artifacts (TIKA-3958).
+
+   * Remove need for <params/> element in classes that extend ConfigBase (TIKA-3946).
+
+   * Add X-TIKA:embedded_id_path to ensure unique embedded file paths (TIKA-3942).
+
+   * Fix bug that prevented digests when the fallback/EmptyParser
+     was called (TIKA-3939).
+
+   * Remove log4j 1.2.x (and slf4j-log4j12 which now redirects to slf4j-reload4j) from
+     all modules (TIKA-3935).
+
+   * Upgrade mime4j to 0.8.9 (TIKA-3950).
+
+   * Refactor date parsing for emails (TIKA-3957)
+
+   * Upgrade to Bouncy Castle 1.71 and jdk18on jars (TIKA-3933).
+
+   * Add a JDBCPipesReporter (TIKA-3931).
+
+   * Add multivalued field strategy option in jdbc-emitter (TIKA-3930).
+     Default is now 'concatenate' with ', ' as the delimiter.
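As a rough illustration of what a 'concatenate' strategy with ', ' as the delimiter does to a multivalued field, here is a minimal Java sketch. The class and method names are hypothetical and are not taken from the jdbc-emitter; only the delimiter comes from the entry above.

```java
import java.util.List;

// Hypothetical illustration only; this is not the jdbc-emitter's code.
class ConcatenateStrategy {

    // Collapse all values of a multivalued metadata field into one string.
    static String concatenate(List<String> values, String delimiter) {
        return String.join(delimiter, values);
    }

    public static void main(String[] args) {
        // Two authors end up in a single column value.
        System.out.println(concatenate(List.of("Alice", "Bob"), ", "));
    }
}
```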
+
+   * Downgrade logging in PipesClient for each parse from info to debug.
+
+Release 2.6.0 - 11/3/2022
+
+   * Add optional Siegfried detector (TIKA-3901).
+
+   * Move OverrideDetector's functionality to the CompositeDetector (TIKA-3904).
+
+   * The FileCommandDetector has been refactored to have the same
+     behavior as the Siegfried detector; see setUseMime in the javadoc (TIKA-3902).
+
+   * Fix bug in OpenSearch emitter that prevented upserts on
+     documents with embedded files (TIKA-3882).
+
+   * Extract PDF actions and triggers into the file's metadata (TIKA-3887).
+
+   * Add a tika-async-cli module (TIKA-3885).
+
+   * Fetch keys sent via headers to tika server are now URL decoded (TIKA-3864).
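For clients, the practical effect is that a fetch key containing spaces or non-ASCII characters can be URL encoded before it is placed in a header. The sketch below shows the round trip using only the JDK; the class and method names are hypothetical, and the assumption that UTF-8 is the charset in use is mine, not the changelog's.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the encode/decode round trip; not tika-server code.
class FetchKeyDecoding {

    // What a client could do before putting the key in a header.
    static String encodeFetchKey(String rawKey) {
        return URLEncoder.encode(rawKey, StandardCharsets.UTF_8);
    }

    // What a server-side URL decode step does to the header value.
    static String decodeFetchKey(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encodeFetchKey("reports/annual report.pdf");
        System.out.println(encoded);
        System.out.println(decodeFetchKey(encoded));
    }
}
```

Note that URLDecoder treats a literal "+" as a space, so keys that contain "+" need to be encoded by the client.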
+
+
+Release 2.5.0 - 09/30/2022
+
+   * Improved extraction of PDF subset info for PDF/UA, PDF/VT, and PDF/X.
+     NOTE: we no longer append PDF/A information, e.g. 'version="A-1b"'
+     to the 'dc:format'. Users must now get that information from the
+     'pdfa:PDFVersion' key or from 'pdfaid:conformance'
+     and 'pdfaid:part' (TIKA-3844).
+
+   * Avoid infinite loop in bookmark extraction from PDFs (TIKA-3832).
+
+   * Upgraded to slf4j 2.0.1 (TIKA-3842).
+
+   * Added upsert option for the OpenSearch emitter (TIKA-3855).
+
+   * Extract PDF signature information at the document level
+     into the metadata (TIKA-3852).
+
+   * Enable configuration of digests via AutoDetectParserConfig (TIKA-3853).
+
+   * Use commons-io byte array streams via PJ Fanning (TIKA-3843).
+
+   * Upgrade to PDFBox 2.0.27 (TIKA-3866).
+
+   * Upgrade to JempBox 1.8.17 (TIKA-3856).
+
+   * Add extraction of ODF version from ODF files (TIKA-3840).
+
+   * tika-parser-html-commons (BoilerPipeHandler) is no longer
+     a dependency of tika-parser-html-module. tika-app and tika-server-standard
+     have added a dependency on tika-parser-html-commons.  However,
+     users who are managing custom dependencies and who want the BoilerPipeHandler
+     will have to now include the tika-parser-html-commons dependency
+     (TIKA-1484).
+
+   * Add unrar as an optional parser (TIKA-3800).
+
+   * Refactor FuzzingCLI to use PipesParser (TIKA-3799).
+
+   * ServiceLoader's loadServiceProviders() now guarantees
+     unique classes (TIKA-3797).
+
+   * Fix bug that prevented setting of includeHeadersAndFooters
+     for xls, xlsx, doc and docx via tika-config (TIKA-3796).
+
+   * Fix bug that prevented specification of rendered image type
+     via http header in the PDFParser (TIKA-3794).
+
+   * Fix bug causing some Exif dates to be decoded wrongly on
+     timezones different than UTC (TIKA-3815).
+
+   * Numerous dependency upgrades (TIKA-3795).
+
+   * Add ALPHA-level initial releases of JDBCEmitter,
+     FileSystemStatusReporter and OpenSearchPipesReporter.
+     These may have breaking changes in subsequent releases.
+
+Release 2.4.1 - 06/14/2022
+
+   * Implement bulk upload in the OpenSearch emitter (TIKA-3791).
+
+   * Implement tika-server client via pipes mode (TIKA-3790).
+
+   * Custom embedded parsers and EmbeddedDocumentHandlers
+     can now add metadata to the container file's
+     metadata (TIKA-3789).
+
+   * Record embedded file exceptions in the container
+     file's metadata (TIKA-3788).
+
+   * Allow continuation of parsing after write limit has
+     been reached (TIKA-3787).
+
+   * Allow pass-through of 'Content-Length' header to metadata
+     in TikaResource (TIKA-3786).
+
+   * Add embedded depth to profiles tables in tika-eval (TIKA-3775).
+
+   * Add stop() method to TikaServerCli so that it can be run
+     with Apache Commons Daemon (TIKA-1570).
+
+   * Fixed bug in ordering of Parsers during service loading (TIKA-3750).
+
+   * Users can expand system properties from the forking
+     process into forked tika-server processes (TIKA-3748).
+
+   * Fix a few files being wrongly detected as EML (TIKA-3771).
+
+   * Fix ignoreCharsets param of Icu4jEncodingDetector (TIKA-3774).
+
+Release 2.4.0 - 04/23/2022
+
+   * NOTE: To save on resources, we no longer include the
+     deeplearning4j dependencies in the tika-dl jar. The dependencies for the
+     tika-dl package must be provided by users.  See:
+     https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-ml/tika-dl/pom.xml
+     for the dependencies that must be provided at run-time (TIKA-3676).
+
+   * NOTE: Added prefix "dwg-custom:" to DWG custom metadata properties (TIKA-3731).
+
+   * Add initial, BETA-grade TLS encryption option for tika-server;
+     configuration may change in future releases (TIKA-3719).
+
+   * Allow specification of fetcherName and fetchKey via query parameters
+     in request URI in tika-server (TIKA-3714).
+
+   * Add basic parsers for WARC and WACZ in tika-parsers-standard (TIKA-3697).
+
+   * Add MetadataWriteFilter capability to improve memory profile in
+     Metadata objects (TIKA-3695).
+
+   * Allow configurability of the ContentHandlerDecorator used
+     by the AutoDetectParser (TIKA-3723).
+
+   * Allow configurability of the EmbeddedDocumentExtractor used
+     by the AutoDetectParser (TIKA-3711).
+
+   * Add detection for Frictionless Data packages and WACZ (TIKA-3696).
+
+   * Add detection for DGN files with gratitude and credit
+     to Steven Frew's tika-dgn-detector (TIKA-3721).
+
+   * Add parser for metadata from DGN 8 files via Dan Coldrick (TIKA-3721).
+
+   * Add a fetcher and emitter for Azure blob storage (TIKA-3707).
+
+   * Add detection for files encrypted by Microsoft's Rights Management Service
+     (TIKA-3666).
+
+   * Fixed regression in 2.3.0 that led to more embedded filenames
+     than appropriate being written to the content (TIKA-3711).
+
+   * tika-server now clones forking process' environment variables
+     into forked process (TIKA-3715).
+
+   * Add an optional /eval endpoint for tika-eval profile or compare
+     capabilities in tika-server (TIKA-3689).
+
+   * Add a Parsed-By-Full-Set metadata item to record all parsers that processed
+     a file (TIKA-3716).
+
+   * Add metadata filters for Optimaize and OpenNLP language detectors (TIKA-3717).
+
+   * Upgrade to PDFBox 2.0.26 (TIKA-3726).
+
+   * Upgrade deeplearning4j to 1.0.0-M2 (TIKA-3458 and PR#527).
+
+   * Various dependency upgrades, including POI, dl4j, gson, jackson,
+     twelvemonkeys, log4j2 and others (TIKA-3675 and many PRs from dependabot).
+
+   * Switch cipher from ECB to GCM in HttpClientFactory (TIKA-3724).
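For context on this entry: ECB encrypts identical plaintext blocks to identical ciphertext blocks and offers no integrity check, while GCM uses a per-message IV and appends an authentication tag. The following is a generic JDK sketch of AES in GCM mode. It is not the code in HttpClientFactory; the class name, key size, IV length and tag length are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Generic AES/GCM sketch; names and parameter choices are illustrative assumptions.
class GcmSketch {
    static final int IV_BYTES = 12;   // commonly used IV length for GCM
    static final int TAG_BITS = 128;  // authentication tag length

    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // A fresh random IV must be used for every message encrypted with the same key.
    static byte[] newIv() {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        return iv;
    }

    // mode is Cipher.ENCRYPT_MODE or Cipher.DECRYPT_MODE.
    static byte[] crypt(int mode, SecretKey key, byte[] iv, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(mode, key, new GCMParameterSpec(TAG_BITS, iv));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        byte[] iv = newIv();
        byte[] plain = "password".getBytes(StandardCharsets.UTF_8);
        byte[] encrypted = crypt(Cipher.ENCRYPT_MODE, key, iv, plain);
        byte[] decrypted = crypt(Cipher.DECRYPT_MODE, key, iv, encrypted);
        System.out.println(new String(decrypted, StandardCharsets.UTF_8));
    }
}
```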
+
+Release 2.3.0 - 02/02/2022
+
+   * Upgrade to Apache POI 5.2.0. This is the first upgrade to POI
+     5.x and represents a major refactoring. Users may experience
+     significantly more logging (TIKA-3164).
+
+   * Upgrade to log4j2 2.17.1 (TIKA-3638).
+
+   * Improve consistency in reporting package-entry divs across
+     all parsers for embedded files (TIKA-3644). This leads
+     to some more text (embedded file names) in files with
+     many embedded attachments.
+
+   * Improve configuration of maps as params for parsers in
+     TikaConfig (TIKA-3645).
+
+   * Improve identification of iWorks 13 files and add parsing
+     for thumbnails, some metadata and attachments (TIKA-3634).
+     Skip handling of .iwa files, which are not yet supported.
+
+   * Limit the default in-memory processing (maxMainMemoryBytes) in
+     the PDFParser to 512MB as in the 1.x branch (TIKA-3642).
+
+   * Added IDML Parser from 1.x series to 2.x series (TIKA-3188).
+
+   * Extract annotation types and subtypes for PDFs into metadata (TIKA-3653).
+
+   * Add metadata value for PDFs that contain 3D annotations (TIKA-3653).
+
+   * Add parser for Translation Memory eXchange (TMX) files (TIKA-3660).
+
+   * Add Bill of Materials (Maven BOM) for centralized module version management (TIKA-3367).
+
+
+Release 2.2.1 - 12/19/2021
+
+   * Fix multithreading bug for ooxml files (TIKA-3627).
+
+   * Upgrade log4j to 2.17.0 (TIKA-3625).
+
+   * Upgrade to PDFBox 2.0.25 (TIKA-3622)
+
+   * Fix bug that prevented metadata keys in the UnpackerResource
+     in tika-server (TIKA-3624).
+
+   * Upgrade log4j to 2.16.0 (TIKA-3623)
+
+Release 2.2.0 - 12/13/2021
+
+   * Add support for OneNote files downloaded from O365 (TIKA-3446).
+
+   * Fix logic bug in PipesServer that prevented concatenation of
+     content from attachments (TIKA-3609).
+
+   * Improve extraction of embedded files from MSOffice files created
+     by non-Microsoft tools (TIKA-3526).
+
+   * Added back ability to ignore load errors in TikaConfig (TIKA-3575).
+
+   * Make SecureContentHandler and other parameters configurable in
+     AutoDetectParser programmatically and via tika-config.xml (TIKA-3594).
+
+   * Fix default logging in tika-app in batch mode (TIKA-3589).
+
+   * Fix bug that prevented specifying a config with the long
+     --config= option in tika-app in batch mode (TIKA-3589).
+
+   * Fix thread starvation after numerous restarts in
+     PipesClient (TIKA-3588).
+
+   * Fix race condition when starting multiple forked
+     servers on multiple ports (TIKA-3586).
+
+   * Add timeout per task to be configured via headers
+     for tika-server's legacy endpoints /tika and /rmeta.
+     Note that this timeout cannot be greater than taskTimeoutMillis (TIKA-3582).
+
+   * Add metadata item for whether or not a PDF has a collection/
+     is a Portfolio PDF (TIKA-3579).
+
+   * Add detection of ESRI Layer files (TIKA-3570).
+
+   * Add detection of JPEG XL, MARC, ICC profiles, NES-ROM file types
+     (TIKA-3562 and TIKA-3563)
+
+   * Remove duplicate "subject" metadata keys that were intended
+     for backwards compatibility within 1.x only (TIKA-3564).
+
+   * Fix Open Office mime types to be subclasses of application/zip
+     and no longer require OPCPackageDetector-last ordering of zip
+     detectors (TIKA-3556).
+
+   * Improve robustness and features of the httpfetcher (TIKA-3543)
+   
+   * Add optional fetch ranges to FetchEmitTuple to allow range fetching from,
+     e.g. http or s3 (TIKA-3542).
+
+   * Exclude dependencies on jsoup and ehcache in ucar grib/cdm (TIKA-3003).
+
+
+Release 2.1.0 - 08/18/2021
+
+   MAJOR CHANGES in 2.1.0:
+
+   * Improved packaging for tika-parsers-extended. Use the tika-parser-scientific-package and
+     tika-parser-sqlite3-package artifacts if you want fat jars with dependencies. (TIKA-3510)
+
+   * Tika app writes UTF-8 when an encoding is not specified; the legacy behavior
+     was UTF-8 on Mac OS, but System default on other OSs (TIKA-3515).
+
+   * Change the default rendering strategy for PDFs from NO_TEXT to ALL (TIKA-3520).
+
+   Other changes:
+
+   * Fixed bug that pointed to the wrong tessdata directory if the user specified
+     a tesseract path but not also a tessdata path (TIKA-3518).
+
+   * Fixed bug in Icu4j's encoding detector where it would return non-standard
+     names for charsets, e.g. IBM424_rtl is now returned as IBM424 (TIKA-3516).
+
+   * Add a simple UrlFetcher in tika-core as a basic alternative
+     to tika-fetcher-http (TIKA-3527).
+
+   * Add tika-pipes support for Google Cloud Storage (TIKA-3524).
+
+   * Fix markup ordering errors in xhtml output for ODT files (TIKA-2242).
+
+   * Fix serialization of embedded docs in OpenSearch emitter
+     and fix embedded documents not being indexed in some use
+     cases in the Solr emitter (TIKA-3490).
+
+   * Add pipesClientId system property to PipesServer so that each
+     forked process can log to its own logger (TIKA-3480).
+
+   * Add DateNormalizingMetadataFilter to let users ensure that all dates
+     emitted to Solr/OpenSearch are in UTC. Users can configure which
+     timezone they'd like to use in cases where the file format does
+     not store a timezone (TIKA-3496).
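The idea of the entry above can be sketched with java.time: a date with no zone is interpreted in a configured default zone and then converted to UTC. This is an illustration of the concept only, not the filter's implementation; the class name, method name and the example zone are assumptions.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Conceptual sketch; not the DateNormalizingMetadataFilter implementation.
class UtcNormalizer {

    // Interpret a zoneless ISO-8601 date in defaultZone, then express it in UTC.
    static String toUtc(String zonelessIsoDate, String defaultZone) {
        LocalDateTime local = LocalDateTime.parse(zonelessIsoDate);
        return local.atZone(ZoneId.of(defaultZone))
                .withZoneSameInstant(ZoneOffset.UTC)
                .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }

    public static void main(String[] args) {
        // 10:00 in New York during daylight saving time is 14:00 UTC.
        System.out.println(toUtc("2021-08-18T10:00:00", "America/New_York"));
    }
}
```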
+
+   * Breaking change in the Solr and OpenSearch emitters. To achieve
+     the SKIP or CONCATENATE attachment strategy, modify the
+     parseMode in the pipesiterators or in the FetchEmitTuple (TIKA-3494).
+
+Release 2.0.0 - 07/07/2021
+
+   * Cleanup of fetcher integration with tika-server.
+
+   * Update dependencies.
+
+Release 2.0.0-BETA - 05/19/2021
+   
+   * Refactor pipes module for resilience
+
+   * Add transcribe capability (TIKA-94).
+
+Release 2.0.0-ALPHA - 01/13/2021
+
+   BREAKING CHANGES in 2.0.0
+   * General
+     * OCR is now triggered automatically for PDFs if tesseract
+       is on the user's path; see (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr)
+       for how to disable OCR.
+     * We upgraded from log4j to log4j2 in tika-app, tika-server and anywhere else
+       we used to use log4j.
+     * By default, when rendering a page for OCR, the PDFParser does not render glyphs/text.
+     * Removed deprecated Metadata keys/properties (TIKA-1974).
+     * Removed deprecated PDFPreflightParser (TIKA-3437).
+     * Removed dangerous calls to read an inputstream or convert to bytes
+       without specifying a charset
+     * Parsers can be configured via tika-config.xml on instantiation.
+       We have moved away from configuration via .properties files because
+       of confusion among users.  This affects the PDFParser, TesseractOCRParser
+       and the StringsParser.
+     * Changed namespaces of translator implementations (o.a.t.language.translate.impl) to avoid
+       split-package with tika-core
+
+   * tika-parsers
+     * The parser modules have been broken into three main modules:
+        tika-parsers-standard, tika-parsers-extended and tika-parsers-ml.
+        Users may now need to add tika-parsers-extended's
+        tika-parser-scientific-module or tika-parser-sqlite3-module to tika-app and
+        tika-server to include parsers that used to be included by default
+        (for example: envi, gdal, grib, isatab, netcdf, sqlite3).
+     * PDFParser -- a) see above on OCR. b) This parser no longer warns if the jpeg2000
+       dependency is not included. Tika now relies on PDFBox to log an error if a jpeg2000
+       image should be processed but can't because the required external dependency is
+       not available.  See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
+       for the non-ASF-2.0-compatible jpeg2000 library.
+     * CompressorParser -- users must add the com.github.luben:zstd-jni dependency to
+       the classpath to process zstd files.  This is an optional library that is no longer bundled
+       in tika-parsers-standard-package because it contains native libs.
+     * ChmParser was moved to org.apache.tika.parser.microsoft.chm
+     * RTFParser was moved to org.apache.tika.parser.microsoft.rtf
+     * We are now using non-shaded versions of xmpcore with namespaces com.adobe.internal.*
+       vs com.adobe.*.
+
+   * tika-app
+     * See above on default inclusion of only tika-parsers-standard.
+
+   * tika-server
+     * tika-server now by default forks a process to isolate the parsing
+       in the forked process (this was called the -spawnChild option
+       in tika-1.x).  Clients must now expect that tika-server
+       will restart on OOM, timeouts, crashes or after parsing a
+       large number of files.  When this happens tika-server will restart and not
+       receive connections for brief periods.  The less robust, legacy behavior
+       of not forking a process is available with "-noFork".
+     * Most of tika-server's legacy configuration via the commandline has been moved
+       into configuration via a tika-config.xml file.
+     * tika-server's "enableFileUrl" has been removed in favor of a FileSystemFetcher.
+     * tika-server's /metadata endpoint requires tika-server-standard to write XMP/rdf output.
+       This output is not available in tika-server-core.
+     * In tika-server, for those parsers that can be configured per parse via a config object
+       passed in through the ParseContext, the config object will only update those fields
+       that the user has modified.  The config object will no longer
+       fully reset all settings to the default settings per parse.
+       This gives the more intuitive behavior of updating the base/configured settings with
+       what has been changed in the config object.
+
+  * tika-eval
+    * tika-eval's default profile and comparison reports no longer include tag reports.
+      Users can get the report configs that include tags (*-tags.xml):
+      https://github.com/apache/tika/tree/main/tika-eval/tika-eval-app/src/main/resources
+
+Release 1.27 - 06/30/2021
+
+   * Migrate MP4 parsing to Drew Noakes' metadata-extractor (TIKA-3459).
+     To revert to legacy parser turn off NoakesMP4Parser and turn on MP4Parser
+     via tika-config.xml.
+
+   * Prevent rare infinite loop in tika-server's -spawnChild mode
+     when restart fails because of failure to bind to the port (TIKA-3441).
+
+   * Improve likelihood that tesseract will not be orphaned on
+     jvm restart in tika-server (TIKA-3441).
+
+   * Deprecate experimental PDFPreflightParser (TIKA-3437).
+
+   * Apply encoding detection to zip entry names via Ryan421 (TIKA-3374).
+
+   * Add json output for /tika endpoint in tika-server (TIKA-3352).
+
+   * Tika's PDFParser should use the underlying file if one is passed in
+     via a TikaInputStream (TIKA-3350)
+
+Release 1.26 - 03/24/2021
+
+   * Fix thread safety bug in OpenOffice parser (TIKA-3334).
+
+   * The "writeLimit" header now pertains to the combined characters
+     written per container document (and embedded documents) in the /rmeta
+     endpoint in tika-server (TIKA-3325); it no longer functions only
+     per container or embedded document.
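One way to picture the change above is a single character budget shared by the container document and all of its embedded documents, instead of a fresh budget for each. The sketch below is a hypothetical model of that accounting, not tika-server's handler code; all names are invented for illustration.

```java
// Hypothetical model of a combined write limit; not tika-server's implementation.
class SharedWriteBudget {
    private final int writeLimit;
    private int written;

    SharedWriteBudget(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    // Accept as many characters as the remaining budget allows; return how many were kept.
    int accept(String text) {
        int remaining = writeLimit - written;
        int accepted = Math.min(remaining, text.length());
        written += accepted;
        return accepted;
    }

    boolean exhausted() {
        return written >= writeLimit;
    }

    public static void main(String[] args) {
        SharedWriteBudget budget = new SharedWriteBudget(10);
        System.out.println(budget.accept("container")); // text from the container document
        System.out.println(budget.accept("embedded"));  // text from an embedded document
        System.out.println(budget.exhausted());
    }
}
```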
+
+   * Extract more embedded files in PDFs by recursively processing the
+     embedded file tree (TIKA-3332).
+
+   * Allow for case insensitive headers for configuration of the PDFParser
+     and the TesseractOCRParser in tika-server via Subhajit Das (TIKA-3320).
+
+   * Improve detection and parsing of XPS files (TIKA-3316).
+
+   * General dependency upgrades (TIKA-3244).
+
+   * Great optimization in ForkParser (TIKA-3237).
+
+   * Fix parsing of emails attached to other emails in PST files (TIKA-3004).
+
+   * MP3 parser should output the xmpDM:duration metadata as seconds not
+     milliseconds, consistent with the other Audio and Video parsers (TIKA-3318).
+
+   * MP4 parser now checks whether any of the Compatible Brands match when identifying
+     the subtype (TIKA-3310).
+
+Release 1.25 - 11/25/2020
+
+   * Fix inconsistent license in xmpcore (TIKA-3204).
+
+   * General upgrades including some dependencies with
+     recently found security vulnerabilities (TIKA-3119).
+
+   * Add detection and a parser for flat ODF files (TIKA-3159).
+
+   * Add extraction of macros from ODF files  (TIKA-3161).
+
+   * Add mime detection for hprof and hprof text files (TIKA-3144).
+
+   * Add TextSignature and TextProfileSignature to tika-eval (TIKA-3145 and TIKA-3146)
+
+   * Create a metadata filter to trigger tika-eval stats post parsing (TIKA-3140)
+
+   * Add a configurable metadata-filter for the RecursiveParserWrapper (TIKA-3137)
+
+   * Parameterize writeLimit and maxEmbeddedResources for RecursiveParserWrapper
+     in tika-server (TIKA-3133)
+
+   * Add status endpoint to tika-server (TIKA-3129).
+
+   * Remove whitelist/blacklist terminology (TIKA-3120)
+
+   * Add detection for parquet files (TIKA-3115).
+
+   * Add detection and parsing for bplist (TIKA-3104).
+
+   * Enable metadata value filtering for RecursiveParserWrapper (TIKA-3137)
+
+   * Add a basic parser for plist files based on com.googlecode.plist:dd-plist (TIKA-3104).
+
+   * Read hyperlinked images from ODT files (TIKA-3156).
+
+   * Updated GrobidRESTParser to use new API location (TIKA-3191).
+
+   * Add FileProfiler to tika-eval (TIKA-3216).
+
+   * Improved handling of zip files with STORED entries with
+     data descriptor (TIKA-3196).
+
+   * Add parsers for XLZ, IDML and MIF (TIKA-2976, TIKA-3188 and TIKA-3189).
+
+   * Add the beginnings of a format-aware fuzzing module (TIKA-3083).
+
+   * Add wrapper for Linux 'file' command for mime detection (TIKA-3215).
+
+   * Added ability to skip parsing of embedded files in Tika Server (TIKA-3227).
+
+Release 1.24.1 - 4/17/2020
+
+   * Allow gzip compression of input and output streams for tika-server (TIKA-3073).
+
+Release 1.24 - 3/11/2020
+
+   * Add scripts to run tika-server as a service via Eric Pugh,
+    and add these scripts and jar as a new artifact in the release (TIKA-3010).
+
+   * Upgrade Drew Noakes' metadata-extractor (TIKA-2952).
+
+   * Enable optional extraction of structural tags in PDFs (alpha-grade) (TIKA-3026).
+
+   * Tika app's --extract mode now outputs to STDOUT (TIKA-3035).
+
+   * Add an optional Preflight parser for PDFs (TIKA-3055).
+
+   * Improve detection of some zip-based formats (TIKA-3057).
+
+   * Upgrade metadata-extractor to 2.13.0 (TIKA-2952).
+
+   * Upgrade to POI 4.1.2 (TIKA-3047).
+
+   * Extract XMP from PSD files (TIKA-3050).
+
+   * Added XMLProfiler as an optional parser to profile XFA and XMP
+     in PDFs (TIKA-3045).
+
+   * Extract inline images that rely on the DCT filter from PDFs (TIKA-3041).
+
+   * Upgrade to PDFBox 2.0.19 (TIKA-3033).
+
+   * Fix bug in ASM parser configuration (TIKA-2992).
+   
+   * Upgrade to java-libpst 0.9.3 (TIKA-2546).
+
+   * Fixed XLIFF12Parser failures with ToXMLHandler (TIKA-3014).
+
+Release 1.23 - 12/02/2019
+
+   * NOTE: The PDFParser now relies on OCRDPI to render page images when
+     users configure OCR on rendered page images. This will have the effect
+     of increasing rendered image size (TIKA-2624).
+
+   * NOTE: tika-server no longer returns 415 for file types for which there
+     is no parser.
+
+   * Fix bug in AUTO OCR strategy in the PDFParser (TIKA-3002).
+
+   * Fix incorrect height and width metadata extraction from JPEG images (TIKA-2630).
+
+   * Upgrade to POI 4.1.1 (TIKA-2851).
+
+   * Upgrade to PDFBox 2.0.17 (TIKA-2951).
+
+   * Ensure that the PDFParser respects custom configuration of Tesseract
+     from tika-config.xml via Eric Pugh (TIKA-2970).
+
+   * Add parser for XLIFF v1.2 files (TIKA-2975).
+
+   * Add mime type detection support for WebAssembly (TIKA-2894),
+     HEIF / HEIC images (TIKA-2942), Digilite FDF (TIKA-2988);
+     and xml-root detection for XFDF (TIKA-2990) and XDP (TIKA-2989).
+
+   * Add an XLZ Parser (TIKA-2976).
+   
+   * Fix deadlock with ForkParser when InputStream throws IOException (TIKA-2892).
+
+Release 1.22 - 07/29/2019
+
+   * NOTE: tika-server no longer hard-codes the HtmlParser to handle
+     XML files (TIKA-2910).  Users must now configure that behavior
+     via a tika-config.xml file.
+
+   * NOTE: Known regression: PDFBOX-4587 -- PDF passwords with codepoints
+     between 0xF000 and 0XF0000 will cause an exception.
+
+   * Add parser for HWP v5 files via SooMyung Lee (soomyung) and
+     JinSup Kim (ddoleye) (TIKA-2909).
+
+   * Fix order of closing streams to avoid "Failed to close temporary resource"
+     exception in TesseractOCRParser (TIKA-2908).
+
+   * Improve AutoDetectReader performance by caching encoding
+     detector (TIKA-1568).
+
+   * Prevent RTFParser from outputting illegal tag combinations (TIKA-2889).
+
+   * Fix RereadableInputStream to release all resources (TIKA-2903).
+
+   * Implement custom language identifier in the tika-eval module based on
+     OpenNLP's language detector; add 18 languages and add common words
+     lists for all 121 languages (TIKA-2790).
+
+   * Fix NPE in MimeTypesReader.releaseParser() via Eamonn Saunders (TIKA-2896).
+
+   * Fix RTFParser to extract more content (TIKA-2883).
+
+   * Add clientSubmitTime to the metadata extracted from PST files (TIKA-2898).
+
+   * Improve StreamingZipContainerDetector for xltx, xltm and
+     several other file formats (TIKA-2886).
+
+Release 1.21 - 05/14/2019
+
+   * Add optional AUTO mode to OCR'ing of PDFs.  If tesseract is installed
+     and on the path, and this option is selected programmatically
+     or via TikaConfig(), the PDFParser will use heuristics to decide
+     whether or not to run OCR per page on PDFs. (TIKA-2749)
+
+   * The ZipContainerDetector's default behavior was changed to run
+     streaming detection up to its markLimit.  Users can get the
+     legacy behavior (spool-to-file/rely-on-underlying-file-in-TikaInputStream)
+     by setting markLimit=-1. The POIFSContainerDetector requires an underlying file;
+     it will try to spool the file to disk; if the file's length is > markLimit,
+     it will not attempt detection; set markLimit to -1 for legacy behavior (TIKA-2849).
+
+   * Upgrade PDFBox to 2.0.14 (TIKA-2834).
+
+   * Add CSV detection and replace TXTParser with TextAndCSVParser;
+     users can turn off CSV detection by excluding the TextAndCSVParser
+     and adding back the TXTParser via tika-config (TIKA-2833).
+
+   * Add a CSVParser.  CSV detection is currently based solely on filename
+     and/or information conveyed via Metadata (TIKA-2826).
+
+   * General upgrades: asm, bouncycastle, commons-codec, commons-lang3, cxf,
+     guava, h2, httpcomponents, jackcess, junrar, Lucene, mime4j, opennlp, parso,
+     sqlite-jdbc (provided), zstd-jni (provided) (TIKA-2824)
+
+   * Bundle xerces2 with tika-parsers (TIKA-2802).
+
+   * Upgrade jaxb to 2.3.2 (TIKA-2819).
+
+   * Upgrade jackson to 2.9.8 (TIKA-2717).
+
+   * Update tika-eval's common tokens lists (TIKA-2822).
+
+   * Handle bad tags in tika-eval more robustly (TIKA-2810).
+
+   * Add reports for tags in tika-eval (TIKA-2809).
+
+   * Extract text from SDT element within textboxes in .docx files (TIKA-2807).
+
+   * Try to handle truncated OOXML files more robustly (TIKA-2765).
+
+Release 1.20 - 12/17/2018
+
+   * Upgrade to POI 4.0.1 (TIKA-2751).
+
+   * Integrate/parameterize new angles handling in
+     PDFBox (TIKA-2779).
+
+   * Upgrade to PDFBox 2.0.13 (TIKA-2788).
+
+   * Prevent content within <style/> and <script/> elements
+     from being written in the ToTextContentHandler (TIKA-2550).
+
+   * Switch child to parent communication to a shared memory-mapped
+     file in tika-server's -spawnChild mode.
+
+   * Fix bug in tika-server when run in legacy mode (not -spawnChild)
+     that caused it to return 503 on documents submitted after
+     it hit an OutOfMemoryError (TIKA-2776).
+
+   * Upgrade jaxb-runtime and javax.activation (TIKA-2778).
+
+   * tika-app in batch mode now requires an interrupt or
+     kill signal to the parent process to stop the parent
+     and the child processes (TIKA-2780).
+
+   * Bulk upgrade of dependencies (TIKA-2775).
+
+   * Improve language id efficiency in tika-eval (TIKA-2777).
+
+   * Upgrade sqlite "provided" dependency to 3.25.2 (TIKA-2773).
+
+   * Remove duplication of notes in PPT slides (TIKA-2735)
+
+   * Use -javaHome or $JAVA_HOME (if they exist) when
+     spawning child in tika-server's -spawnChild mode.
+
+   * Fixed closing of styles around Hyperlinks in Word Parser
+     Contributed by Ronan O'Sullivan (TIKA-2599).
+
+Release 1.19.1 - 10/4/2018
+
+   * Update PDFBox to 2.0.12, jempbox to 1.8.16
+     and jbig2 to 3.0.2 (TIKA-2745).
+
+   * Fix regression in parser for MP3 files (TIKA-2730).
+
+   * Updated Python Dependency Check for TesseractOCR (TIKA-2740).
+
+   * Improve SAXParser robustness (TIKA-2727).
+
+   * Remove dependency on slf4j-log4j12 by upgrading jmatio (TIKA-2742).
+
+   * Replace com.sun.xml.bind:jaxb-impl and jaxb-core with
+     org.glassfish.jaxb:jaxb-runtime and jaxb-core (TIKA-2743)
+
+Release 1.19 - 9/14/2018
+
+   * Require Java 8 (TIKA-2679).
+
+   * Enable building with Java 11 (TIKA-2668)
+
+   * Add an option to make tika-server robust against infinite loops,
+     OOMs, and memory leaks (TIKA-2725).
+
+   * Allow configuration of the Tesseract parser via the standard
+     tika-config.xml options (TIKA-2705).
+
+   * Improve handling of empty cells across table-based
+     formats (TIKA-2479).
+
+   * Add a Standards compliant HTML encoding detector
+     via Gerard Bouchar (TIKA-2673).
+
+   * Improved XML parsing -- limited default entity expansions to 20.
+     To raise this limit, add -Djdk.xml.entityExpansionLimit=XXX to
+     your commandline.
+
+   * Mime magic improvements for Olympus RAW (TIKA-2658), interpreted
+     server-side languages via HTTP (TIKA-2648), MHTML (TIKA-2723)
+
+   * Add absolute timeout to ForkParser rather than testing
+     for active (TIKA-2656).
+
+   * Make the RecursiveParserWrapper work with the ForkParser (TIKA-2655).
+
+   * Allow the ForkParser to specify a directory containing tika-app.jar
+     for use by the ForkServer.  This allows users to keep most of the
+     parser dependencies out of their code; and it allows for an easy
+     addition of optional jars for Parser dependencies,
+     such as the xerial sqlite jar (TIKA-2653).
+
+   * Use a pool for SAXParsers and DOMBuilders rather than creating
+     a new parser/builder for every parse.
+     For better performance, set XMLReaderUtils.setPoolSize() to the
+     number of threads you're using with Tika (TIKA-2645).
+
+   * Add the RecursiveParserWrapperHandler to improve the RecursiveParserWrapper
+     API slightly (TIKA-2644).
+
+   * Upgraded to Commons-Compress 1.18 (TIKA-2707).
+
+   * Upgraded to Apache POI 4.0.0 (TIKA-2552).
+
+   * Upgraded to Apache PDFBox 2.0.11 (TIKA-2681).
+
+   * Upgraded to deeplearning4j 1.0.0-beta2 (TIKA-2672).
+
+   * Upgraded jmatio to 1.4 (TIKA-2667)
+
+   * Upgraded Apache Lucene to 7.4.0 in tika-eval and tika-examples (TIKA-2695).
+
+   * Upgraded junrar to 1.0.1 (TIKA-2664).
+
+   * Numerous other upgrades (TIKA-2692).
+
+   * Excluded Spring as a transitive dependency (TIKA-2721).
+
+Release 1.18 - 4/20/2018
+
+   * Upgrade jackson to 2.9.5 (TIKA-2634).
+
+   * Add support for brotli (TIKA-2621).
+
+   * Upgrade PDFBox to 2.0.9 and include new jbig2-imageio
+     from org.apache.pdfbox (TIKA-2579 and TIKA-2607).
+
+   * Support for TIFF images in PDF files (TIKA-2338)
+   
+   * Detection of fully encrypted 7z files (TIKA-2568)
+
+   * Various new mimes and typo fixes in tika-mimetypes.xml
+     via Andreas Meier (TIKA-2527).
+
+   * Revert to listenForAllRecords=false in ExcelExtractor
+     via Grigoriy Alekseev (TIKA-2590)
+
+   * Add workaround to identify TIFFs that might confuse
+     commons-compress's tar detection via Daniel Schmidt
+     (TIKA-2591)
+
+   * Ignore non-IANA supported charsets in HTML meta-headers
+     during charset detection in HTMLEncodingDetector
+     via Andreas Meier (TIKA-2592)
+
+   * Add detection and parsing of zstd (if user provides
+     com.github.luben:zstd-jni) via Andreas Meier (TIKA-2576)
+
+   * Allow for RFC822 detection for files starting with "dkim-"
+     and/or "x-" via Andreas Meier (TIKA-2578 and TIKA-2587)
+
+   * Extract xlsx files embedded in OLE objects within PPT and PPTX
+     via Brian McColgan (TIKA-2588).
+
+   * Extract files embedded in HTML and javascript inside HTML
+     that are stored in the Data URI scheme (TIKA-2563).
+
+   * Extract text from grouped text boxes in PPT (TIKA-2569).
+
+   * Extract language metadata item from PDF files via Matt Sheppard (TIKA-2559)
+
+   * RFC822 with multipart/mixed, first text element should be treated
+     as the main body of the email, not an attachment (TIKA-2547).
+
+   * Swap out com.tdunning:json for com.github.openjson:openjson to avoid
+     jar conflicts (TIKA-2556).
+
+   * No longer hardcode HtmlParser for XML files in tika-server (TIKA-2551).
+
+   * Require Java 8 (TIKA-2553).
+
+   * Add a parser for XPS (TIKA-2524).
+
+   * Mime magic for Dolby Digital AC3 and EAC3 files
+
+   * Fixed bug where TesseractOCRParser ignores configured ImageMagickPath,
+     and set rotation script to ignore Python warnings (TIKA-2509)
+
+   * Upgrade geo-apis to 3.0.1 (TIKA-2535)
+
+   * Mime definition and magic improvements for text-based programming
+     and config formats (TIKA-2554, TIKA-2567, TIKA-1141)
+
+   * Added local Docker image build using dockerfile-maven-plugin to allow
+     images to be built from source (TIKA-1518).
+
+   * Support for SAS7BDAT data files (TIKA-2462)
+
+   * Handle .epub files using .htm rather than .html extensions for the
+     embedded contents (TIKA-1288)
+
+   * Mime magic for ACES Images (TIKA-2628) and DPX Images (TIKA-2629)
+
+   * For sparse XLSX and XLSB files, always output missing cells to
+     the left of filled ones (matching XLS), and optionally output
+     missing rows on all 3 formats if requested via the
+     OfficeParserContext (TIKA-2479)
+
+Release 1.17 - 12/8/2017
+
+  ***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN
+     ON Java 7.  The next versions will require Java 8***
+
+  * Fix thread-safety in ChmExtractor (TIKA-2519).
+
+  * Upgrade cxf to 3.0.16 (TIKA-2516).
+
+  * Allow users to configure maxMainMemoryBytes for PDFs via shrike (PR-213).
+
+  * Extract underline and strikethrough in docx (TIKA-2347 and TIKA-2512).
+
+  * Cache TikaConfig in EmbeddedDocumentUtil for better performance
+    in documents with large number of attachments (TIKA-2511).
+
+  * Extract media files from ooxml (TIKA-2510).
+
+  * Standardize the way the Image and Video captioning 
+    dockers and extraction work (TIKA-2400, GitHub-208)
+
+  * Upgrade to xmpcore 5.1.3 (TIKA-2034).
+
+  * Upgrade to metadata-extractor 2.10.1 (TIKA-2486).
+
+  * Upgrade to OpenNLP 1.8.3 (TIKA-2502).
+
+  * Upgrade to Jackson 2.9.2 (TIKA-2501).
+
+  * Catch potential NPE in getting InputStream for attachments
+    in PST file (TIKA-2488).
+
+  * Upgrade to PDFBox 2.0.8 (TIKA-2489).
+
+  * Allow configuration of markLimit in EncodingDetectors
+    via tika-config.xml (TIKA-2485).
+
+  * RFC822Parser now selects the best alternative for
+    multipart/alternative body components.  This aligns with the
+    behavior of the OutlookParser (TIKA-2478).  Users can select
+    legacy behavior via the "extractAllAlternatives" parameter
+    in the RFC822 parser definition in tika-config.xml.
+
+  * Narrow mime detection for ms-owner files and add detection
+    for .nls files (TIKA-2469).
+
+  * Fix bug in CharsetDetector that led to different detected charsets
+    depending on whether user setText with a byte[] or an InputStream
+    via Sean Story (TIKA-2475).
+
+  * Remove JAXB for easier use with Java 9 via Robert Munteanu (TIKA-2466).
+
+  * Upgrade to POI 3.17 (TIKA-2429).
+
+  * Enabling extraction of standard references from text (TIKA-2449).
+
+  * Load external custom mimetypes XML from system property 
+    tika.custom-mimetypes (TIKA-2460). 
+
+  * Extract number of tiffs in a multi-page tiff (TIKA-2451).
+
+  * Fix detection of emails extracted from mbox (TIKA-2456).
+  
+  * Add OverrideDetector and allow PSTParser to specify body content type
+    as text or html -- to avoid incorrect auto-detection of
+    rfc/mbox, etc. (TIKA-2454)
+
+  * AutoDetectParser throws ZeroByteFileException for zero-byte files after
+    detection on the file extension (TIKA-2450).
+
+  * Extract phonetic runs in docx with experimental SAX parser (TIKA-2448).
+
+  * Extract phonetic runs from xls and allow users to turn off extraction
+    of phonetic runs in both xls and xlsx (TIKA-2440).
+
+  * OOXML locale should be set by POI's LocaleUtil not Locale.getDefault().
+    Fix unit tests to be robust against different locales in OOXML
+    and ExcelParser (TIKA-2438).
+
+  * Upgrade to PDFBox 2.0.7 (TIKA-2431).
+
+  * Tika now has support for automatic image captioning, that
+    combines Computer Vision and Natural Language Processing to
+    automatically generate a readable caption for an image 
+    (TIKA-2262, TIKA-2355, TIKA-2402, Gh-198, Gh-196, Gh-189).
+
+  * Add TestCorruptedFiles to allow devs to test parsers against
+    corrupted input files (TIKA-2430).
+
+  * Correct Mimetype definition for Windows batch files (CMD and BAT)
+    which are the same (TIKA-2445)
+
+  * PSDParser memory use improvements (TIKA-2447)
+
+  * Add underline extraction from Word documents (doc/docx) via Stuart Hendren
+    as well as strikethrough extraction in docx (TIKA-2347, GitHub-173)
+
+  * Corrected Tesseract OCR rotation.py script and made it a configurable
+    option via Peter Weiss (TIKA-2385) 
+ 
+Release 1.16 - 7/7/2017
+
+  * Exclude jj2000 from edu.ucar grib to avoid potential
+    license conflicts with ASL 2.0
+
+  * Add Age recognition using Ensemble model for Linear regression
+    and Apache OpenNLP Maximum Entropy. Tika can now detect age from
+    text (TIKA-1988).
+
+  * Add Tika Deep Learning support for the VGG16 model for
+    Very Deep Convolutional Networks for Large-Scale Image Recognition.
+    Now Tika supports both Inception v3/v4 and VGG16 based image 
+    recognition (TIKA-2298).
+
+  * Extract macros from PPT (TIKA-2089).
+
+  * Extract absolute path for last saved location when available
+    in .xlsx and .xlsb (TIKA-2335).
+
+  * Rename SentimentParser to SentimentAnalysisParser to
+    prevent conflict with dependency (TIKA-2368).
+
+  * tika-app now extracts inline images in PDFs by
+    default, and it includes a warning to users that this is not the
+    default behavior elsewhere in Tika (TIKA-2374).
+
+  * Allow configurability of warnings for problems during
+    parser initialization (TIKA-2389).
+
+  * Upgrade to Jackcess 2.1.8 (TIKA-2380).
+
+  * Upgrade to POI 3.17-beta1 (TIKA-2336).
+
+  * Remove non-ASL-2.0-compatible org.json (TIKA-1804).
+
+  * Allow extraction of <script> elements in HTML as embedded "MACRO".
+    Users must turn this on via TikaConfig (TIKA-2391).
+
+  * Allow users to turn off extraction of headers and footers
+    from .doc, .docx, .xls, .xlsx, .xlsb (TIKA-2362)
+
+  * Extract text from charts in .docx, .pptx, .xlsx and .xlsb
+    (TIKA-2254).
+
+  * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb
+    (TIKA-1945).
+
+  * Fix bug in tika-server that led to an attempt to close the
+    input stream twice (TIKA-2384).
+
+  * Enable base32 encoding of digests and enable BouncyCastle implementations
+    of digest algorithms (TIKA-2386).
+
+  * Add snap builds to codebase (TIKA-2401)
+
+  * Canonical Mimetype of WAVE audio changed to match RFC 2361 defined
+    version, audio/vnd.wave, older audio/x-wav remains as an alias
+
+  * Upgrade "provided" xerial to 3.19.3 (TIKA-2412).
+
+  * Upgrade Gson to 2.8.1 (TIKA-2414).
+
+  * Upgrade mime4j to 0.8.1 (TIKA-2413).
+
+  * Mime magic improvements for GraphViz (TIKA-2422), HTML files which
+    claim to be XML but aren't quite valid XML (TIKA-2419) and QuickTime
+    / MP4 (TIKA-2418)
+
+Release 1.15 - 05/23/2017
+
+  * Tika now has a module for Deep Learning powered by the 
+    DL4J toolkit. The initial included model is for InceptionV3
+    and so using this module, natively in Java, Tika can use 
+    Deep learning for metadata/text extraction from Images using
+    the power of the Inception model (Github-165).
+
+  * A new parser for sentiment analysis using a categorical 
+    (multi-class, angry, sad, neutral, like, love) and binary
+    (positive/negative) was added leveraging the USC data 
+    science work (TIKA-2016).
+
+  * Tika now has the ability to automatically detect objects in videos,
+    using OpenCV and Tensorflow (TIKA-2322).
+
+  * Change default behavior to parse embedded documents even if the user
+    forgets to specify a Parser.class in the ParseContext (TIKA-2096).
+    Users who wish to parse only the container document should set
+    an EmptyParser as the Parser.class in the ParseContext.
+
+  * Change default behavior of Office Parsers to _not_ extract
+    Macros.  User needs to setExtractMacros to "true" (TIKA-2302).
+
+  * Added tika-eval module (TIKA-1332).
+
+  * Unified logging across Tika: SLF4J as logging API, Apache Log4j as
+    implementation with JCL and JUL bridges in standalone tools like
+    tika-app, tika-batch and tika-server (TIKA-2245).
+
+  * Add parser for XLSB files (TIKA-1195).
+
+  * Add parsers for EMF/WMF files (TIKA-2246/TIKA-2247).
+
+  * Add parsers for WordPerfect and QuattroPro (.qpw) files.
+    Contributed by Pascal Essiembre (TIKA-1946 and TIKA-2228).
+
+  * Add experimental SAX parser for .pptx files. To select this parser,
+    set useSAXPptxExtractor(true) on OfficeParserConfig (TIKA-2210).
+
+  * Add experimental SAX parser for .docx files. To select this parser,
+    set useSAXDocxExtractor(true) on OfficeParserConfig (TIKA-1321, TIKA-2191).
+
+  * Add mime detection and parser for Word 2006ML format (TIKA-2179).
+
+  * Bug fix for WordPerfect via Pascal Essiembre (TIKA-2352).
+
+  * Added "text-main" equivalent option to tika-server via
+    /tika/main (TIKA-2343).
+
+  * Enabled configuration of the EncodingDetector used by
+    parsers that extend AbstractEncodingDetectorParser (TIKA-2273).
+
+  * Prevent easily preventable OOMs for both detection and parsing
+    of some compression formats (TIKA-2330).
+
+  * Extract images and thumbnails from ODT via Sam Bayer (TIKA-2295).
+
+  * Fix potential NPE in FeedParser via Julien Nioche (TIKA-2269).
+
+  * Official mime types for BMP, EMF and WMF have been registered with
+    IANA, so switch to these (image/bmp image/emf image/wmf) (TIKA-2250)
+
+  * Be more parsimonious with BufferedInputStreams via Josh Hight
+    (TIKA-2244).
+
+  * Enable handling of hyphenated language codes in TesseractOCRParser
+    via Graham Russell (TIKA-2231).
+
+  * Improve style tags in ODT (TIKA-2242).
+
+  * Add container detection for embedded MSEquation files (TIKA-2238).
+
+  * Add parsing of JBIG2 and extraction of JBIG2 from PDFs when
+    required dependencies are added to class path by user.
+    Contributed by Pascal Essiembre (TIKA-2232).
+
+  * Mime magic for the OneNote family (.one / .onetoc / .onepkg), no parser
+    (TIKA-2224).
+
+  * Add configurability of "preserve-interword-spacing" to
+    TesseractOCRParser (TIKA-2190).
+
+  * Upgrade to PDFBox 2.0.6 and JempBox 1.8.13 (TIKA-2209/TIKA-2236/TIKA-2361).
+
+  * Refactor MockParser to consolidate service loading
+    and mime types into tika-core/src/test (TIKA-2195).
+
+  * Enabled extraction of embedded objects from headers, footers,
+    footnotes, endnotes and comments in legacy .docx parser (TIKA-2192).
+
+  * Allow extraction of PDActions (including Javascript) from
+    PDFs (TIKA-2090).  This is turned off by default.  Users
+    must setExtractActions(true) on the PDFParserConfig.
+
+  * Change default behavior in experimental .docx parser to ignore
+    deleted text to align with .doc (TIKA-2187).
+
+  * Upgrade to POI 3.16 (TIKA-2116, TIKA-2181, TIKA-2329).
+
+  * Allow configuration of timeout for ForkParser (TIKA-2170).
+
+  * Add extraction of .jpx inline images from PDFs when required
+    dependencies are added by user to class path (TIKA-2175).
+
+  * Add .jpx, .jp2, .ppm to formats handled by Tesseract (TIKA-2174).
+
+  * Upgrade SQLite "provided" dependency to 3.16.1 (TIKA-2334).
+
+  * Update Apache CXF version to 3.0.12 (TIKA-2292).
+
+  * Add Lingo24 Language Detector (TIKA-2297).
+
+  * Further mime magic for WebVTT (TIKA-1772)
+
+  * Extend support for increased PSM options up to 13 for modern 
+    versions of Tesseract (TIKA-2357).
+
+  * Prevent potential resource leak by closing TrueTypeFont
+    via Cameron Rollheiser (TIKA-2370).
+
+Release 1.14 - 10/19/2016
+
+  * Extract all headers from MSG/RFC822 (TIKA-2122).
+
+  * Upgrade metadata-extractor to 2.9.1 (TIKA-2113).
+
+  * Extract PDF DocInfo metadata into separate keys to prevent
+    overwriting by XMP metadata (TIKA-2057).
+
+  * Re-enable fileUrl for tika-server (TIKA-2081).  If you choose
+    to use this feature, beware of the security vulnerabilities!
+    See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271
+
+  * Add Tesseract's hOCR output format as an option, via Eric Pugh
+    (TIKA-2093)
+
+  * Extract macros from MSOffice files (TIKA-2069).
+
+  * Maintain passed-in mime in TXTParser (TIKA-2047).
+
+  * Upgrade to POI 3.15 (TIKA-2013).
+
+  * Upgrade to PDFBox 2.0.3 (TIKA-2051).
+
+  * Fix hyperlinks with formatting in DOC and DOCX (TIKA-1255
+    and TIKA-2078)
+
+  * Tika now is integrated with the Tensorflow library from Google 
+    and it can use its Inception v3 image classification model to 
+    identify objects in images (TIKA-1993).
+
+  * Parser configuration is now type-safe and parameters for parsers
+    can have assigned types (TIKA-1508, TIKA-1986).
+
+  * Prevent OOM/permanent hang on some corrupt CHM files (TIKA-2040).
+
+  * Upgrade ICU4J charset detection components to fix multithreading
+    bug (TIKA-2041).
+
+  * Upgrade to Jackcess 2.1.4 (TIKA-2039).
+
+  * Maintain more significant digits in cells of "General" format
+    in XLS and XLSX (TIKA-2025).
+
+  * Avoid mark/reset issues when extracting or detecting embedded resources
+    in RFC822 emails (TIKA-2037).
+
+  * Improving accuracy of Tesseract for better extraction of numeric 
+    and alphanumeric text from images (TIKA-2021, TIKA-2031).
+
+  * Improve extraction of embedded documents from PPT, PPTX and XLSX
+    (TIKA-2026).
+
+  * Add parser for applefile (AppleSingle) (TIKA-2022).
+
+  * Add mime types, mime magic and/or globs for:
+     * Endnote Import File (TIKA-2011)
+     * DJVU files (TIKA-2009)
+     * MS Owner File (TIKA-2008)
+     * Windows Media Metafile (TIKA-2004)
+     * iCal and vCalendar (TIKA-2006)
+     * MBOX (TIKA-2042)
+     * Stata DTA (TIKA-2064)
+
+  * Add configurable maximum threshold for number of events extracted
+    from the XMP Media Management Schema in JempboxExtractor (TIKA-1999).
+
+  * Integrate TesseractOCR with full page image rendering for PDFs (TIKA-1994).
+
+  * Add mime detection via Nick C and parser for DBF files (TIKA-1513).
+  
+  * Add mime detection and parsers for MSOffice 2003 XML Word
+    and Excel formats (TIKA-1958).
+
+  * Extract hyperlinks from PPT, PPTX, XLSX (TIKA-1454).
+
+  * Upgrade to Commons Compress 1.12 (supports progress on TIKA-1358)
+
+Release 1.13 - 05/08/2016
+
+  * Upgrade to PDFBox 2.0.1 (TIKA-1285/TIKA-1959).
+    MAJOR CHANGES in PDFParser:
+    * The classic sequential parser is no longer available.
+    * Tiff files are no longer extracted by default.  See
+      https://pdfbox.apache.org/2.0/dependencies.html#optional-components
+      for optional components to process Tiff files.
+    * Some truncated/corrupted files that had some content extracted
+      with 1.8.x may have no content extracted in 2.0.x (see TIKA-1912).
+
+  * The MIT-NLP Information Extraction (MITIE) Named Entity
+    Recognition (NER) system is now supported in Tika
+    (TIKA-1913, GitHub-108).
+
+  * Tika now supports the use of the Yandex translation 
+    service (TIKA-1943, GitHub-106).
+
+  * Tika now uses NER to extract scientific measurements
+    from text using either GROBID Quantities, which uses
+    conditional random fields, or NLTK, which uses regular
+    expressions (TIKA-1917, GitHub-104).
+
+  * Fixed JournalParser to handle null responses from 
+    GROBID and to log a message (TIKA-1925).
+
+  * Refactored Language Detector into tika-langdetect module,
+    added default N-Gram implementation, Optimaize Lang
+    Detector and MIT Text.jl implementation 
+    (TIKA-1872, TIKA-1696, TIKA-1723).
+ 
+  * Extract metadata from MP4 videos whether or not the
+    PooledTimeSeries parser is available via Aditya Dhulipala
+    (TIKA-1844).
+
+  * Fix NPE when trying to get embedded image identifier in
+    WordParser (TIKA-1956).
+
+  * Improvements to MIME database for detection of Scientific
+    and other formats present in the TREC-DD-Polar dataset
+    (TIKA-1881, GitHub-85, TIKA-1883, TIKA-1884, TIKA-1886,
+     TIKA-1882).
+
+  * LinkContentHandler now extracts links from script tags
+    via Joseph Naegele (TIKA-1937).
+
+  * Handle per page IOExceptions more robustly in PDFParser (TIKA-1948).
+
+  * Upgrade commons-compress to 1.11 (TIKA-1949).
+
+  * Add detection for embedded MSChart.Graph files (TIKA-1033).
+
+  * Fix NPE in Sqlite parser from Nick C (TIKA-1927).
+
+  * Fix NPE in Open Document parser from Nick C (TIKA-1916).
+
+  * Upgrade mp4parser's isoparser to 1.1.7 (TIKA-1924 and TIKA-1931).
+
+  * Upgrade BouncyCastle to 1.54 (TIKA-1923).
+
+  * Upgrade Jackcess to 2.1.3 (TIKA-1922).
+
+  * Upgrade Drew Noakes' metadata-extractor to 2.8.1 (TIKA-1921).
+
+  * Upgrade Gson in tika-serialization to 2.6.2 (TIKA-1920).
+
+  * Upgrade commons-cli in tika-batch to 1.3.1 (TIKA-1919).
+
+  * Add XMPMM support to PDFParser and JpegParser via Jempbox (TIKA-1894).
+
+  * Move serialization of TikaConfig to tika-core and enable dumping
+    of the config file via tika-app (TIKA-1657).
+
+  * Tika now incorporates the Natural Language Toolkit (NLTK) from the
+    Python community as an option for Named Entity Recognition (TIKA-1876).
+
+  * Add support for XFA extraction via Pascal Essiembre (TIKA-1857).
+
+  * Upgrade to sqlite-jdbc 3.8.11.2 (TIKA-1861).  NOTE: this dependency
+    is still <scope>provided</scope>.  You need to include this dependency
+    in order to parse sqlite files.
+
+  * Upgrade to POI 3.15-beta1 (TIKA-1895).
+
+  * Upgrade to Jackson 2.7.1 (TIKA-1869).
+
+  * Upgrade to Apache SIS 0.6 (TIKA-1878).
+
+  * RichTextContentHandler moved from the Server package to Core (TIKA-1870).
+
+  * Added ZeroSizeFileDetector to support application/x-zerovalue via
+    Adesh Gupta (TIKA-1885).  
+  
+  * Addition of types information to Grobid quantities parser via 
+    Can Menekse (TIKA-1965).
+
+Release 1.12 - 01/24/2016
+
+  * Support for iFrames and element link extraction is provided in
+    the link Content Handler (TIKA-1835).
+
+  * Slide notes are now linked to the slide XHTML in the PPT output
+    (TIKA-1840).
+
+  * JSON tests in Tika server were updated to remove impossible casts
+    (Github-73).
+
+  * Fix bug in GeoTopicParser where NER is reused instead of instantiated
+    with each request (TIKA-1834).
+
+  * Upgrade rome to 1.5.1 && Downgrade Rome dependency to 0.9 to avoid 
+    nasty NPE (TIKA-1820, TIKA-1516)
+
+  * The NamedEntityParser was enhanced to generate text content
+    in addition to metadata (TIKA-1815, TIKA-1816).
+
+  * A significant speed-up is made to the GeoTopicParser by
+    using the new REST server capabilities from Lucene Geo
+    Gazetteer (TIKA-1803).
+
+  * A parser to compute motion properties in Videos, e.g., 
+    Histogram of Oriented Gradients and Histogram of Optical Flows
+    using the Pooled Time Series algorithm, was added (TIKA-1798).
+
+  * Provide NamedEntityParser which exposes Named Entity Recognition
+    from OpenNLP and Stanford NER providers (TIKA-1787, GitHub-61,
+    GitHub-62).
+
+  * Allow XHTMLContentHandler to pass attributes of html element 
+    via Markus Jelsma (TIKA-1782).
+
+  * Fix regression with spacing in PPT via Andreas Beeker (TIKA-1777).
+
+  * Tika Facade parse methods for Path and File added which take a
+    Metadata object, to mirror the existing InputStream one (GitHub-60)
+
+  * GeoParser fix for loading the NER model from a jar file (TIKA-1791)
+
+
+Release 1.11 - 10/18/2015
+
+  * Java7 API support for allowing java.nio.file.Path as method arguments
+    was added to Tika and to ParsingReader, TikaFileTypeDetector, and to
+    Tika Config (TIKA-1745, TIKA-1746, TIKA-1751).
+
+  * MIME support was added for WebVTT: The Web Video Text Tracks Format
+    files (TIKA-1772).
+
+  * MIME magic improved to ensure emails detected as message/rfc822
+    (TIKA-1771).
+
+  * Upgrade to Jackcess Encrypt 2.1.1 to avoid binary incompatibility
+    with Bouncy Castle (TIKA-1736).
+  
+  * Make div and other markup more consistent between PPT and 
+    PPTX (TIKA-1755).
+
+  * Parse multiple authors from MSOffice's semi-colon delimited
+    author field (TIKA-1765).
+  
+  * Include CTAKESConfig.properties within tika-parsers resources 
+    by default (TIKA-1741).
+  
+  * Prevent infinite recursion when processing inline images
+    in PDF files by limiting extraction of duplicate images
+    within the same page (TIKA-1742).
+
+  * Upgrade to POI 3.13-final (via Andreas Beeker) (TIKA-1707).
+
+  * Upgraded tika-batch to use Path throughout (TIKA-1747 and
+    TIKA-1754).
+
+  * Upgraded to Path in TikaInputStream (via Yaniv Kunda) (TIKA-1744).
+
+  * Changed default content handler type for "/rmeta" in tika-server
+    to "xml" to align with "-J" option in tika-app.  
+    Clients can now specify handler types via PathParam. (TIKA-1716).
+
+  * The fantastic GROBID (or Grobid) GeneRation Of BIbliographic Data
+    for machine learning from PDF files is now integrated as a 
+    Tika parser (TIKA-1699, TIKA-1712).
+
+  * The ability to specify the Tesseract Config Path was added
+    to the OCR Parser (TIKA-1703).
+
+  * Upgraded to ASM 5.0.4 (TIKA-1705).
+
+  * Corrected Tika Config XML detector definition explicit loading 
+    of MimeTypes (TIKA-1708)
+
+  * In Tika Parsers, Batch, Server, App and Examples, use Apache
+    Commons IO instead of inlined ex-Commons classes, and the Java 7
+    Standard Charset definitions (TIKA-1710)
+
+  * Upgraded to Commons Compress 1.10, which enables zlib compressed
+    archives support (TIKA-1718)
+
+
+Release 1.10 - 8/1/2015
+
+  * Tika Config XML can now be used to create composite detectors,
+    and exclude detectors that DefaultDetector would otherwise
+    have used. This brings support in-line with Parsers. (TIKA-1702)
+
+  * Reverted to legacy sort order of parsers that was 
+    mistakenly reversed in Tika 1.9 (TIKA-1689).
+
+  * Upgrade to POI 3.13-beta1 (TIKA-1667).
+
+  * Upgrade to PDFBox 1.8.10 (TIKA-1588).
+
+  * MimeTypes now tries to find a registered type with and 
+    without parameters (TIKA-1692).
+
+  * Added more robust error handling for encoding detection
+    of .MSG files (TIKA-1238).
+
+  * Fixed bug in Tika's use of the Jackcess parser that 
+    prevented reading of v97 Access files (TIKA-1681).
+
+  * Upgrade xerial.org's sqlite-jdbc to 3.8.10.1. NOTE: 
+    as of Tika 1.9, this jar is "provided." Make sure 
+    to upgrade your provided jar! (TIKA-1687).
+
+  * Add header/footer extraction to xls (via Aeham Abushwashi)
+    (TIKA-1400).
+
+  * Drop the source file name from the embedded file path in
+    RecursiveParserWrapper's "X-TIKA:embedded_resource_path" 
+    (TIKA-1673).
+
+  * Upgraded to Java 7 (TIKA-1536).
+
+  * Non-standards compliant emails are now correctly detected
+    as message/rfc822 (TIKA-1602).
+
+  * Added parser for MS Access files via Jackcess. Many thanks 
+    to Health Market Science, Brian O'Neill and James Ahlborn 
+    for relicensing Jackcess to Apache v2! (TIKA-1601)
+
+  * GDALParser now correctly sets "nitf" as a supported 
+    MediaType (TIKA-1664).
+
+  * Added DigestingParser to calculate digest hashes 
+    and record them in metadata. Integrated with
+    tika-app and tika-server (TIKA-1663).
+
+  * Fixed ZipContainerDetector to detect all IPA files
+    (TIKA-1659).
+
+
+Release 1.9 - 6/6/2015
+
+  * The ability to use the cTAKES clinical text
+    knowledge extraction system for biomedical data is 
+    now included as a Tika parser (TIKA-1645, TIKA-1642).
+
+  * Tika-server allows a user to specify the Tika config
+    from the command line (TIKA-1652, TIKA-1426).
+
+  * Matlab file detection has been improved (TIKA-1634).
+
+  * The EXIFTool was added as an External parser
+    (TIKA-1639).
+
+  * If FFMPEG is installed and on the PATH, it is a 
+    usable Parser in Tika now (TIKA-1510).
+
+  * Fixes have been applied to the ExternalParser to make
+    it functional (TIKA-1638).
+
+  * Tika service loading can now be more verbose with the 
+    org.apache.tika.service.error.warn system property (TIKA-1636).
+
+  * Tika Server now allows for metadata extraction from remote
+    URLs and in addition it outputs the detected language as a
+    metadata field (TIKA-1625).
+
+  * Fix OUTPUT_FILE_TOKEN not being replaced in ExternalParser,
+    contributed by Pascal Essiembre (TIKA-1620).
+
+  * Tika REST server now supports language identification
+    (TIKA-1622).
+
+  * All of the example code from the Tika in Action book has 
+    been donated to Tika and added to tika-examples (TIKA-1562).
+
+  * Tika server now logs errors determining ContentDisposition
+    (TIKA-1621).
+
+  * An algorithm for using Byte Histogram frequencies to construct
+    a Neural Network and to perform MIME detection was added
+    (TIKA-1582).
+
+  * A Bayesian algorithm for MIME detection by probabilistic
+    means was added (TIKA-1517).
+
+  * Tika now incorporates the Apache Spatial Information
+    System capability of parsing Geographic ISO 19139
+    files (TIKA-443). It can also detect those files.
+
+  * Update the MimeTypes code to support inheritance
+    (TIKA-1535).
+
+  * Provide ability to parse and identify Global Change 
+    Master Directory Interchange Format (GCMD DIF) 
+    scientific data files (TIKA-1532).
+
+  * Improvements to detect CBOR files by extension (TIKA-1610).
+
+  * Change xerial.org's sqlite-jdbc jar to "provided" (TIKA-1511).
+    Users will now need to add sqlite-jdbc to their classpath for
+    the Sqlite3Parser to work.
+
+  * ExternalParser.check now catches (suppresses) SecurityException
+    and returns false, so it's OK to run Tika with a security policy
+    that does not allow execution of external processes (TIKA-1628).
+
+Release 1.8 - 4/13/2015
+
+  * Fix null pointer when processing ODT footer styles (TIKA-1600).
+
+  * Upgrade com.drewnoakes' metadata-extractor to 2.0 and
+    add parser for webp metadata (TIKA-1594).
+
+  * Duration extracted from MP3s with no ID3 tags (TIKA-1589).
+
+  * Upgraded to PDFBox 1.8.9 (TIKA-1575).
+
+  * Tika now supports the IsaTab data standard for bioinformatics
+    both in terms of MIME identification and in terms of parsing
+    (TIKA-1580).
+
+  * Tika server can now enable CORS requests with the command line
+    "--cors" or "-C" option (TIKA-1586).
+
+  * Update jhighlight dependency to avoid using LGPL license. Thanks
+    to @kkrugler for his great contribution (TIKA-1581).
+  
+  * Updated HDF and NetCDF parsers to output file version in 
+    metadata (TIKA-1578 and TIKA-1579).
+
+  * Upgraded to POI 3.12-beta1 (TIKA-1531).
+
+  * Added tika-batch module for directory to directory batch
+    processing.  This is a new, experimental capability, and the API will 
+    likely change in future releases (TIKA-1330).
+
+  * Translator.translate() Exceptions are now restricted to
+    TikaException and IOException (TIKA-1416).
+
+  * Tika now supports MIME detection for Microsoft Enhanced
+    Metafiles (EMF) (TIKA-1554).
+
+  * Tika has improved delineation in XML and HTML MIME detection
+    (TIKA-1365).
+
+  * Upgraded the Drew Noakes metadata-extractor to version 2.7.2
+    (TIKA-1576).
+
+  * Added basic style support for ODF documents, contributed by
+    Axel Dörfler (TIKA-1063).
+
+  * Move Tika server resources and writers to separate
+    org.apache.tika.server.resource and writer packages (TIKA-1564).
+
+  * Upgrade UCAR dependencies to 4.5.5 (TIKA-1571).
+  
+  * Fix Paths in Tika server welcome page (TIKA-1567).
+
+  * Fixed infinite recursion while parsing some PDFs (TIKA-1038).
+
+  * XHTMLContentHandler now properly passes along body attributes,
+    contributed by Markus Jelsma (TIKA-995).
+
+  * TikaCLI option --compare-file-magic to report mime types known to
+    the file(1) tool but not known / fully known to Tika.
+
+  * MediaTypeRegistry support for returning known child types.
+
+  * Support for excluding certain Parsers from being
+    used by DefaultParser via the Tika Config file, using the new
+    parser-exclude tag (TIKA-1558).
+
+  * Detect Global Change Master Directory (GCMD) Directory
+    Interchange Format (DIF) files (TIKA-1561).
+
+  * Tika's JAX-RS server can now return stacktraces for
+    parse exceptions (TIKA-1323).
+
+  * Added MockParser for testing handling of exceptions, errors
+    and hangs in code that uses parsers (TIKA-1553).
+
+  * The ForkParser service was removed from Activator (rollback of TIKA-1354).
+
+  * Increased the speed of language identification by 
+    a factor of two -- contributed by Toke Eskildsen (TIKA-1549).
+
+  * Added parser for Sqlite3 db files. Some users will need to 
+    exclude the dependency on xerial.org's sqlite-jdbc because
+    it contains native libs (TIKA-1511).
+
+  * Use POST instead of PUT for tika-server form methods
+    (TIKA-1547).
+
+  * A basic wrapper around the UNIX file command was
+    added to extract Strings. In addition, a parser to
+    handle Strings parsing from octet-streams using Latin1
+    charsets was added (TIKA-1541, TIKA-1483).
+
+  * Add test files and detection mechanism for Gridded
+    Binary (GRIB) files (TIKA-1539).
+
+  * The RAR parser was updated to handle Chinese characters 
+    using the functionality provided by allowing encoding to
+    be used within ZipArchiveInputStream (TIKA-936).
+
+  * Fix out of memory error in surefire plugin (TIKA-1537).
+
+  * Build a parser to extract data from GRIB formats (TIKA-1423).
+
+  * Upgrade to Commons Compress 1.9 (TIKA-1534).
+
+  * Include media duration in metadata parsed by MP4Parser (TIKA-1530).
+
+  * Support password protected 7zip files (using a PasswordProvider,
+    in keeping with the other password supporting formats) (TIKA-1521).
+
+  * Password protected Zip files should not trigger an exception (TIKA-1028).
+
+Release 1.7 - 1/9/2015
+
+  * Fixed resource leak in OutlookPSTParser that caused TikaException 
+    when invoked via AutoDetectParser on Windows (TIKA-1506).
+
+  * HTML tags are properly stripped from content by FeedParser
+    (TIKA-1500).
+
+  * Tika Server support for selecting a single metadata key;
+    wrapped MetadataEP into MetadataResource (TIKA-1499).
+
+  * Tika Server support for JSON and XMP views of metadata (TIKA-1497).
+
+  * Tika Parent uses dependency management to keep duplicate 
+    dependencies in different modules the same version (TIKA-1384).
+
+  * Upgraded slf4j to version 1.7.7 (TIKA-1496).
+
+  * Tika Server support for RecursiveParserWrapper's JSON output
+    (endpoint=rmeta) equivalent to (TIKA-1451's) -J option 
+    in tika-app (TIKA-1498).
+
+  * Tika Server support for providing the password for files on a 
+    per-request basis through the Password http header (TIKA-1494).
+
+  * Simple support for the BPG (Better Portable Graphics) image format
+    (TIKA-1491, TIKA-1495).
+
+  * Prevent exceptions from being thrown for some malformed
+    mp3 files (TIKA-1218).
+
+  * Reformat pom.xml files to use two spaces per indent (TIKA-1475).
+
+  * Fix warning of slf4j logger on Tika Server startup (TIKA-1472).
+
+  * Tika CLI and GUI now have option to view JSON rendering of output
+    of RecursiveParserWrapper (TIKA-1451).
+
+  * Tika now integrates the Geospatial Data Abstraction Library
+    (GDAL) for parsing hundreds of geospatial formats (TIKA-605,
+    TIKA-1503).
+
+  * ExternalParsers can now use regexes to specify dynamic keys
+    (TIKA-1441).
+
+  * Thread safety issues in ImageMetadataExtractor were resolved
+    (TIKA-1369).
+ 
+  * The ForkParser service is now registered in Activator
+    (TIKA-1354).
+
+  * The Rome Library was upgraded to version 1.5 (TIKA-1435).
+
+  * Add markup for files embedded in PDFs (TIKA-1427).
+ 
+  * Extract files embedded in annotations in PDFs (TIKA-1433).
+
+  * Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442).
+
+  * Add RecursiveParserWrapper (aka Jukka's and Nick's
+    RecursiveMetadataParser) (TIKA-1329).
+
+  * Add example for how to dump TikaConfig to XML (TIKA-1418).
+
+  * Allow users to specify a tika config file for tika-app (TIKA-1426).
+
+  * PackageParser includes the last-modified date from the archive
+    in the metadata, when handling embedded entries (TIKA-1246)
+
+  * Created a new Tesseract OCR Parser to extract text from images.
+    Requires installation of Tesseract before use (TIKA-93).
+
+  * Basic parser for older Excel formats, such as Excel 4, 5 and 95,
+    which can get simple text, and metadata for Excel 5+95 (TIKA-1490)
+
+
+Release 1.6 - 08/31/2014
+
+  * Parse output should indicate which Parser was actually used
+    (TIKA-674).
+
+  * Use the forbidden-apis Maven plugin to check for unsafe Java
+    operations (TIKA-1387).
+
+  * Created an ExternalTranslator class to interface with command
+    line Translators (TIKA-1385).
+
+  * Created a MosesTranslator as a subclass of ExternalTranslator
+    that calls the Moses Decoder machine translation program (TIKA-1385).
+
+  * Created the tika-example module. It will have examples of how to
+    use the main Tika interfaces (TIKA-1390).
+
+  * Upgraded to Commons Compress 1.8.1 (TIKA-1275).
+
+  * Upgraded to POI 3.11-beta1 (TIKA-1380).
+
+  * Tika now extracts SDTCell content from tables in .docx files (TIKA-1317).
+
+  * Tika now supports detection of the Persian/Farsi language.
+    (TIKA-1337)
+  
+  * The Tika Detector interface is now exposed through the JAX-RS
+    server (TIKA-1336).
+
+  * Tika now has support for parsing binary Matlab files as part of 
+    our larger effort to increase the number of scientific data formats 
+    supported. (TIKA-1327)
+
+  * The Tika Server URLs for the unpacker resources have been changed,
+    to bring them under a common prefix (TIKA-1324). The mapping is
+    /unpacker/{id} -> /unpack/{id}
+    /all/{id}      -> /unpack/all/{id}
+
+  * Added module and core Tika interface for translating text between
+    languages and added a default implementation that calls Microsoft's
+    translate service (TIKA-1319)
+
+  * Added a Translator implementation that calls Lingo24's Premium
+    Machine Translation API (TIKA-1381)
+
+  * Made RTFParser's list handling slightly more robust against corrupt
+    list metadata (TIKA-1305)
+
+  * Fixed bug in CLI json output (TIKA-1291/TIKA-1310)
+
+  * Added ability to turn off image extraction from PDFs (TIKA-1294).
+    Users must now turn on this capability via the PDFParserConfig.
+
+  * Upgrade to PDFBox 1.8.6 (TIKA-1290, TIKA-1231, TIKA-1233, TIKA-1352)
+
+  * Zip Container Detection for DWFX and XPS formats, which are OPC
+    based (TIKA-1204, TIKA-1221)
+
+  * Added a user facing welcome page to the Tika Server, which
+    says what it is, and a very brief summary of what is available. 
+    (TIKA-1269)
+
+  * Added Tika Server endpoints to list the available mime types,
+    Parsers and Detectors, similar to the --list-<foo> methods on
+    the Tika CLI App (TIKA-1270)
+
+  * Improvements to NetCDF and HDF parsing to mimic the output of
+    ncdump and extract text dimensions and spatial and variable
+    information from scientific data files (TIKA-1265)
+
+  * Extract attachments from RTF files (TIKA-1010)
+
+  * Support Outlook Personal Folders File Format *.pst (TIKA-623)
+  
+  * Added mime entries for additional Ogg based formats (TIKA-1259)
+
+  * Updated the Ogg Vorbis plugin to v0.4, which adds detection for a wider
+    range of Ogg formats, and parsers for more Ogg Audio ones (TIKA-1113)
+
+  * PDF: Images in PDF documents can now be extracted as embedded resources.
+    (TIKA-1268)
+
+  * Fixed RuntimeException thrown for certain Word Documents (TIKA-1251).
+
+  * CLI: TikaCLI now has another option: --list-parser-details-apt, which outputs
+    the list of supported parsers in APT format. This is used to generate the list
+    on the formats page (TIKA-411).
+
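For clients that still call the pre-1.6 unpacker URLs, the TIKA-1324 mapping
listed under Release 1.6 can be sketched as a small path rewrite. This is an
illustration only: the class UnpackPathMigration and its migrate method are
hypothetical names, not part of Tika, and the sketch assumes that request
paths start exactly with the old prefixes.

```java
/** Hypothetical client-side helper; not part of the Tika code base. */
public final class UnpackPathMigration {
    private static final String OLD_UNPACKER = "/unpacker/";
    private static final String OLD_ALL = "/all/";

    private UnpackPathMigration() {
    }

    /**
     * Rewrites a pre-1.6 unpacker path to the prefix introduced by TIKA-1324.
     * Paths that do not start with an old prefix are returned unchanged.
     */
    public static String migrate(String path) {
        if (path.startsWith(OLD_UNPACKER)) {
            // /unpacker/{id} -> /unpack/{id}
            return "/unpack/" + path.substring(OLD_UNPACKER.length());
        }
        if (path.startsWith(OLD_ALL)) {
            // /all/{id} -> /unpack/all/{id}
            return "/unpack/all/" + path.substring(OLD_ALL.length());
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(migrate("/unpacker/42")); // prints /unpack/42
        System.out.println(migrate("/all/42"));      // prints /unpack/all/42
    }
}
```
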
+Release 1.5 - 02/04/2014
+
+  * Fixed bug in handling of embedded file processing in PDFs (TIKA-1228).
+
+  * Added SourceCodeParser to support Java, Groovy, C++ files (TIKA-1224).
+  
+  * Updated Tika Server to support multipart/form-data payloads (TIKA-1198).
+
+  * Updated Tika Server to CXF 2.7.8 (TIKA-1197).
+
+  * Updated Tika Server to accept requests over wildcard addresses (TIKA-1196).
+
+  * Added option to use alternate NonSequentialPDFParser (TIKA-1201).
+
+  * Content from PDF AcroForms is now extracted (TIKA-973).
+
+  * Fixed invalid asterisks from master slide in PPT (TIKA-1171).
+
+  * Added test cases to confirm handling of auto-date in PPT and PPTX (TIKA-817).
+ 
+  * Text from tables in PPT files is once again extracted correctly (TIKA-1076).
+  
+  * Text is extracted from text boxes in XLSX (TIKA-1100).
+
+  * Tika no longer hangs when processing Excel files with custom fraction
+    format (TIKA-1132).
+
+  * Disconcerting stacktrace from missing beans no longer printed for some
+    DOCX files (TIKA-792).
+
+  * Upgraded POI to 3.10-beta2 (TIKA-1173).
+
+  * Upgraded PDFBox to 1.8.4 (TIKA-1230).
+
+  * Made HtmlEncodingDetector more flexible in finding meta 
+    header charset (TIKA-1001).
+
+  * Added sanitized test HTML file for local file test (TIKA-1139).
+
+  * Fixed bug that prevented attachments within a PDF from being processed
+    if the PDF itself was an attachment (TIKA-1124).
+
+  * Text from paragraph-level structured document tags in DOCX files is now
+    extracted (TIKA-1130).
+
+  * RTF: Fixed ArrayIndexOutOfBoundsException when parsing list override
+    (TIKA-1192).
+
+  * CLI: TikaCLI now escapes invalid filename characters as hex
+    characters (TIKA-1078).
+
+Release 1.4 - 06/15/2013
+
+  * Removed a test HTML file with a poorly chosen GPL text in it (TIKA-1129).
+
+  * Improvements to tika-server to allow it to produce text/html and
+    text/xml content (TIKA-1126, TIKA-1127).
+
+  * Improvements were made to the Compressor Parser to handle gzipped files
+    that require the decompressConcatenated option set to true (TIKA-1096).
+
+  * Addressed a typographic error that was preventing detection of
+    awk files (TIKA-1081).
+
+  * Added a new end-point to Tika's JAX-RS REST server that only detects
+    the media-type based on a small portion of the document submitted
+   (TIKA-1047).
+
+  * RTF: Ordered and unordered lists are now extracted (TIKA-1062).
+
+  * MP3: Audio duration is now extracted (TIKA-991)
+
+  * Java .class files: upgraded from ASM 3.1 to ASM 4.1 for parsing
+    the Java bytecodes (TIKA-1053).
+
+  * Mime Types: Definitions extended to optionally include Link (URL) and
+    UTI, along with details for several common formats (TIKA-1012 / TIKA-1083)
+
+  * Exceptions when parsing OLE10 embedded documents, when parsing
+    summary information from Office documents, and when saving
+    embedded documents in TikaCLI are now logged instead
+    of aborting extraction (TIKA-1074)
+
+  * MS Word: line tabular character is now replaced with newline
+    (TIKA-1128)
+
+  * XML: ElementMetadataHandlers can now optionally accept duplicate
+    and empty values (TIKA-1133)
+
+Release 1.3 - 01/19/2013
+
+  * Mimetype definitions added for more common programming languages,
+    including common extensions, but not magic patterns. (TIKA-1055)
+
+  * MS Word: When a Word (.doc) document contains embedded files or
+    links to external documents, Tika now places a <div
+    class="embedded" id="_XXX"/> placeholder into the XHTML so you can
+    see where in the main text the embedded document occurred
+    (TIKA-956, TIKA-1019).  Embedded Wordpad/RTF documents are now
+    recognized (TIKA-982).
+
+  * PDF: Text from pop-up annotations is now extracted (TIKA-981).
+    Text from bookmarks is now extracted (TIKA-1035).
+
+  * PKCS7: Detached signatures no longer throw NullPointerException
+    (TIKA-986).
+
+  * iWork: The chart name for charts embedded in numbers documents is
+    now extracted (TIKA-918).
+
+  * CLI: TikaCLI -m now handles multi-valued metadata keys correctly
+    (previously it only printed the first value).  (TIKA-920)
+
+  * MS Word (.docx): When a Word (.docx) document contains embedded
+    files, Tika now places a <div class="embedded" id="XXX"/> into the
+    XHTML so you can see where in the main text the embedded document
+    occurred.  The id (rId) is included in the Metadata of each
+    embedded document as the new Metadata.EMBEDDED_RELATIONSHIP_ID
+    key, and TikaCLI prepends the rId (if present) onto the filename
+    it extracts (TIKA-989).  Fixed NullPointerException when style is
+    null (TIKA-1006).  Text inside text boxes is now extracted
+    (TIKA-1005).
+
+  * RTF: Page, word, character count and creation date metadata are
+    now extracted for RTF documents (TIKA-999).
+
+  * MS PowerPoint (.pptx): When a PowerPoint (.pptx) document contains
+    embedded files, Tika now places a <div class="embedded" id="XXX"/> into the
+    XHTML so you can see where in the main text the embedded document
+    occurred.  The id (rId) is included in the Metadata of each
+    embedded document as the new Metadata.EMBEDDED_RELATIONSHIP_ID
+    key, and TikaCLI prepends the rId (if present) onto the filename
+    it extracts (TIKA-997, TIKA-1032).
+
+  * MS PowerPoint (.ppt): When a PowerPoint (.ppt) document contains
+    embedded files, Tika now places a <div class="embedded" id="XXX"/> into the
+    XHTML so you can see where in the main text the embedded document
+    occurred (TIKA-1025).  Text from the master slide is now extracted
+    (TIKA-712).
+
+  * MHTML: fixed Null charset name exception when a mime part has an
+    unrecognized charset (TIKA-1011).
+
+  * MP3: if an ID3 tag was encoded in UTF-16 with only the BOM then on
+    certain JVMs this would incorrectly extract the BOM as the tag's
+    value (TIKA-1024).
+
+  * ZIP: placeholders (<div class="embedded" id="<entry name>"/>) are
+    now left in the XHTML so you can see where each archive member
+    appears (TIKA-1036). TikaCLI would hit FileNotFoundException when
+    extracting files that were under sub-directories from a ZIP
+    archive, because it failed to create the parent directories first
+    (TIKA-1031).
+
+  * XML: a space character is now added before each element
+    (TIKA-1048)
+
+Release 1.2 - 07/10/2012
+---------------------------------
+
+  * Tika's JAX-RS based Network server now is based on Apache CXF,
+    which is available in Maven Central and now allows the server
+    module to be packaged and included in our release
+    (TIKA-593, TIKA-901).
+
+  * Tika: parseToString now lets you specify the max string length
+    per-call, in addition to per-Tika-instance. (TIKA-870)
+
+  * Tika now has the ability to detect FITS (Flexible Image Transport System) 
+    files (TIKA-874).
+
+  * Images: Fixed file handle leak in ImageParser. (TIKA-875)
+
+  * iWork: Comments in Pages files are now extracted (TIKA-907).
+    Headers, footers and footnotes in Pages files are now extracted
+    (TIKA-906).  Don't throw NullPointerException on password
+    protected iWork files, even though we can't parse their contents
+    yet (TIKA-903).  Text extracted from Keynote text boxes and bullet
+    points no longer runs together (TIKA-910). Also extract text for
+    Pages documents created in layout mode (TIKA-904).  Table names
+    are now extracted in Numbers documents (TIKA-924).  Content added
+    to master slides is also extracted (TIKA-923).
+
+  * Archive and compression formats: The Commons Compress dependency was
+    upgraded from 1.3 to 1.4.1. With this change Tika can now also parse
+    Unix dump archives and documents compressed using the XZ and Pack200
+    compression formats. (TIKA-932)
+
+  * KML: Tika now has basic support for Keyhole Markup Language documents
+    (KML and KMZ) used by tools like Google Earth. See also
+    http://www.opengeospatial.org/standards/kml/. (TIKA-941)
+
+  * CLI: You can now use the TIKA_PASSWORD environment variable or the
+    --password=X command line option to specify the password that Tika CLI
+    should use for opening encrypted documents (TIKA-943).
+
+  * Character encodings: Tika's character encoding detection mechanism was
+    improved by adding integration to the juniversalchardet library that
+    implements Mozilla's universal charset detection algorithm. The slower
+    ICU4J algorithms are still used as a fallback thanks to their wider
+    coverage of custom character encodings. (TIKA-322, TIKA-471)
+
+  * Charset parameter: Related to the character encoding improvements
+    mentioned above, Tika now returns the detected character encoding as
+    a "charset" parameter of the content type metadata field for text/plain
+    and text/html documents. For example, instead of just "text/plain", the
+    returned content type will be something like "text/plain; charset=UTF-8"
+    for a UTF-8 encoded text document. Character encoding information is still
+    present also in the content encoding metadata field for backwards
+    compatibility, but that field should be considered deprecated. (TIKA-431)
+
+  * Extraction of embedded resources from OLE2 Office Documents, where
+    the resource isn't another office document, has been fixed (TIKA-948)
+
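The "charset" parameter described under Release 1.2 (TIKA-431) means that a
client can read the detected encoding straight from the content type value.
A minimal sketch of that, using only the JDK, might look like the following.
The class ContentTypeCharset and its charsetOf method are hypothetical names
chosen for illustration, and the parsing is deliberately simple (it splits on
';' and does not handle every legal media type syntax).

```java
import java.nio.charset.Charset;
import java.util.Optional;

/** Hypothetical helper; not part of the Tika API. */
public final class ContentTypeCharset {

    private ContentTypeCharset() {
    }

    /**
     * Returns the charset named in a value such as "text/plain; charset=UTF-8",
     * or an empty Optional if no usable charset parameter is present.
     */
    public static Optional<Charset> charsetOf(String contentType) {
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq <= 0) {
                continue;
            }
            String name = param.substring(0, eq).trim();
            String value = param.substring(eq + 1).trim();
            if (!name.equalsIgnoreCase("charset")) {
                continue;
            }
            // Strip optional surrounding double quotes
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Optional.of(Charset.forName(value));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: treat as absent
                return Optional.empty();
            }
        }
        return Optional.empty();
    }
}
```

With this sketch, charsetOf("text/plain; charset=UTF-8") yields UTF-8, while
charsetOf("text/plain") yields an empty Optional.
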
+Release 1.1 - 3/7/2012
+---------------------------------
+
+ * Link Extraction: The rel attribute is now extracted from
+   links per the LinkContentHandler. (TIKA-824)
+
+ * MP3: Fixed handling of UTF-16 (two byte) ID3v2 tags (previously
+   the last character in a UTF-16 tag could be corrupted) (TIKA-793)
+
+ * Performance: Loading of the default media type registry is now
+   significantly faster. (TIKA-780)
+
+ * PDF: Allow controlling whether overlapping duplicated text should
+   be removed.  Disabling this (the default) can give big
+   speedups to text extraction and may workaround cases where
+   non-duplicated characters were incorrectly removed (TIKA-767).
+   Allow controlling whether text tokens should be sorted by their x/y
+   position before extracting text (TIKA-612); this is necessary for
+   certain PDFs.  Fixed cases where too many </p> tags appear in the
+   XHTML output, causing NPE when opening some PDFs with the GUI
+   (TIKA-778).
+
+ * RTF: Fixed case where a font change would result in processing
+   bytes in the wrong font's charset, producing bogus text output
+   (TIKA-777).  Don't output whitespace in ignored group states,
+   avoiding excessive whitespace output (TIKA-781).  Binary embedded
+   content (using \bin control word) is now skipped correctly;
+   previously it could cause the parser to incorrectly extract binary
+   content as text (TIKA-782).
+
+ * CLI: New TikaCLI option "--list-detectors", which displays the
+   mimetype detectors that are available, similar to the existing
+   "--list-parsers" option for parsers. (TIKA-785).
+
+ * Detectors: The order of detectors, as supplied via the service
+   registry loader, is now controlled. User supplied detectors are
+   preferred, then Tika detectors (such as the container aware ones),
+   and finally the core Tika MimeTypes is used as a backup. This
+   allows for specific, detailed detectors to take preference over
+   the default mime magic + filename detector. (TIKA-786)
+
+ * Microsoft Project (MPP): Filetype detection has been fixed,
+   and basic metadata (but no text) is now extracted. (TIKA-789)
+
+ * Outlook: fixed NullPointerException in TikaGUI when messages with
+   embedded RTF or HTML content were filtered (TIKA-801).
+
+ * Ogg Vorbis and FLAC: Parser added for Ogg Vorbis and FLAC audio
+   files, which extract audio metadata and tags (TIKA-747)
+
+ * MP4: Improved mime magic detection for MP4 based formats (including
+   QuickTime, MP4 Video and Audio, and 3GPP) (TIKA-851)
+
+ * MP4: Basic metadata extracting parser for MP4 files added, which includes
+   limited audio and video metadata, along with the iTunes media metadata
+   (such as Artist and Title) (TIKA-852)
+
+ * Document Passwords: A new ParseContext object, PasswordProvider,
+   has been added. This provides a way to supply the password for 
+   a document during processing. Currently, only password protected
+   PDFs and Microsoft OOXML Files are supported. (TIKA-850)
+
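The detector ordering described under Release 1.1 (TIKA-786) can be pictured
as a stable sort over three groups: user supplied detectors, Tika's own
detectors, and the core MimeTypes fallback. The sketch below is not Tika's
implementation. It assumes, purely for illustration, that a detector counts
as "Tika's own" when its class name starts with "org.apache.tika.", and the
class DetectorOrderSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only; the grouping rule here is an assumption, not Tika's code. */
public final class DetectorOrderSketch {

    private DetectorOrderSketch() {
    }

    /** Lower rank means the detector is consulted earlier. */
    static int rank(String className) {
        if (className.equals("org.apache.tika.mime.MimeTypes")) {
            return 2; // core mime magic + filename detection, used as the backup
        }
        if (className.startsWith("org.apache.tika.")) {
            return 1; // detectors shipped with Tika, e.g. container aware ones
        }
        return 0; // anything else is treated as user supplied
    }

    /** Returns a new list ordered by rank; ties keep their original order. */
    public static List<String> order(List<String> detectorClassNames) {
        List<String> sorted = new ArrayList<>(detectorClassNames);
        sorted.sort(Comparator.comparingInt(DetectorOrderSketch::rank));
        return sorted;
    }
}
```

Because List.sort is stable, detectors within the same group stay in the
order in which they were discovered.
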
+Release 1.0 - 11/4/2011
+---------------------------------
+
+The most notable changes in Tika 1.0 over previous releases are:
+
+ * API: All methods, classes and interfaces that were marked as
+   deprecated in Tika 0.10 have been removed to clean up the API
+   (TIKA-703). You may need to adjust and recompile client code
+   accordingly. The declared OSGi package versions are now 1.0, and
+   will thus not resolve for client bundles that still refer to 0.x
+   versions (TIKA-565).
+
+ * Configuration: The context class loader of the current thread is
+   no longer used as the default for loading configured parser and
+   detector classes. You can still pass an explicit class loader
+   to the configuration mechanism to get the previous behaviour.
+   (TIKA-565)
+
+ * OSGi: The tika-core bundle will now automatically pick up and use
+   any available Parser and Detector services when deployed to an OSGi
+   environment. The tika-parsers bundle provides such services
+   for all the supported file formats for which the upstream parser library
+   is available. If you don't want to track all the parser libraries as
+   separate OSGi bundles, you can use the tika-bundle bundle that packages
+   tika-parsers together with all its upstream dependencies. (TIKA-565)
+
+ * RTF: Hyperlinks in RTF documents are now extracted as an <a
+   href=...>...</a> element (TIKA-632). The RTF parser is also now
+   more robust when encountering too many closing {'s vs. opening {'s
+   (TIKA-733).
+
+ * MS Word: From Word (.doc) documents we now extract optional hyphen
+   as Unicode zero-width space (U+200B), and non-breaking hyphen as
+   Unicode non-breaking hyphen (U+2011). (TIKA-711)
+
+ * Outlook: Tika can now also process attachments in Outlook messages.
+   (TIKA-396)
+
+ * MS Office: Performance of extracting embedded office docs was improved.
+   (TIKA-753)
+
+ * PDF: The PDF parser now extracts paragraphs within each page 
+   (TIKA-742) and  can now optionally extract text from PDF 
+   annotations (TIKA-738). There's also an option to enable (the 
+   default) or disable auto-space insertion (TIKA-724). 
+
+ * Language detection: Tika can now detect Belarusian, Catalan,
+   Esperanto, Galician, Lithuanian (TIKA-582), Romanian, Slovak,
+   Slovenian, and Ukrainian (TIKA-681).
+
+ * Java: Tika no longer ships retrotranslated Java 1.4 binaries along
+   with the normal ones that work with Java 5 and higher. (TIKA-744)
+
+ * OpenOffice documents: header/footer text is now extracted for text,
+   presentation and spreadsheet documents (TIKA-736)
+
+Tika 1.0 relies on the following set of major dependencies (generated using
+mvn dependency:tree from tika-parsers):
+
+   org.apache.tika:tika-parsers:bundle:1.0
+   +- org.apache.tika:tika-core:jar:1.0:compile
+   +- edu.ucar:netcdf:jar:4.2-min:compile
+   |  \- org.slf4j:slf4j-api:jar:1.5.6:compile
+   +- org.apache.james:apache-mime4j-core:jar:0.7:compile
+   +- org.apache.james:apache-mime4j-dom:jar:0.7:compile
+   +- org.apache.commons:commons-compress:jar:1.3:compile
+   +- commons-codec:commons-codec:jar:1.5:compile
+   +- org.apache.pdfbox:pdfbox:jar:1.6.0:compile
+   |  +- org.apache.pdfbox:fontbox:jar:1.6.0:compile
+   |  +- org.apache.pdfbox:jempbox:jar:1.6.0:compile
+   |  \- commons-logging:commons-logging:jar:1.1.1:compile
+   +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
+   +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
+   +- org.apache.poi:poi:jar:3.8-beta4:compile
+   +- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile
+   +- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile
+   |  +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile
+   |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
+   |  \- dom4j:dom4j:jar:1.6.1:compile
+   +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
+   +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
+   +- asm:asm:jar:3.1:compile
+   +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
+   +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
+   +- rome:rome:jar:0.9:compile
+      \- jdom:jdom:jar:1.0:compile
+
+The following people have contributed to Tika 1.0 by submitting or commenting
+on the issues resolved in this release:
+
+Andrzej Bialecki
+Antoni Mylka
+Benson Margulies
+Chris A. Mattmann
+Cristian Vat
+Dave Meikle
+David Smiley
+Dennis Adler
+Erik Hetzner
+Ingo Renner
+Jeremias Maerki
+Jeremy Anderson
+Jeroen van Vianen
+John Bartak
+Jukka Zitting
+Julien Nioche
+Ken Krugler
+Mark Butler
+Maxim Valyanskiy
+Michael Bryant
+Michael McCandless 
+Nick Burch
+Pablo Queixalos
+Uwe Schindler
+Žygimantas Medelis
+
+
+See http://s.apache.org/Zk6 for more details on these contributions.
+
+
+Release 0.10 - 09/25/2011
+-------------------------
+
+The most notable changes in Tika 0.10 over previous releases are:
+
+ * A parser for CHM help files was added. (TIKA-245)
+
+ * TIKA-698: Invalid characters are now replaced with the Unicode
+   replacement character (U+FFFD), whereas before such characters were
+   replaced with spaces, so you may need to change your processing of
+   Tika's output to now handle U+FFFD.
+
+ * The RTF parser was rewritten to perform its own direct shallow


[... 879 lines stripped ...]
