Allow disabling language detection in AutoDetectParser
------------------------------------------------------

                 Key: TIKA-320
                 URL: https://issues.apache.org/jira/browse/TIKA-320
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 0.5
            Reporter: Erik Hetzner


It should be possible to disable language detection in the AutoDetectParser.

Between 0.4 and the current trunk, the time Tika spent parsing my test data 
(100MB of compressed web crawl data, mixed HTML, images, etc.) increased 
considerably. After profiling, I determined that most of the time was spent in 
language detection. 

time results of indexing my test data with Lucene using AutoDetectParser:

real    15m21.020s
user    6m31.344s
sys     0m4.556s

time results on the same test data using the same code as AutoDetectParser, but 
with language detection disabled:

real    4m48.856s
user    2m9.416s
sys     0m3.484s

Obviously these numbers are worthless in their particulars but I think they 
demonstrate that one ought to be able to turn off language detection, as it can 
massively slow down parsing.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to