[ 
https://issues.apache.org/jira/browse/TIKA-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778483#action_12778483
 ] 

Erik Hetzner commented on TIKA-320:
-----------------------------------

Wonderful, thanks!

> Allow disabling language detection in AutoDetectParser
> ------------------------------------------------------
>
>                 Key: TIKA-320
>                 URL: https://issues.apache.org/jira/browse/TIKA-320
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.5
>            Reporter: Erik Hetzner
>            Assignee: Jukka Zitting
>             Fix For: 0.5
>
>
> It should be possible to disable language detection in the AutoDetectParser.
> Between 0.4 and the current trunk, the time Tika spent parsing my test data
> (100MB of compressed web crawl data, mixed HTML, images, etc.) increased
> considerably. After profiling, I determined that most of the time was spent
> in language detection.
>
> Output of time when indexing my test data with Lucene using AutoDetectParser:
>
> real  15m21.020s
> user  6m31.344s
> sys   0m4.556s
>
> Output of time on the same test data, using the same code as AutoDetectParser
> but with language detection disabled:
>
> real  4m48.856s
> user  2m9.416s
> sys   0m3.484s
>
> Obviously these numbers are worthless in their particulars, but I think they
> demonstrate that one ought to be able to turn off language detection, as it
> can massively slow down parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
