[ https://issues.apache.org/jira/browse/TIKA-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778483#action_12778483 ]
Erik Hetzner commented on TIKA-320:
-----------------------------------

Wonderful, thanks!

> Allow disabling language detection in AutoDetectParser
> ------------------------------------------------------
>
>                 Key: TIKA-320
>                 URL: https://issues.apache.org/jira/browse/TIKA-320
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.5
>            Reporter: Erik Hetzner
>            Assignee: Jukka Zitting
>             Fix For: 0.5
>
>
> It should be possible to disable language detection in the AutoDetectParser.
> Between 0.4 and the current trunk, the time Tika spent parsing my test data
> (100MB of compressed web crawl data, mixed HTML, images, etc.) increased
> considerably. After profiling, I determined that most of the time was spent
> in language detection.
>
> time results of indexing my test data with Lucene using AutoDetectParser:
>
> real 15m21.020s
> user 6m31.344s
> sys 0m4.556s
>
> time results on the same test data using the same code as AutoDetectParser,
> but with language detection disabled:
>
> real 4m48.856s
> user 2m9.416s
> sys 0m3.484s
>
> Obviously these numbers are worthless in their particulars but I think they
> demonstrate that one ought to be able to turn off language detection, as it
> can massively slow down parsing.
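
For a rough sense of scale, the two sets of `time` results quoted in the issue can be compared directly. The snippet below is a minimal sketch and not part of the original report: the helper name `parse_time` and the dictionary names are hypothetical, and it assumes only the `NmS.SSSs` format that the shell's `time` prints. It says nothing about how Tika itself enables or disables language detection.

```python
import re


def parse_time(value: str) -> float:
    """Convert a shell `time` value such as '15m21.020s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Figures copied from the issue description.
with_detection = {"real": "15m21.020s", "user": "6m31.344s", "sys": "0m4.556s"}
without_detection = {"real": "4m48.856s", "user": "2m9.416s", "sys": "0m3.484s"}

# Ratio of time with language detection to time without it, per category.
ratios = {
    key: parse_time(with_detection[key]) / parse_time(without_detection[key])
    for key in with_detection
}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x longer with language detection")
```

On these figures the wall-clock (`real`) time is a little over three times longer with language detection enabled. As the report itself notes, the particular numbers come from a single test set and should not be read as a general benchmark.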