[jira] [Created] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0
Andreas Meier created TIKA-2607:
---

Summary: Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0
Key: TIKA-2607
URL: https://issues.apache.org/jira/browse/TIKA-2607
Project: Tika
Issue Type: Improvement
Components: core, parser
Reporter: Andreas Meier

The jbig2-imageio (formerly levigo) is now ASL 2.0 compatible and Version 3.0.0 of it has been released as subproject of pdfbox. See https://pdfbox.apache.org/

Therefore the old implementation and restriction
{code:xml}
<dependency>
  <groupId>com.levigo.jbig2</groupId>
  <artifactId>levigo-jbig2-imageio</artifactId>
  <version>1.6.5</version>
  <scope>test</scope>
</dependency>
{code}
can be replaced with
{code:xml}
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>jbig2-imageio</artifactId>
  <version>3.0.0</version>
</dependency>
{code}
See also TIKA-2232

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0
[ https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398247#comment-16398247 ]

Andreas Meier commented on TIKA-2607:
-

[~talli...@mitre.org] I hope you don't mind that I created the ticket; I guess it is already on your agenda.

> Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0
> -
>
> Key: TIKA-2607
> URL: https://issues.apache.org/jira/browse/TIKA-2607
> Project: Tika
> Issue Type: Improvement
> Components: core, parser
> Reporter: Andreas Meier
> Priority: Major
>
> The jbig2-imageio (formerly levigo) is now ASL 2.0 compatible and Version
> 3.0.0 of it has been released as subproject of pdfbox. See
> https://pdfbox.apache.org/
> Therefore the old implementation and restriction
> {code:xml}
> <dependency>
>   <groupId>com.levigo.jbig2</groupId>
>   <artifactId>levigo-jbig2-imageio</artifactId>
>   <version>1.6.5</version>
>   <scope>test</scope>
> </dependency>
> {code}
> can be replaced with
> {code:xml}
> <dependency>
>   <groupId>org.apache.pdfbox</groupId>
>   <artifactId>jbig2-imageio</artifactId>
>   <version>3.0.0</version>
> </dependency>
> {code}
> See also TIKA-2232
[jira] [Commented] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0
[ https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398438#comment-16398438 ]

Tim Allison commented on TIKA-2607:
---

[~AndreasMeier], do I mind a contribution to Apache Tika? LOL. Not in the least! Keep them coming! You're right, though, that migrating to PDFBox's jbig2 was going to be part of TIKA-2579, but, no, I don't mind in the least. Thank you!
[jira] [Updated] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0
[ https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2607:
--
Issue Type: Sub-task (was: Improvement)
Parent: TIKA-2579
TIKA-1509 (2.x breaking parser change) - ready for first review!
Hi All,

As promised, I've finally had a go at implementing my ideas for TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion / the breaking 2.x parser change.

My work so far is in this GitHub branch, and is ready for review!
https://github.com/apache/tika/tree/multiple-parsers

It seems to work fine for the Fallback case and for the Supplemental case. You can set a policy that controls how clashing metadata is handled. The current options are:
 * "first one to set a key wins"
 * "last one to set a key wins"
 * "ignore previous parsers"
 * "keep old and new unique values"

I've also done a proof of concept for the "pick best" case: it runs the text parser with a specified set of different charsets, captures the text from each, "picks the best" (currently hard-coded to the 1st...), then runs for real with that one.

Key TODOs:
 * Support InputStreamFactory
 * Properly work out what mimetypes to claim to support
 * Tika Config XML friendly helper for the metadata clash policy
 * Review the ContentHandlerFactory signature and tweak if needed

Proposed breaking 2.x change: add a second parse method that takes a ContentHandlerFactory instead of a ContentHandler. Most parsers that receive a factory would just grab a single ContentHandler from it and use that as before.

Before I do any more though... Thoughts? Comments? Ideas? Changes? Should I stop? Carry on? Modify it? Other?

Nick
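
For readers following along, here is a rough sketch of how the four metadata clash policies described in the message above might behave when one parser's metadata is merged into what earlier parsers produced. It is not taken from the multiple-parsers branch: the class name MetadataClashSketch, the Policy enum and its constants, and the use of plain maps in place of Tika's Metadata class are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and structure are NOT from the Tika branch.
public class MetadataClashSketch {

    // Assumed names for the four policies listed in the message.
    public enum Policy { FIRST_WINS, LAST_WINS, DISCARD_ALL, KEEP_ALL }

    // Merge metadata from the latest parser into metadata from earlier parsers.
    public static Map<String, List<String>> merge(Policy policy,
                                                  Map<String, List<String>> previous,
                                                  Map<String, List<String>> latest) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        // "ignore previous parsers": start from an empty map instead of the old values.
        if (policy != Policy.DISCARD_ALL) {
            previous.forEach((k, v) -> out.put(k, new ArrayList<>(v)));
        }
        for (Map.Entry<String, List<String>> e : latest.entrySet()) {
            String key = e.getKey();
            List<String> values = e.getValue();
            switch (policy) {
                case FIRST_WINS:
                    // Keep whatever an earlier parser set; only fill in missing keys.
                    out.putIfAbsent(key, new ArrayList<>(values));
                    break;
                case KEEP_ALL:
                    // Keep old values and append new ones that are not already present.
                    List<String> merged = out.computeIfAbsent(key, k -> new ArrayList<>());
                    for (String v : values) {
                        if (!merged.contains(v)) {
                            merged.add(v);
                        }
                    }
                    break;
                default:
                    // LAST_WINS and DISCARD_ALL: the latest parser's values replace any old ones.
                    out.put(key, new ArrayList<>(values));
            }
        }
        return out;
    }
}
```

Under this sketch, the only difference between "last one to set a key wins" and "ignore previous parsers" is whether keys that only the earlier parsers set survive the merge.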