[jira] [Created] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0

2018-03-14 Thread Andreas Meier (JIRA)
Andreas Meier created TIKA-2607:
--------------------------------

             Summary: Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0
                 Key: TIKA-2607
                 URL: https://issues.apache.org/jira/browse/TIKA-2607
             Project: Tika
          Issue Type: Improvement
          Components: core, parser
            Reporter: Andreas Meier


jbig2-imageio (formerly from levigo) is now ASL 2.0 compatible, and version
3.0.0 has been released as a subproject of PDFBox. See https://pdfbox.apache.org/

Therefore the old implementation, with its test-scope restriction,

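{code:xml}
<dependency>
  <groupId>com.levigo.jbig2</groupId>
  <artifactId>levigo-jbig2-imageio</artifactId>
  <version>1.6.5</version>
  <scope>test</scope>
</dependency>
{code}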

can be replaced with 

{code:xml}
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>jbig2-imageio</artifactId>
  <version>3.0.0</version>
</dependency>
{code}

See also TIKA-2232
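
One way to confirm the swap took effect might be to ask ImageIO which readers it
can find for the format. The sketch below uses only the standard javax.imageio
API. The class name ImageReaderCheck is hypothetical, and the format name
"JBIG2" is an assumption about how the plugin registers itself.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
final class ImageReaderCheck {
    private ImageReaderCheck() {}

    /** Returns the class names of all ImageIO readers registered for a format name. */
    static List<String> readerClassNames(String formatName) {
        // Re-scan the classpath so plugins shipped as jars are picked up.
        ImageIO.scanForPlugins();
        List<String> names = new ArrayList<>();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            names.add(readers.next().getClass().getName());
        }
        return names;
    }

    /** True if at least one reader is registered for the format name. */
    static boolean isAvailable(String formatName) {
        return !readerClassNames(formatName).isEmpty();
    }
}
```

Calling ImageReaderCheck.readerClassNames("JBIG2") with the new dependency on
the classpath should list a reader class if the plugin registered under that
name; an empty list would suggest the jar is missing or uses a different name.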



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0

2018-03-14 Thread Andreas Meier (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398247#comment-16398247 ]

Andreas Meier commented on TIKA-2607:
-------------------------------------

[~talli...@mitre.org] I hope you don't mind that I created the ticket; I guess
it was already on your agenda.

> Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0
> -
>
> Key: TIKA-2607
> URL: https://issues.apache.org/jira/browse/TIKA-2607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, parser
>Reporter: Andreas Meier
>Priority: Major





[jira] [Commented] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0

2018-03-14 Thread Tim Allison (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398438#comment-16398438 ]

Tim Allison commented on TIKA-2607:
-----------------------------------

[~AndreasMeier], do I mind a contribution to Apache Tika? LOL.  Not in the 
least!  Keep them coming!  You're right, though, that migrating to PDFBox's 
jbig2 was going to be part of TIKA-2579, but, no, I don't mind in the least.  
Thank you!






[jira] [Updated] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0

2018-03-14 Thread Tim Allison (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-2607:
------------------------------
Issue Type: Sub-task  (was: Improvement)
Parent: TIKA-2579






TIKA-1509 (2.x breaking parser change) - ready for first review!

2018-03-14 Thread Nick Burch

Hi All

As promised, I've finally had a go at implementing my ideas for
TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion /
the breaking 2.x parser change.


My work so far is in this github branch, and is ready for review!
https://github.com/apache/tika/tree/multiple-parsers


It seems to work fine for the Fallback case and for the Supplemental
case. You can set a policy that controls how clashing metadata is handled.
The current options are "first one to set a key wins", "last one to set a
key wins", "ignore previous parsers", and "keep old and new unique values".
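
To make the four policies concrete, here is a minimal sketch of how such a
merge could behave. It is not the code from the branch: ClashPolicy and
MetadataMerger are hypothetical names, and metadata is modelled as a plain
Map<String, List<String>> rather than Tika's Metadata class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is an illustration, not the branch's code.
enum ClashPolicy {
    FIRST_WINS,       // first parser to set a key wins
    LAST_WINS,        // last parser to set a key wins
    DISCARD_PREVIOUS, // ignore everything earlier parsers set
    KEEP_ALL          // keep old and new unique values
}

final class MetadataMerger {
    private MetadataMerger() {}

    /** Merges a later parser's metadata into what earlier parsers produced. */
    static Map<String, List<String>> merge(Map<String, List<String>> earlier,
                                           Map<String, List<String>> later,
                                           ClashPolicy policy) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        if (policy != ClashPolicy.DISCARD_PREVIOUS) {
            // Start from a copy of the earlier values.
            earlier.forEach((key, values) -> out.put(key, new ArrayList<>(values)));
        }
        for (Map.Entry<String, List<String>> e : later.entrySet()) {
            List<String> existing = out.get(e.getKey());
            if (existing == null || policy == ClashPolicy.LAST_WINS) {
                // No clash, or the later parser is allowed to overwrite.
                out.put(e.getKey(), new ArrayList<>(e.getValue()));
            } else if (policy == ClashPolicy.KEEP_ALL) {
                // Append only values that are not already present.
                for (String value : e.getValue()) {
                    if (!existing.contains(value)) {
                        existing.add(value);
                    }
                }
            }
            // FIRST_WINS with a clash: keep the earlier values untouched.
        }
        return out;
    }
}
```

With earlier = {title: [A]} and later = {title: [B, A]}, FIRST_WINS keeps [A],
LAST_WINS and DISCARD_PREVIOUS give [B, A], and KEEP_ALL gives [A, B].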


I've also done a proof of concept for the "pick best" case: it runs the
text parser with a specified set of different charsets, captures the text
from each, "picks the best" (currently hard-coded to the first), then runs
for real with that one.
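
As a rough illustration of what a non-hard-coded "pick the best" step might
look like, the sketch below decodes the same bytes with each candidate charset
and prefers the one that produces the fewest U+FFFD replacement characters.
The scoring rule and the name CharsetPicker are assumptions made for this
example; the branch itself currently just takes the first candidate.

```java
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical sketch; the scoring heuristic is an assumption, not the branch's logic.
final class CharsetPicker {
    private CharsetPicker() {}

    /** Counts U+FFFD characters, which the decoder substitutes for undecodable bytes. */
    static long replacementCount(String text) {
        return text.chars().filter(c -> c == 0xFFFD).count();
    }

    /**
     * Decodes the data with every candidate and returns the charset whose output
     * has the fewest replacement characters. Ties go to the earlier candidate.
     */
    static Charset pickBest(byte[] data, List<Charset> candidates) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("at least one candidate charset is required");
        }
        Charset best = null;
        long bestScore = Long.MAX_VALUE;
        for (Charset candidate : candidates) {
            // new String(byte[], Charset) replaces malformed input instead of throwing.
            long score = replacementCount(new String(data, candidate));
            if (score < bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }
}
```

The winner would then be handed to the real parse, matching the "run for real
with that one" step described above.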



Key TODOs - Support InputStreamFactory, properly work out what mimetypes 
to claim to support, Tika Config XML friendly helper for the metadata 
clash policy, review ContentHandlerFactory signature and tweak if needed.


Proposed breaking 2.x change - add a second parse method that takes a
ContentHandlerFactory instead of a ContentHandler. Most parsers would just
grab a single handler from the factory and use it as before.
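
The shape of that change could look roughly like the sketch below. HandlerFactory
and SketchParser are simplified, hypothetical stand-ins, not Tika's actual
ContentHandlerFactory or Parser types, and the Metadata and ParseContext
arguments are left out to keep the example short.

```java
import java.io.IOException;
import java.io.InputStream;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical stand-in for a handler factory; one method that makes a fresh handler.
interface HandlerFactory {
    ContentHandler newHandler();
}

// Hypothetical stand-in for a parser with both the old and the proposed method.
interface SketchParser {
    // Existing style: the caller supplies a single handler.
    void parse(InputStream stream, ContentHandler handler)
            throws IOException, SAXException;

    // Proposed style: the caller supplies a factory. By default a parser grabs
    // one handler and carries on exactly as before; a composite parser could
    // override this to request a fresh handler per attempt.
    default void parse(InputStream stream, HandlerFactory factory)
            throws IOException, SAXException {
        parse(stream, factory.newHandler());
    }
}
```

A default method like this would let existing parsers compile unchanged, while
only parsers that need several handlers override the factory variant.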



Before I do any more though... Thoughts? Comments? Ideas? Changes? Should 
I stop? Carry on? Modify it? Other?


Nick