RE: Parsing PDF file - setting threshold of unmapped characters

2021-04-14 Thread Peter Kronenberg
On Wed, 14 Apr 2021, Peter Kronenberg wrote:
> Anyone have any thoughts on this?
I think both an absolute and a percentage would be good, but I don't have enough experience to comm

RE: Parsing PDF file - setting threshold of unmapped characters

2021-04-14 Thread Nick Burch
I’ve been thinking about this and I think it would be a good idea to change the comparison of unmapped characters to a percentage. For example, you suggested unmappedUnicodeCharsPerPage > 10 && percentUnmappedUnicode

RE: Parsing PDF file - setting threshold of unmapped characters

2021-04-14 Thread Peter Kronenberg
I’ve been thinking about this and I think it would be a good idea to change the comparison of unmapped characters to a percentage. For example, you suggested unmappedUnicodeChars

RE: Parsing PDF file - setting threshold of unmapped characters

2021-04-11 Thread Peter Kronenberg
I’ve been thinking about this and I think it would be a good idea to change the comparison of unmapped characters to a percentage. For example, you suggested unmappedUnicodeCharsPerPage > 10 && percentUnmappedUnicodeChars > 0.2 or something? The percentage could be configurable. Another
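
For anyone following the thread, here is a rough sketch of the combined check being discussed: an absolute floor on unmapped characters per page plus a ratio test. The class name, method name, and the totalCharsPerPage parameter are made up for illustration and are not Tika's API. The values 10 and 0.2 come from the example in the message above, with 0.2 read as a fraction (20%) rather than a literal percent. That reading is an assumption.

```java
/** Hypothetical helper for illustration only; not part of Tika. */
final class UnmappedCharThresholdSketch {

    // Example values from the thread; both would be configurable.
    static final int MIN_UNMAPPED_PER_PAGE = 10;
    static final double MIN_UNMAPPED_FRACTION = 0.2; // assumed to mean 20%

    /**
     * Returns true only when a page has more than the absolute number of
     * unmapped characters AND they make up more than the given fraction
     * of all characters on that page.
     */
    static boolean exceedsThreshold(int unmappedUnicodeCharsPerPage, int totalCharsPerPage) {
        if (totalCharsPerPage <= 0) {
            // Assumption: a page with no characters never trips the check.
            return false;
        }
        double percentUnmappedUnicodeChars =
                (double) unmappedUnicodeCharsPerPage / totalCharsPerPage;
        return unmappedUnicodeCharsPerPage > MIN_UNMAPPED_PER_PAGE
                && percentUnmappedUnicodeChars > MIN_UNMAPPED_FRACTION;
    }

    public static void main(String[] args) {
        // Many unmapped chars in absolute terms, but a tiny share of a long page.
        System.out.println(exceedsThreshold(15, 1000));
        // Both the absolute count and the share are above the limits.
        System.out.println(exceedsThreshold(30, 100));
        // High share, but too few characters for the absolute floor.
        System.out.println(exceedsThreshold(5, 10));
    }
}
```

The point of requiring both conditions is that a long page with a handful of unmapped characters, or a nearly empty page with a few bad glyphs, would not trip the check on its own.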