, 2021 10:02 AM
To: user@tika.apache.org
Subject: RE: Parsing PDF file - setting threshold of unmapped characters
On Wed, 14 Apr 2021, Peter Kronenberg wrote:
> Anyone have any thoughts on this?
I think both an absolute and a percentage would be good, but I don't have
enough experience to comm
I’ve been thinking about this and I think it would be a good idea to change the
comparison of unmapped characters to a percentage. For example, you suggested
unmappedUnicodeCharsPerPage > 10 && percentUnmappedUnicodeChars > 0.2 or
something?
The percentage could be configurable.
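
A minimal sketch of the combined check described above, assuming per-page counts of unmapped and total characters are already available. The class and method names are hypothetical and not part of Tika's API; 10 and 0.2 are only the example values floated in this thread, with 0.2 read as a fraction of the page's characters.

```java
/**
 * Hypothetical helper illustrating the proposed rule; not an existing Tika class.
 * A page is flagged only when BOTH the absolute count and the fraction of
 * unmapped characters exceed their configurable thresholds.
 */
final class UnmappedCharThreshold {
    private final int maxUnmappedPerPage;      // absolute threshold, e.g. 10
    private final double maxUnmappedFraction;  // fractional threshold, e.g. 0.2

    UnmappedCharThreshold(int maxUnmappedPerPage, double maxUnmappedFraction) {
        this.maxUnmappedPerPage = maxUnmappedPerPage;
        this.maxUnmappedFraction = maxUnmappedFraction;
    }

    /** Returns true if the page has too many unmapped characters by both measures. */
    boolean exceeded(int unmappedChars, int totalChars) {
        if (totalChars <= 0) {
            return false; // no extracted text, so no fraction can be computed
        }
        double fraction = (double) unmappedChars / totalChars;
        return unmappedChars > maxUnmappedPerPage && fraction > maxUnmappedFraction;
    }
}
```

With thresholds of 10 and 0.2, a page with 11 unmapped characters out of 1000 would not be flagged (the fraction is too low), and neither would a page with 5 out of 10 (the absolute count is too low); a page with 50 out of 100 would be.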
Another