Thanks. The square union symbol appears on page 15, near
"the space obtained from M1 ∪ M2 by gluing along φ"
where a squared glyph is used instead of "∪". I also found this file:
https://github.com/kohler/lcdf-typetools/blob/master/texglyphlist.txt
so "unionsq" comes from a TeX-specific extension of the glyph list.
Also:
https://gist.github.com/RAnders00/09b69031fb0cdd429ba1e3e75cce2898
The license is LGPL, so we can't use it. But we don't have to worry about
that, because Adobe also fails to extract the "∪" from this text.
I was also wondering what this squared union symbol means. It turns out
that there are other such symbols and that they have no universal
meaning.
https://math.stackexchange.com/questions/1929439/what-does-square-subset-and-square-union-symbol-mean
https://math.stackexchange.com/questions/1569400/does-sqsubset-have-any-special-meaning
I'll keep page 15 for my own text extraction tests to detect related
code changes.
Tilman
On 03.08.2023 09:42, Brangs, Erik wrote:
Hi,
thank you.
Here is a link to a PDF that shows the unionsq warning:
https://d-nb.info/1267991550/34
-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Wednesday, 2 August 2023 20:18
To: users@pdfbox.apache.org
Subject: Re: Supressing warnings for missing unicode mappings
Hi,
Yes, reducing logging is the way. I don't know if there are more.
I'd also be interested in the "unionsq" file, I wonder if this is a
false positive. This happens because "uniNNNN" is a valid glyph name.
There is unionsqdisplay and unionsqtext too, but not unionsq.
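To illustrate why a name like "unionsq" can trip such a check, here is a minimal sketch. It is not the PDFBox code: the class and method names are hypothetical, and the rules are simplified from the glyph naming convention ("uni" plus four hex digits, or "u" plus four to six hex digits).

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not part of PDFBox.
final class UniGlyphNameSketch {

    // Returns the decoded character for "uniXXXX" / "uXXXX[XX]" style names,
    // or empty if the name does not fit the pattern or the digits are not hex.
    static Optional<String> decode(String name) {
        String hex;
        if (name.startsWith("uni") && name.length() == 7) {
            hex = name.substring(3);      // "uni2294" -> "2294", "unionsq" -> "onsq"
        } else if (name.startsWith("u") && name.length() >= 5 && name.length() <= 7) {
            hex = name.substring(1);      // "u2294" -> "2294", "users" -> "sers"
        } else {
            return Optional.empty();      // not a candidate at all
        }
        try {
            int codePoint = Integer.parseInt(hex, 16);
            if (!Character.isValidCodePoint(codePoint)) {
                return Optional.empty();
            }
            return Optional.of(new String(Character.toChars(codePoint)));
        } catch (NumberFormatException e) {
            // This is the spot where a "Not a number" style warning would be logged.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"uni2294", "unionsq", "users"}) {
            System.out.println(name + " -> " + decode(name).orElse("(no mapping)"));
        }
    }
}
```

"unionsq" has the prefix and the length of a uniXXXX name, so it is treated as a candidate, and only the hex parsing reveals that it is not one. Under the same assumptions, "users" looks like a uXXXX name.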
Tilman
On 02.08.2023 11:20, Brangs, Erik wrote:
Hi,
we're using PDFBox 3.0.0-beta1 to extract text from PDFs. This produces lots
of warnings about missing unicode mappings. Is there a programmatic way to
suppress those messages or would it be better to configure the logging to do
that?
If it's better to configure logging, I would try to configure the logging
level for PDSimpleFont, PDType0Font, PDFont and GlyphList. Are those all
relevant loggers or are there any more?
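For reference, a minimal sketch of what raising the level for those four loggers could look like. It assumes that the PDFBox log output ends up in java.util.logging and that the logger names equal the fully qualified class names. The class name QuietPdfBoxFontLogging is made up. With a different logging backend, the same logger names would go into that backend's configuration instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper; assumes the log output is routed to java.util.logging.
final class QuietPdfBoxFontLogging {

    // java.util.logging only keeps weak references to loggers, so hold on to
    // them here; otherwise the configured level could be lost after a GC.
    private static final List<Logger> PINNED = new ArrayList<>();

    // Assumed logger names: the fully qualified names of the four classes.
    static final String[] LOGGER_NAMES = {
        "org.apache.pdfbox.pdmodel.font.PDSimpleFont",
        "org.apache.pdfbox.pdmodel.font.PDType0Font",
        "org.apache.pdfbox.pdmodel.font.PDFont",
        "org.apache.pdfbox.pdmodel.font.encoding.GlyphList",
    };

    // Raises the threshold so that warnings from these loggers are dropped.
    static void apply() {
        for (String name : LOGGER_NAMES) {
            Logger logger = Logger.getLogger(name);
            logger.setLevel(Level.SEVERE);
            PINNED.add(logger);
        }
    }

    public static void main(String[] args) {
        apply(); // call once at startup, before any PDF is loaded
        for (String name : LOGGER_NAMES) {
            System.out.println(name + " -> " + Logger.getLogger(name).getLevel());
        }
    }
}
```

This only hides the messages. The text extraction result itself is not changed by it.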
For GlyphList, the most common warning is "Not a number in Unicode character
name: unionsq". I also saw a warning "Not a number in Unicode character name:
users" but only for one PDF.
Kind regards
Erik Brangs
*** Search. Find. Discover. Deutsche Nationalbibliothek ***
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org