[poppler] Recent changes in whitespace rendering with physical_layout

2021-05-03 Thread Jeroen Ooms
I maintain R bindings called pdftools, mostly used for extracting text from scientific documents. The bindings wrap the C++ API, in particular we convert pdf to text using poppler::page::text() with physical_layout. Recently users have started to report changes in behaviour with newer versions of

[poppler] How to normalize MathematicalPi text?

2019-03-13 Thread Jeroen Ooms
A researcher who is using the R bindings to analyze large numbers of scientific papers has asked me for advice on the following: When extracting results from scientific PDFs, sometimes math symbols cannot be extracted because the symbols are encoded with a custom font called Mathematical-Pi [1]. An

Re: [poppler] Static build

2018-12-05 Thread Jeroen Ooms
On Wed, Dec 5, 2018 at 5:12 PM Ranjan Ghosh wrote: > > Hmm. I think it doesn't work that easily. Actually, I'm trying to build a > static pdf2svg which uses poppler in turn. I tried to follow your > advice and installed libcairo-dev, libopenjp2-7-dev, libjpeg-dev, etc. > and then simply compiled

Re: [poppler] Static build

2018-12-04 Thread Jeroen Ooms
On Tue, Dec 4, 2018 at 4:44 PM Ranjan Ghosh wrote: > > Hi all, > > I'm desperately trying to create a fully static build without any > dependencies. I already got pretty far (IMHO) and built lots and lots of > other dependent libraries statically (cairo, freetype etc.) without > encountering any

Re: [poppler] c++ ustring encoding still completely broken

2018-12-03 Thread Jeroen Ooms
On Sun, Dec 2, 2018 at 12:51 PM Adam Reichold wrote: > > Hello, > > Am 02.12.18 um 00:06 schrieb Albert Astals Cid: > > El dissabte, 1 de desembre de 2018, a les 23:20:46 CET, Jeroen Ooms va > > escriure: > >> I maintain the poppler bindings for the R progra

[poppler] c++ ustring encoding still completely broken

2018-12-01 Thread Jeroen Ooms
I maintain the poppler bindings for the R programming language and get a lot of bug reports about corrupted text extracted with poppler. Below is a minimal example that illustrates the problem: git clone https://github.com/jeroen/popplertest cd popplertest g++ -std=c++11 encoding.cpp -o

Re: [poppler] something like an "image_list" API for cpp frontend

2018-10-27 Thread Jeroen Ooms
> El dilluns, 2 d’abril de 2018, a les 10:22:51 CEST, suzuki toshiya va > escriure: > > if it is not the time to put "image_list" into cpp frontend > It is ok, actually i know someone else that wanted to do that. FWIW, if Suzuki is still interested, I would be very happy to extract images via

Re: [poppler] Example travis configuration file for poppler

2018-08-20 Thread Jeroen Ooms
On Sun, Aug 19, 2018 at 11:23 PM, Albert Astals Cid wrote: > El dilluns, 12 de febrer de 2018, a les 16:03:35 CEST, Jeroen Ooms va > escriure: >> We have been working on a .travis configuration file to automatically >> test poppler feature branches on various linux and

Re: [poppler] poppler::ustring encoding issue

2018-04-12 Thread Jeroen Ooms
FYI the encoding problems still exist in the master branch today. I am very interested in this patch by mpsuzuki, what can we do to move this forward? On Wed, Mar 28, 2018 at 2:26 PM, suzuki toshiya wrote: > Dear Adam, > > Adam Reichold wrote: >>> I see. where

Re: [poppler] poppler::ustring encoding issue

2018-03-26 Thread Jeroen Ooms
On Mon, Mar 26, 2018 at 10:06 PM, Albert Astals Cid wrote: > El diumenge, 25 de març de 2018, a les 5:39:18 CEST, suzuki toshiya va > escriure: >> Hi all, >> >> Finally I think I found the root of issue and I can propose a fix. >> pre-patch situation is like this: >>

Re: [poppler] poppler::ustring encoding issue

2018-03-26 Thread Jeroen Ooms
On Sun, Mar 25, 2018 at 5:39 AM, suzuki toshiya wrote: > My fix consists from 2 parts. > > part 1) > I replaced all detail::unicode_GooString_to_ustring() by ustring::from_utf8(), > this was suggested by Adam. > >

Re: [poppler] Requires.private field missing in poppler.pc

2018-03-22 Thread Jeroen Ooms
On Thu, Mar 22, 2018 at 8:53 AM, suzuki toshiya wrote: > Dear Jeroen, > > Please check https://github.com/mpsuzuki/poppler/tree/for-travis whether it > can serve for you. Yes this works! I now get: pkg-config --libs-only-l poppler-cpp -lpoppler-cpp pkg-config

Re: [poppler] poppler::ustring encoding issue

2018-03-21 Thread Jeroen Ooms
Thanks everyone for the work on this issue, really appreciate the input. Also excited about mpsuzuki's suggestion to include font data with the text_list, this will be super helpful. I have updated and cleaned my example code a little bit to make it easier to test these issues. The updated test

[poppler] Requires.private field missing in poppler.pc

2018-03-20 Thread Jeroen Ooms
Currently pkg-config does not correctly list the dependency libs for static linking when running with --static: pkg-config --libs --static poppler-cpp -lpoppler-cpp -lpoppler The output of --static should also include the recursive dependencies such as -lcairo -llcms2 -lopenjp2 -ltiff. For

Re: [poppler] poppler::ustring encoding issue

2018-03-06 Thread Jeroen Ooms
On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold wrote: > Hello mpsuzuki, > > from a glance at the code, it seems page::text uses ustring::from_utf8 > to convert Poppler's GooString into ustring which seems correct if > GlobalParams::textEncoding has its default value of

Re: [poppler] poppler::ustring encoding issue

2018-03-05 Thread Jeroen Ooms
wrong here? On Mon, Mar 5, 2018 at 3:10 PM, Jeroen Ooms <jer...@berkeley.edu> wrote: > I'm testing the new page::text_list() function but I run into an old > problem where the conversion of the ustring to UTF-8 doesn't do what I > expect: > > byte_array buf = x.to_utf8();

[poppler] poppler::ustring encoding issue

2018-03-05 Thread Jeroen Ooms
I'm testing the new page::text_list() function but I run into an old problem where the conversion of the ustring to UTF-8 doesn't do what I expect: byte_array buf = x.to_utf8(); std::string y(buf.begin(), buf.end()); const char * str = y.c_str(); The resulting char * is not UTF-8. It
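A minimal sketch of how the claim "the resulting char * is not UTF-8" could be checked on the std::string y built above. The helper is_valid_utf8 is hypothetical (it is not part of poppler) and only tests whether the bytes are well-formed UTF-8; it says nothing about which encoding the bytes are actually in.

```cpp
#include <cstddef>
#include <string>

// Hypothetical diagnostic helper, not part of poppler:
// returns true if the bytes in `s` form well-formed UTF-8.
bool is_valid_utf8(const std::string &s) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;   // number of bytes in this sequence
        unsigned long cp;  // code point being decoded
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;               // stray continuation or invalid lead byte
        if (i + len > n) return false;   // sequence cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[len]) return false;               // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        i += len;
    }
    return true;
}
```

Calling is_valid_utf8(y) right after constructing y would show whether to_utf8() produced well-formed UTF-8 for a given text box.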

Re: [poppler] gfile.cc fails to build on macos due to statbuf.st_mtim

2018-02-18 Thread Jeroen Ooms
On Mon, Feb 12, 2018 at 3:04 PM, Ihar Filipau <thephil...@gmail.com> wrote: > On 2/12/18, Jeroen Ooms <jer...@berkeley.edu> wrote: >> On Sun, Feb 11, 2018 at 12:11 PM, Albert Astals Cid <aa...@kde.org> wrote: >>> You're never assigning to tv_nsec in there

[poppler] Example travis configuration file for poppler

2018-02-12 Thread Jeroen Ooms
We have been working on a .travis configuration file to automatically test poppler feature branches on various linux and osx configurations. Perhaps this may be interesting to other poppler users as well. The example ".travis.yml" file can be copied from:

Re: [poppler] gfile.cc fails to build on macos due to statbuf.st_mtim

2018-02-12 Thread Jeroen Ooms
On Sun, Feb 11, 2018 at 12:11 PM, Albert Astals Cid wrote: > You're never assigning to tv_nsec in there but still use it in a comparison, > that needs fixing. You are right. I think we should compare modification time only by seconds. The standard definition of 'struct stat' only
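A minimal sketch of what comparing modification time "only by seconds" could look like. This is not the patch that went into gfile.cc; the helper names same_mtime_seconds and mtime_unchanged are hypothetical. It relies only on the st_mtime field, which is available on both Linux and macOS, and ignores the nanosecond part entirely.

```cpp
#include <sys/stat.h>

// Hypothetical helper: true when both stat results carry the same
// modification time at one-second resolution (nanoseconds are ignored).
bool same_mtime_seconds(const struct stat &a, const struct stat &b) {
    return a.st_mtime == b.st_mtime;
}

// Hypothetical helper: re-stat `path` and compare against an earlier result.
// Returns false if the file can no longer be stat'ed.
bool mtime_unchanged(const char *path, const struct stat &earlier) {
    struct stat now;
    if (stat(path, &now) != 0) {
        return false;
    }
    return same_mtime_seconds(earlier, now);
}
```

The trade-off of this approach is that two modifications within the same second are indistinguishable.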

[poppler] gfile.cc fails to build on macos due to statbuf.st_mtim

2018-02-09 Thread Jeroen Ooms
After recent changes (after 0.62) the master branch no longer builds on macos. The issue is that the statbuf struct does not have an "st_mtim" field on macos: [ 0%] Building CXX object CMakeFiles/poppler.dir/goo/gfile.cc.o /Users/jeroen/Desktop/popplergit/goo/gfile.cc:690:34: error: no member

[poppler] How to read textbox positions?

2017-12-27 Thread Jeroen Ooms
Is there a method in poppler-cpp to extract text from a pdf document, including the position of each text box? Currently we use page->text() with page::physical_layout which gives all text per page, but I need more detailed information about each text box per page.

Re: [poppler] New entry for "programs using poppler"

2017-11-13 Thread Jeroen Ooms
On Mon, Nov 13, 2017 at 12:10 AM, Albert Astals Cid <aa...@kde.org> wrote: > > El dimarts, 7 de novembre de 2017, a les 11:29:25 CET, Jeroen Ooms va > escriure: > > Would it be possible to mention the R package 'pdftools' [1] on the > > poppler website [2] under pr

[poppler] New entry for "programs using poppler"

2017-11-07 Thread Jeroen Ooms
Would it be possible to mention the R package 'pdftools' [1] on the poppler website [2] under programs using poppler? The R package is quite popular among (data) scientists for extracting text and data from pdf documents such as scientific papers or public records. [1]

Re: [poppler] How to set custom 'share' directory

2017-11-04 Thread Jeroen Ooms
On Wed, Nov 1, 2017 at 9:31 PM, Jeroen Ooms <jer...@berkeley.edu> wrote: > Hmm that may be the function I am looking for but I don't understand > where I should pass a GlobalParams object when reading a pdf file. I > tried setting the 'globalParams' global when loadin

Re: [poppler] Fix building static libraries with cmake

2017-11-02 Thread Jeroen Ooms
On Thu, Nov 2, 2017 at 12:23 AM, Albert Astals Cid wrote: > Email here or use bugzilla. OK, attached is the patch (staticlib.patch).

Re: [poppler] How to set custom 'share' directory

2017-11-01 Thread Jeroen Ooms
On Wed, Nov 1, 2017 at 3:51 PM, Jason Crain wrote: > > I don't know how you use poppler in your project, but you may also have > the option of passing in the path when you construct the GlobalParams > object. Hmm that may be the function I am looking for but I don't

Re: [poppler] How to set custom 'share' directory

2017-11-01 Thread Jeroen Ooms
On Wed, Nov 1, 2017 at 3:20 PM, Jason Crain wrote: > On Mac and Linux the path is hardcoded at compilation time. It's > generally in /usr/share/poppler on Linux. Not sure what the standard is > on Mac. On Windows it looks in \share\poppler relative to the >

[poppler] How to set custom 'share' directory

2017-11-01 Thread Jeroen Ooms
I maintain the poppler bindings for R, which work on Windows, MacOS and Linux. However, Chinese users on Windows/Mac have reported that poppler doesn't find the share data files: error: Missing language pack for 'Adobe-CNS1' mapping Where exactly does poppler look for the 'share' directory? Is

[poppler] Fix building static libraries with cmake

2017-11-01 Thread Jeroen Ooms
Several projects use static builds of poppler-cpp to ship standalone pdf applications, but since the switch to cmake it is no longer possible to build static libs. Setting -DBUILD_SHARED_LIBS=OFF in cmake only builds a static libpoppler.a, however libpoppler-cpp still gets built as a dynamic

Re: [poppler] Poppler 0.60.0 released

2017-10-04 Thread Jeroen Ooms
On Tue, Oct 3, 2017 at 11:40 PM, Albert Astals Cid wrote: > > You should probably open a bug since i don't think this is something that > is > going to get fixed fast, unless maybe you can just workaround it by > building > it twice? (i know it sucks build stuff twice) > OK I have

Re: [poppler] Poppler 0.60.0 released

2017-10-03 Thread Jeroen Ooms
On Tue, Oct 3, 2017 at 12:00 AM, Albert Astals Cid wrote: > > * cmake is now the default build system > * autotools based build system has been removed > After upgrading, homebrew no longer ships static libs :( Is there a way to make cmake produce both static and shared libs?

Re: [poppler] ZapfDingbats cannot be found on Windows

2017-09-07 Thread Jeroen Ooms
On Wed, Sep 6, 2017 at 7:38 PM, Albert Astals Cid wrote: >> The solution would be to ensure d05l.pfb is available, but whose >> responsibility should that be -- the poppler library, the client >> application that uses poppler, or the individual end user? > > Not poppler, we're

Re: [poppler] ZapfDingbats cannot be found on Windows

2017-09-06 Thread Jeroen Ooms
On Wed, Sep 6, 2017 at 9:10 AM, Jonathan Kew wrote: > On 05/09/2017 21:03, Albert Astals Cid wrote: >> >> ./GlobalParamsWin.cc:102:{"ZapfDingbats", "d05l.pfb", >> "wingding.ttf", gTrue}, >> >> This is the substitution table, i guess those files are not

[poppler] ZapfDingbats cannot be found on Windows

2017-09-05 Thread Jeroen Ooms
A user has reported an issue [1] with a pdf incorrectly rendering on windows. Poppler gives the following which is probably the reason that the dots do not render correctly: > Warning: error: Couldn't find a font for 'ZapfDingbats' I have tried building poppler --with-font-configuration=win32

[poppler] extract links form pdf using poppler-cpp api

2017-08-31 Thread Jeroen Ooms
Some users of the R bindings have requested a way to extract hyperlinks from a pdf file. However it seems that currently this functionality is only available in the qt api, but not in the cpp api?

Re: [poppler] c++ interface segfaults for invalid password

2017-06-12 Thread Jeroen Ooms
Some more debugging and a bug report regarding this issue: https://bugs.freedesktop.org/show_bug.cgi?id=101385 On Sun, Jun 11, 2017 at 1:51 PM, Jeroen Ooms <jer...@berkeley.edu> wrote: > If the user enters an incorrect password when reading a protected pdf via > document::load_f

[poppler] c++ interface segfaults for invalid password

2017-06-11 Thread Jeroen Ooms
If the user enters an incorrect password when reading a protected pdf via document::load_from_raw_data(), an error is printed to stdout: error: Incorrect password However load_from_raw_data() does not raise an exception and returns a valid *document. However this document segfaults when we

[poppler] How to render greyscale pdf using c++ api

2017-06-07 Thread Jeroen Ooms
When rendering a black and white pdf file using poppler::page_renderer, the image always comes out as image::format_argb32 rather than image::format_mono. Therefore the png is 4x larger than expected. Is this expected or do I manually need to set the image format somewhere in the renderer? My

Re: [poppler] page.text() does not take page orientation into account?

2016-04-13 Thread Jeroen Ooms
On Tue, Mar 8, 2016 at 2:34 PM, Jeroen Ooms <jeroen.o...@stat.ucla.edu> wrote: > When extracting text from a landscape pdf file using the cpp > interface, text at the far right of the page does not get extracted. I > think the problem is that page.text() always assumes portrai

[poppler] page.text() does not take page orientation into account?

2016-03-08 Thread Jeroen Ooms
When extracting text from a landscape pdf file using the cpp interface, text at the far right of the page does not get extracted. I think the problem is that page.text() always assumes portrait orientation and hence underestimates the width of the page: p->text() p->text(p->page_rect()) Is

[poppler] poppler.pc missing Requires.private / Libs.private

2016-03-02 Thread Jeroen Ooms
The pkgconfig file for poppler does not contain the configured dependencies required for static linking: > pkg-config --libs --static poppler-cpp -L/usr/local/Cellar/poppler/0.41.0/lib -lpoppler-cpp -lpoppler This is certainly incomplete. Correct output (in my case) would be something

[poppler] Malformed/random output for raw_order_layout with c++ interface

2016-03-02 Thread Jeroen Ooms
I am trying to get the same (or similar) text output from the c++ interface as when using the 'pdftotext' utility without the -layout option. However raw_order_layout gives malformed output (no text at all for most pages): ustring str = p->text(p->page_rect(), page::raw_order_layout); An

[poppler] PDF tables and parsing errors

2016-02-26 Thread Jeroen Ooms
We are using poppler for parsing and indexing scientific articles. For this purpose I wrote some bindings to poppler-cpp for the R programming language. A few questions: - Many of our pdf files give parsing errors, such as "Failed to get object num from hint tables" or "Expected the optional