I maintain R bindings called pdftools, mostly used for extracting text
from scientific documents. The bindings wrap the C++ API; in
particular, we convert PDF to text using poppler::page::text() with
physical_layout.
Recently users have started to report changes in behaviour with newer
versions of
A researcher who is using the R bindings to analyze large numbers of
scientific papers has asked me for advice on the following:
When extracting results from scientific PDFs, sometimes math symbols
cannot be extracted because the symbols are encoded with a custom font
called Mathematical-Pi [1]. An
On Wed, Dec 5, 2018 at 5:12 PM Ranjan Ghosh wrote:
>
> Hmm. I think it doesn't work that easily. Actually, I'm trying to build a
> static pdf2svg which uses poppler in turn. I tried to follow your
> advice and installed libcairo-dev, libopenjp2-7-dev, libjpeg-dev, etc.
> and then simply compiled
On Tue, Dec 4, 2018 at 4:44 PM Ranjan Ghosh wrote:
>
> Hi all,
>
> I'm desperately trying to create a fully static build without any
> dependencies. I already got pretty far (IMHO) and built lots and lots of
> other dependent libraries statically (cairo, freetype etc.) without
> encountering any
On Sun, Dec 2, 2018 at 12:51 PM Adam Reichold wrote:
>
> Hello,
>
> On 02.12.18 at 00:06, Albert Astals Cid wrote:
> > On Saturday, 1 December 2018, at 23:20:46 CET, Jeroen Ooms wrote:
> >> I maintain the poppler bindings for the R progra
I maintain the poppler bindings for the R programming language and get
a lot of bug reports about corrupted text extracted with poppler.
Below is a minimal example that illustrates the problem:
git clone https://github.com/jeroen/popplertest
cd popplertest
g++ -std=c++11 encoding.cpp -o
> On Monday, 2 April 2018, at 10:22:51 CEST, suzuki toshiya wrote:
> > if it is not the time to put "image_list" into cpp frontend
> It is ok, actually i know someone else that wanted to do that.
FWIW, if Suzuki is still interested, I would be very happy to extract
images via
On Sun, Aug 19, 2018 at 11:23 PM, Albert Astals Cid wrote:
> On Monday, 12 February 2018, at 16:03:35 CEST, Jeroen Ooms wrote:
>> We have been working on a .travis configuration file to automatically
>> test poppler feature branches on various linux and
FYI the encoding problems still exist in the master branch today. I am
very interested in this patch by mpsuzuki, what can we do to move this
forward?
On Wed, Mar 28, 2018 at 2:26 PM, suzuki toshiya wrote:
> Dear Adam,
>
> Adam Reichold wrote:
>>> I see. where
On Mon, Mar 26, 2018 at 10:06 PM, Albert Astals Cid wrote:
> On Sunday, 25 March 2018, at 5:39:18 CEST, suzuki toshiya wrote:
>> Hi all,
>>
>> Finally I think I found the root of issue and I can propose a fix.
>> pre-patch situation is like this:
>>
On Sun, Mar 25, 2018 at 5:39 AM, suzuki toshiya wrote:
> My fix consists of 2 parts.
>
> part 1)
> I replaced all detail::unicode_GooString_to_ustring() by ustring::from_utf8(),
> this was suggested by Adam.
>
>
On Thu, Mar 22, 2018 at 8:53 AM, suzuki toshiya wrote:
> Dear Jeroen,
>
> Please check https://github.com/mpsuzuki/poppler/tree/for-travis whether it
> can serve for you.
Yes this works! I now get:
pkg-config --libs-only-l poppler-cpp
-lpoppler-cpp
pkg-config
Thanks everyone for the work on this issue, really appreciate the
input. Also excited about mpsuzuki's suggestion to include font data
with the text_list, this will be super helpful.
I have updated and cleaned my example code a little bit to make it
easier to test these issues. The updated test
Currently pkg-config does not correctly list the dependency libs for
static linking when running with --static:
pkg-config --libs --static poppler-cpp
-lpoppler-cpp -lpoppler
The output of --static should also include the recursive dependencies
such as -lcairo -llcms2 -lopenjp2 -ltiff. For
On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold wrote:
> Hello mpsuzuki,
>
> from a glance at the code, it seems page::text uses ustring::from_utf8
> to convert Poppler's GooString into ustring which seems correct if
> GlobalParams::textEncoding has its default value of
wrong here?
On Mon, Mar 5, 2018 at 3:10 PM, Jeroen Ooms <jer...@berkeley.edu> wrote:
> I'm testing the new page::text_list() function but I run into an old
> problem where the conversion of the ustring to UTF-8 doesn't do what I
> expect:
>
> byte_array buf = x.to_utf8();
I'm testing the new page::text_list() function but I run into an old
problem where the conversion of the ustring to UTF-8 doesn't do what I
expect:
byte_array buf = x.to_utf8();
std::string y(buf.begin(), buf.end());
const char * str = y.c_str();
The resulting char * is not UTF-8. It
On Mon, Feb 12, 2018 at 3:04 PM, Ihar Filipau <thephil...@gmail.com> wrote:
> On 2/12/18, Jeroen Ooms <jer...@berkeley.edu> wrote:
>> On Sun, Feb 11, 2018 at 12:11 PM, Albert Astals Cid <aa...@kde.org> wrote:
>>> You're never assigning to tv_nsec in there
We have been working on a .travis configuration file to automatically
test poppler feature branches on various linux and osx configurations.
Perhaps this may be interesting to other poppler users as well.
The example ".travis.yml" file can be copied from:
On Sun, Feb 11, 2018 at 12:11 PM, Albert Astals Cid wrote:
> You're never assigning to tv_nsec in there but still use it in a comparison,
> that needs fixing.
You are right. I think we should compare modification time only by
seconds. The standard definition of 'struct stat' only
After recent changes (after 0.62) the master branch no longer builds
on macos. The issue is that the statbuf struct does not have an
"st_mtim" field on macos:
[ 0%] Building CXX object CMakeFiles/poppler.dir/goo/gfile.cc.o
/Users/jeroen/Desktop/popplergit/goo/gfile.cc:690:34: error: no member
Is there a method in poppler-cpp to extract text from a pdf document,
including the position of each text box? Currently we use page->text()
with page::physical_layout which gives all text per page, but I need
more detailed information about each text box per page.
On Mon, Nov 13, 2017 at 12:10 AM, Albert Astals Cid <aa...@kde.org> wrote:
>
> On Tuesday, 7 November 2017, at 11:29:25 CET, Jeroen Ooms wrote:
> > Would it be possible to mention the R package 'pdftools' [1] on the
> > poppler website [2] under pr
Would it be possible to mention the R package 'pdftools' [1] on the
poppler website [2] under programs using poppler? The R package is quite
popular among (data) scientists to extract text and data from pdf
documents such as scientific papers or public records.
[1]
On Wed, Nov 1, 2017 at 9:31 PM, Jeroen Ooms <jer...@berkeley.edu> wrote:
> Hmm that may be the function I am looking for but I don't understand
> where I should pass a GlobalParams object when reading a pdf file. I
> tried setting the 'globalParams' global when loadin
On Thu, Nov 2, 2017 at 12:23 AM, Albert Astals Cid wrote:
> Email here or use bugzilla.
OK, attached is the patch.
staticlib.patch
Description: Binary data
___
poppler mailing list
poppler@lists.freedesktop.org
On Wed, Nov 1, 2017 at 3:51 PM, Jason Crain wrote:
>
> I don't know how you use poppler in your project, but you may also have
> the option of passing in the path when you construct the GlobalParams
> object.
Hmm that may be the function I am looking for but I don't
On Wed, Nov 1, 2017 at 3:20 PM, Jason Crain wrote:
> On Mac and Linux the path is hardcoded at compilation time. It's
> generally in /usr/share/poppler on Linux. Not sure what the standard is
> on Mac. On Windows it looks in \share\poppler relative to the
>
I maintain the poppler bindings for R which work on Windows, MacOS and
Linux. However Chinese users on Windows/Mac have reported that
poppler doesn't find the share data files:
error: Missing language pack for 'Adobe-CNS1' mapping
Where exactly does poppler look for the 'share' directory? Is
Several projects use static builds of poppler-cpp to ship standalone
pdf applications, but since the switch to cmake it is no longer
possible to build static libs.
Setting -DBUILD_SHARED_LIBS=OFF in cmake only builds a static
libpoppler.a, however libpoppler-cpp still gets built as a dynamic
On Tue, Oct 3, 2017 at 11:40 PM, Albert Astals Cid wrote:
>
> You should probably open a bug since i don't think this is something that
> is going to get fixed fast, unless maybe you can just workaround it by
> building it twice? (i know it sucks build stuff twice)
>
OK I have
On Tue, Oct 3, 2017 at 12:00 AM, Albert Astals Cid wrote:
>
> * cmake is now the default build system
> * autotools based build system has been removed
>
After upgrading, homebrew no longer ships static libs :( Is there a way to
make cmake produce both static and shared libs?
On Wed, Sep 6, 2017 at 7:38 PM, Albert Astals Cid wrote:
>> The solution would be to ensure d05l.pfb is available, but whose
>> responsibility should that be -- the poppler library, the client
>> application that uses poppler, or the individual end user?
>
> Not poppler, we're
On Wed, Sep 6, 2017 at 9:10 AM, Jonathan Kew wrote:
> On 05/09/2017 21:03, Albert Astals Cid wrote:
>>
>> ./GlobalParamsWin.cc:102:{"ZapfDingbats", "d05l.pfb",
>> "wingding.ttf", gTrue},
>>
>> This is the substitution table, i guess those files are not
A user has reported an issue [1] with a pdf incorrectly rendering on
windows. Poppler gives the following which is probably the reason that
the dots do not render correctly:
> Warning: error: Couldn't find a font for 'ZapfDingbats'
I have tried building poppler --with-font-configuration=win32
Some users of the R bindings have requested a way to extract
hyperlinks from a pdf file. However it seems that currently this
functionality is only available in the qt api, but not in the cpp api?
Some more debugging and a bug report regarding this issue:
https://bugs.freedesktop.org/show_bug.cgi?id=101385
On Sun, Jun 11, 2017 at 1:51 PM, Jeroen Ooms <jer...@berkeley.edu> wrote:
> If the user enters an incorrect password when reading a protected pdf via
> document::load_f
If the user enters an incorrect password when reading a protected pdf via
document::load_from_raw_data() an error is printed to stdout
error: Incorrect password
However the load_from_raw_data() does not raise an exception and returns a
valid *document. However this document segfaults when we
When rendering a black and white pdf file using
poppler::page_renderer, the image always comes out as
image::format_argb32 rather than image::format_mono.
Therefore the png is 4x larger than expected. Is this expected or do I
manually need to set the image format somewhere in the renderer?
My
On Tue, Mar 8, 2016 at 2:34 PM, Jeroen Ooms <jeroen.o...@stat.ucla.edu> wrote:
> When extracting text from a landscape pdf file using the cpp
> interface, text at the far right of the page does not get extracted. I
> think the problem is that page.text() always assumes portrai
When extracting text from a landscape pdf file using the cpp
interface, text at the far right of the page does not get extracted. I
think the problem is that page.text() always assumes portrait
orientation and hence underestimates the width of the page:
p->text()
p->text(p->page_rect())
Is
The pkgconfig file for poppler does not contain the configured dependencies
required for static linking:
> pkg-config --libs --static poppler-cpp
-L/usr/local/Cellar/poppler/0.41.0/lib -lpoppler-cpp -lpoppler
This is certainly incomplete. Correct output (in my case) would be
something
I am trying to get the same (or similar) text output from the c++ interface
as when using the 'pdftotext' utility without the -layout option.
However raw_order_layout gives malformed output (no text at all for most
pages):
ustring str = p->text(p->page_rect(), page::raw_order_layout);
An
We are using poppler for parsing and indexing scientific articles. For
this purpose I wrote some bindings to poppler-cpp for the R
programming language. A few questions:
- Many of our pdf files give parsing errors, such as "Failed to get
object num from hint tables" or "Expected the optional