Re: Simplest way to draw a circle, similar to addRect

2024-11-27 Thread Peter Murray-Rust
> circle/arc? > > Chances for a one-liner or something similarly simple? > > > > I use PDFBox 3.0.3. > > > > Thank you for your hints! > > Reg > > > > > > ----- > > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org > > For additional commands, e-mail: users-h...@pdfbox.apache.org > > > > > -- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

Re: How can i judge a PDF is a Scanned PDF?

2024-11-23 Thread Peter Murray-Rust
out if you see a certain number of > pages with no text.) > > Brian

Re: Text extraction from a certain PDF does not seem to terminate

2024-04-03 Thread Peter Murray-Rust

Re: installing and running PDFBox within Python

2022-10-02 Thread Peter Murray-Rust
form; then linking into a global knowledge graph. P. On Sun, Oct 2, 2022 at 10:07 AM Tilman Hausherr wrote: > On 26.09.2022 10:53, Peter Murray-Rust wrote: > > * Does PDFBox3 have more functionality than PDFBox2 that would help? > > I don't think so, the main thing is the o

installing and running PDFBox within Python

2022-09-26 Thread Peter Murray-Rust
s! P. -- "I always retain copyright in my papers, and nothing in any contract I sign with any publisher will override that fact. You should do the same". Peter Murray-Rust Reader Emeritus in Molecular Informatics Yusuf Hamied Department of Chemistry University of Cambridge CB2 1EW, UK +44-1223-336432

Re: Table Extraction

2020-10-13 Thread Peter Murray-Rust
> same? > Any guidance would be much helpful. > > -- > Thanks & Regards > Kaushlendra Singh > Email: singh.kaushlendra...@gmail.com > Phone: +91 8377094564

Re: Paragraph identification in apache pdf box

2020-08-12 Thread Peter Murray-Rust
>>> Thanks & regards, > > >>> Aravind Swarna

Re: Detection of chess figure characters

2020-04-15 Thread Peter Murray-Rust

PDFBox and COVID-19

2020-03-26 Thread Peter Murray-Rust
is an encouragement to all of us. P.

Re: Extracting graphics primitives by subclassing PageDrawer

2020-01-01 Thread Peter Murray-Rust

Re: Extracting graphics primitives by subclassing PageDrawer

2019-12-31 Thread Peter Murray-Rust

Extracting graphics primitives by subclassing PageDrawer

2019-12-30 Thread Peter Murray-Rust

Re: AW: Finding a Box containing text

2019-09-20 Thread Peter Murray-Rust
I do a lot of this and there is no generic way. The rect might be a rect or 4 lines or a polyline 3 or 4 (or 5 for overlaps). It might be drawn twice for emphasis. I have some heuristics for creating probable rects in http://github.com/petermr/ami3 If you are serious and doing a *lot* I c

Re: No Unicode mapping for xx (xx) in font null

2019-04-04 Thread Peter Murray-Rust

Re: No Unicode mapping for xx (xx) in font null

2019-04-01 Thread Peter Murray-Rust
in PDF should insist that a Unicode font is used. Better still avoid PDF.

Re: Extract bold text from a PDF file

2019-03-18 Thread Peter Murray-Rust
> > > text.getFont().getFontDescriptor().getFontWeight() ); // returns 0.0. > > > > > 4. System.out.println( getGraphicsState().getLineWidth() ); // > > > > > returns > > > > > 1.0. > > > > > 5. System.out.println( > > > > > getGraphicsState().getTextState().getRenderingMode() ); // returns > > > > > FILL >

Re: Extract embedded SVG image from PDF file

2019-03-05 Thread Peter Murray-Rust
t 10:34 PM European Neuroscience Center < mnachev.nscenter...@gmail.com> wrote: > Hi, > > What is the way to extract an embedded image, which is in SVG format from > an PDF file using PDFBox? > > If there is no such option, how to determine from where the embedded SVG >

Re: Fwd: Apache in 2018 - By The Digits

2019-01-09 Thread Peter Murray-Rust
fic activities or programs. - > >> Targeted Platinum: DLA Piper, Microsoft, Oath, OSU Open Source Labs, > >> and Sonatype. - Targeted Gold: Atlassian, The CrytpoFund, Datadog, > >> PhoenixNAP, and Quenda. - Targeted Silver: Amazon Web Services, > >> Hot

Re: Extracting rotated text

2017-09-25 Thread Peter Murray-Rust
> > As > > se > > ss > > m > > en > > > > Thank you!

Re: drawing arrow in content stream

2017-08-17 Thread Peter Murray-Rust

Re: Tabular Data Extracting

2017-05-14 Thread Peter Murray-Rust
y/ > > You can use PDFBox if you know the positions in advance, then search in > the source code examples for ExtractTextByArea. > > Tilman

Re: What it feels like to be an open-source maintainer

2017-03-06 Thread Peter Murray-Rust

Re: Extracting vector graphics from pdf

2017-02-28 Thread Peter Murray-Rust
> > > > > > Thanks! > > > > Regards, > > > > Eli

What is a "bead" and how is it created/used?

2016-12-29 Thread Peter Murray-Rust
aware software? What is the abstract model of a bead in a reader? * are there other ways of transmitting "chunks" other than beads? TIA P.

Re: Identify not visible characters - Overlapped characters

2016-12-29 Thread Peter Murray-Rust
oherent analysis when a table is larger than one > > page, for that reason Tabula is far from being a good tool for text > > extraction with correct positioning. > > We always welcome bug reports (and patches!) :) [1] > > Thanks! > > [1] https://github.com/tabulapdf/tabula-java/issues > > > -- > Manuel Aristarán > http://jazzido.com

Re: Getting the PDF box to work

2016-03-07 Thread Peter Murray-Rust
: > > > > hello > > Can someone point me to a step by step guide to using this please? > > I have made it available under a project in Eclipse - but can't see any > > code. > > Regards > > Gopi

Re: Strip Data out of PDF and save only skeleton.

2015-10-31 Thread Peter Murray-Rust
> Please share any custom solutions or ideas if any !! >>>> >>>> Thanks

Re: Can XML file convert into pdf using pdfbox?

2015-06-05 Thread Peter Murray-Rust
> Thanks > Mrunal

Re: Problem reading PDF: encrypted document and unknown compression method

2015-04-19 Thread Peter Murray-Rust
Thank you On Sun, Apr 19, 2015 at 12:03 PM, Tilman Hausherr wrote: Am 19.04.2015 um 12:29 schrieb Peter Murray-Rust: > > Did you decrypt the file? Did you either load the file with loadNonSeq(), > or with load() and then call openProtection()? > I thought I had used loadNonSeq(

Problem reading PDF: encrypted document and unknown compression method

2015-04-19 Thread Peter Murray-Rust
yption, or a broken PDF that Adobe can somehow read or some other problem? Many thanks

Re: pdfbox warnings

2015-04-07 Thread Peter Murray-Rust
On Tue, Apr 7, 2015 at 7:49 AM, John Hewson wrote: > > > > On 6 Apr 2015, at 09:49, Peter Murray-Rust wrote: > > > ... > > > > PDFBox relies on the target OS distribution to include some of the 14 > > fonts. Since Windows doesn't have ZapfDingba

Re: pdfbox warnings

2015-04-06 Thread Peter Murray-Rust
ve to differentiate between "licence to use" and "licence to redistribute"

Re: pdfbox warnings

2015-04-06 Thread Peter Murray-Rust
created. > > > > I will test that ExternalFonts.addSubstitute when I get time, as another > > workaround.

Interpreting vector and pixel glyphs for characters

2015-03-24 Thread Peter Murray-Rust
occurrences. It's early days, but if people are interested in collaborating or have better solutions we'd be interested (we aren't able to help with casual problems). P.

Re: Re: Looking for some guidance on using PDFBox to analyze page content

2015-03-23 Thread Peter Murray-Rust
JECT: > Re: Looking for some guidance on using PDFBox to analyze > page content > > DATE: > 2015-03-20 10:08 > > FROM: > Peter Murray-Rust > > TO: >

Re: ask for help

2015-03-23 Thread Peter Murray-Rust
e you can provide me some source codes extracting pdfs using PDFbox. Not > just stripper.getText(). > Thanks a billion!!! I hope you write to me soon!!! > sincerely, > > dock CHEN

Re: Looking for some guidance on using PDFBox to analyze page content

2015-03-20 Thread Peter Murray-Rust
at) > > Regards, > > WARREN GALLAGHER - CTO > > warren.gallag...@apxconsult.com > > M: 613-791-4987 W: 613-262-2601 Advance Property eXposure Canada Inc. > 1755 Woodward Drive, Suite 101, Ottawa, Ontario K2C 0P9 APXConsult.com > [1] > > Links: > -- > [1]

Re: PDF extraction

2015-02-02 Thread Peter Murray-Rust
3. $0.00$100 > >> Is there a way with PDFbox to extract a specific value(s) from the > table? > >> Example: Bank Of America and $0.00 > >> And also is there a way to cut the whole table and paste it into a > >> different PDF? > >> Please let m

Non-unicode characters

2015-01-27 Thread Peter Murray-Rust
d - but the task is finite if there is only one font.

Re: Character widths in fonts

2014-11-20 Thread Peter Murray-Rust
Acrobat Pro for > figuring out thorny issues, but I know that’s not an option for > everyone. > > Yes, I deliberately avoid it :-(

Character widths in fonts

2014-11-20 Thread Peter Murray-Rust
Is it allowed to have the same name for 2 different fonts (it would be very bad...) (c) How does PDFTextStripper calculate spaces? From the Font, or by some other heuristics? (d) is there a debugging tool on PDFBox I could be using for this sort of problem? Many thanks. P.

Re: Is sub-heading extraction possible?

2014-11-08 Thread Peter Murray-Rust
tioned I > can grab many templates such as IEEExplore, Spring etc. > > That's a very good point. If we can identify the authoring template we may be able to create the reverse engineering.

Re: Is sub-heading extraction possible?

2014-11-08 Thread Peter Murray-Rust
> Thanks. > > Best regards, > Mehmet > > > > -Original Message- > From: peter.murray.r...@googlemail.com [mailto: > peter.murray.r...@googlemail.com] On Behalf Of Peter Murray-Rust > Sent: Thursday 6 November 2014 7:38 PM > To: users@pdfbox.apache.org > Sub

Re: Is sub-heading extraction possible?

2014-11-06 Thread Peter Murray-Rust
> it possible to detect > Headings and sub-headings? More specifically, is it possible to extract > only introduction > Part or conclusion part? > > Thanks in advance. > > Best, > Mehmet

Re: Extracting text into paragraphs

2014-10-31 Thread Peter Murray-Rust
a starting point for you in that > it > > looks for graphic boxes drawn around text to identify table headings. > > > > Frank > > > > On Thu, Oct 30, 2014 at 6:27 AM, Ken Bowen wrote: > > > > > You may want to get in contact with Peter Murray-Rust(

Re: Extracting text into paragraphs

2014-10-29 Thread Peter Murray-Rust
, not characters) will sometimes be used. If you are going to do a lot of this, with a single source of documents it may be worth investing in creating some of these heuristics. But it will still be work, unfortunately. We are gradually building up this sort of approach in http://bitbucket.org/pet

Re: Regarding Table in PdfBox

2014-10-14 Thread Peter Murray-Rust
: > Hi , > How to identify table using PDFBOX . And extract text from it . > Please help me with the idea . > > Thanks > Borris

Re: problem with pdf eof

2014-10-10 Thread Peter Murray-Rust
the hard work done by list members in writing PDFBox. Because the process is now legal in UK there is more incentive to develop and publish downstream analytic tools and that's what we are doing (Apache2-Open, of course).

Re: problem with pdf eof

2014-10-10 Thread Peter Murray-Rust
m glow of having helped the human race. Same goes for tables and document structuring... BR P

Re: How to automatically evaluate the quality of the text extraction result by PDFBox?

2014-05-10 Thread Peter Murray-Rust
text from some PDF files, the question I want to ask is > that: is there a way to automatically evaluate the quality of text > extraction result? Or can PDFBox offer a confidence score about the > extracted text result? > > Regards,

Re: How to define regions in PDFTextStripperByArea?

2014-05-04 Thread Peter Murray-Rust
rks fine for simple texts. It gets more complicated and > may lead to a false result if one of the following is used: > > - different text sizes in the same line > - different font sizes in the same line > - super/subscripts > - multicolumns > - > > BR > Andreas Lehmkühler

Re: OCR and PDFBox/PDF2SVG

2014-04-22 Thread Peter Murray-Rust
x as part of a GSoC engagement. > > Maybe that’s what you are looking for? > > BR > Maruan > > Am 22.04.2014 um 15:39 schrieb Peter Murray-Rust : > > > We have a need to carry out limited OCR in the PDF extraction process and > > are thinking of adding it to PDF2S

OCR and PDFBox/PDF2SVG

2014-04-22 Thread Peter Murray-Rust
ther have a solution (which would save us going further) or to see if anyone is interested in using such a facility [Note that this is feasible mainly because the source is born-digital and binarized (0/1) and so does not suffer from scanning artefacts such as skewing, contrast, noise, etc.] P. -

Re: Eliminating super scripts while extracting text from pdf

2014-03-31 Thread Peter Murray-Rust
> Thanks. > > As suggested, I have gone through the links provided, but unfortunately > could not get to the heuristics to detect the subsuperscripts. > > If possible, please attach or provide a link that can publicly be accessed. > > Appreciate your help. > > > On Sat, Ma

Re: Eliminating super scripts while extracting text from pdf

2014-03-29 Thread Peter Murray-Rust
as normal text. > > > > A superscript to a word, which is the last word of a sentence, has been > > placed after the period(.) > > > > ex: Word: "test" with superscript "super" > > When it appeared at the end of a sentence, has been e

Re: 2 questions

2014-03-08 Thread Peter Murray-Rust
'm making is a bit more advanced than the one embedded in > PDFBox as it creates a list of > couples (XY position of a word, contents of a word) and not just give the > list of words. > I do this in two stages - translate all chars to SVG (PDF2SVG) and in a separate project (SVG2XML)

Re: 2 questions

2014-03-08 Thread Peter Murray-Rust
> > > BR > > Maruan Sahyoun > > > > Am 07.03.2014 um 18:24 schrieb HQS : > > > >> Thank you all for those accurate answers. > >> I will give a try to the geometrical approach based on the (x, y) > coordinates of the characters. > >> > >

Re: 2 questions

2014-03-07 Thread Peter Murray-Rust
> But this is not an issue, my problem is more the fact that this method may > not be 100% reliable. What do you think ? > We are committed to solving it for English-language science and European personal names. The worst case is probably slanted text in diagrams. > > As for the technical par

Re: 2 questions

2014-03-06 Thread Peter Murray-Rust
king for? > >>> > >>> BR > >>> Maruan Sahyoun > >>> > >>> Am 06.03.2014 um 18:39 schrieb HQS : > >>> > >>>> Hello all, > >>>> > >>>> 1. > >>>> Have you ever seen PD

Re: Query regarding pdfbox and android compatibility

2014-02-06 Thread Peter Murray-Rust
r. >> Thus, I wanted to know if this is still an issue or its solved by now? >> >> thanks and regards,

Re: PDF to Text problems

2014-02-04 Thread Peter Murray-Rust
PDF (which uses PDFBox) - http://tabula.nerdpower.org/ - is among the most advanced open source projects. I do some of this myself in https://bitbucket.org/petermr/ami2. We hope to pool our software and experiences so we don't all have to reinvent algorithms and heuristics. It's mindbogglingly tedious to do this. > > /Johnny

Re: Parsing a pdf file takes 3minutes

2013-12-23 Thread Peter Murray-Rust
ue requires 3minutes. > > How/why is this possible? How can I improve on this? > > Any help appreciated > Clemens

Re: How to convert a text to curve?

2013-09-25 Thread Peter Murray-Rust
e.g. set a permission in PDF to > disallow text extraction > http://pdfbox.apache.org/apidocs/org/apache/pdfbox/pdmodel/encryption/AccessPermission.html > > > > BR > > > > Maruan Sahyoun > > > > Am 24.09.2013 um 17:10 schrieb daijun <16360...@qq.com>: >

Re: HTML to PDF

2013-09-24 Thread Peter Murray-Rust
've not used FOP directly either so can't comment further. > > > On 24 September 2013 16:02, Henry, Chad >wrote: > > > Is it possible to convert an HTML document to a PDF using pdfbox? Thanks > > > > Chad Henry > > System Design and Development > > 71

Error somewhere in jbig2/PDPixelMap

2013-05-23 Thread Peter Murray-Rust
ter.java:190) at org.xmlcml.pdf2svg.PDFPage2SVGConverter.convertPageToSVG(PDFPage2SVGConverter.java:176)

Re: Determining whether character/font is bold

2013-05-13 Thread Peter Murray-Rust
On Mon, May 13, 2013 at 8:52 PM, Maruan Sahyoun wrote: > Hi Peter, > > Am 13.05.2013 um 19:44 schrieb Peter Murray-Rust : > > > Thanks for answering - any help is valuable. > > > > On Mon, May 13, 2013 at 6:02 PM, Maruan Sahyoun >wrote: > > > > > &

Re: Determining whether character/font is bold

2013-05-13 Thread Peter Murray-Rust
. The only practical answer seems to be crowdsourcing or to read the glyphs and find out what is going on. Which is why it is useful to know where the fontWeight is applied. P Cheers, E.

Re: Determining whether character/font is bold

2013-05-13 Thread Peter Murray-Rust
ho can give help. P. On Mon, May 13, 2013 at 11:55 AM, Peter Murray-Rust wrote: > > > > On Mon, May 13, 2013 at 11:14 AM, Maruan Sahyoun > wrote: > >> Hi Peter, >> >> which version of PDFBox are you using? If you are still on 1.7 I'd >> suggest

Re: Determining whether character/font is bold

2013-05-13 Thread Peter Murray-Rust
'll upgrade and report. However these fonts are so non-conformant I am only semi-optimistic. Assuming it doesn't work, is there still a way of determining bold? P. > BR > Maruan > > Am 13.05.2013 um 10:55 schrieb Peter Murray-Rust : > > > I am dealing with a number of (hig

Determining whether character/font is bold

2013-05-13 Thread Peter Murray-Rust
often does not. At present I simply compile a list of fonts, so any help welcomed.

Re: Subscript/Superscripts

2013-05-10 Thread Peter Murray-Rust
life easier. > > I am new to the world of PDFBox and the details of fonts so feel free to > start at the beginning and drive slowly. :) > > Thanks, > Buzzy

Re: Text Extraction with Formatting

2013-05-07 Thread Peter Murray-Rust
nput - it's a communal OpenSource project. Not ready for general use, especially when people don't understand there is an inevitable error rate (although small). We'd be delighted to hear from anyone needing this but at present you need to be able to understand running java

Re: know the approach to develop program

2013-04-19 Thread Peter Murray-Rust
rning it. If you are interested in PDFBox as a reader then our http://bitbucket.org/petermr/pdf2svg-dev may be a useful example.

Re: Fwd: Junk Characters while Extracting text from pdf file.

2013-02-06 Thread Peter Murray-Rust
and other way to go on this. > > Regards, Kulbhushan

Re: Fwd: Junk Characters while Extracting text from pdf file.

2013-02-05 Thread Peter Murray-Rust
everal TeX fonts (CMM etc.) but haven't done a Ghostscript one and it would be useful. But as Andreas says, ultimately these are probably non-conformant. A mixture of heuristics and glyph analysis (OCR and/or heuristics) is required. Again PDF2SVG is addressing these - any community involvement i

Re: retrieving graphical coordinates

2013-01-30 Thread Peter Murray-Rust
http://bitbucket.org/petermr/pdf2svg and other sibling projects. We aim to extract high-level graphical objects from these paths. It's all F/OSS - you are welcome to use what we have so far. I am not aware of others doing the same at least in Open projects. P. > Florian

Re: Text for Ebook Readers

2013-01-27 Thread Peter Murray-Rust
at creating PDF from Word, LaTeX, etc. usually *destroys* information. I wish university libraries didn't do it. But that's out of scope here...

Re: How to ensure a PDF is valid

2013-01-22 Thread Peter Murray-Rust
ry valuable - and please feel free to fork and develop it. FWIW the next phase (SVGPlus) uses heuristics to recreate paragraphs and other objects (super/subscripts, maths equations, tables, semantic graphs). The third phase turns these into semantic chemistry, biology, etc. - all from the PDF.

Re: Font properties

2012-12-28 Thread Peter Murray-Rust
ou can rely on keywords occurring at predictable places). If the corpus has varied sources and also covers a range of years this often introduces a lot of variation. > Does PDFBOX can get the font properties? There is another way to do it?? > > Thanks in advance > > Fernando Almeida &

Spaces (char 32) in output of PDFBox

2012-11-29 Thread Peter Murray-Rust
graph" consisting only of a space. I was unaware that PDF supported spaces - are these coming from the original document or are they generated in PDFBox from calculations of character spacing and width? TIA for help. P.

Re: ANN: AMI2-PDF2SVG conversion of PDF to semantic characters and graphics

2012-11-17 Thread Peter Murray-Rust
Q5 & Mobile > b. http://technoracle.blogspot.com > t. @duanechaos > "Don't fear the Graph! Embrace Neo4J"

ANN: AMI2-PDF2SVG conversion of PDF to semantic characters and graphics

2012-11-16 Thread Peter Murray-Rust
am sure some of you will have faced the same problems and any (even partial) solutions will be useful. PDF2SVG is beta; the others are being refactored to alpha. PDF2SVG may, of course, be of use in other disciplines - character processing is configurable through external files. Enjoy

Re: extracting text from image using pdfbox

2012-10-14 Thread Peter Murray-Rust
t then you have a chance.

Re: extracting text from image using pdfbox

2012-10-12 Thread Peter Murray-Rust
change frequently.

Re: Extracting vector graphics from PDF

2012-05-08 Thread Peter Murray-Rust
ordinate, but this varies slightly because of the different glyph origins. > ** > > May be I’ll download new version and look deeper… > > ** > I, for one, would be grateful if you did! I thought I was miscompiling/omitting some resource, etc. which caused different o

Re: Extracting vector graphics from PDF

2012-05-07 Thread Peter Murray-Rust
; > ** ** > > Andrey > > ** ** > > ** ** > > ** ** > > ** ** > > *Von:* peter.murray.r...@googlemail.com [mailto: > peter.murray.r...@googlemail.com] *Im Auftrag von *Peter Murray-Rust > *Gesendet:* Montag, 7. Mai 2012 15:24 > > *An:* Andrey Kuznetsov &

Re: Extracting vector graphics from PDF

2012-05-07 Thread Peter Murray-Rust
ds > > ** > I shall probably create a hack of some kind. I can find a sans-serif and serif which are "fairly close" and substitute them. How would I get a system COSDictionary I could substitute? I am mainly interested in: * the identity of the characters * the font metr

Re: Extracting vector graphics from PDF

2012-05-07 Thread Peter Murray-Rust
Meanwhile I have rerun my code with pdfbox-1.7.0-SNAPSHOT and note that the final SVG contains ONLY paths and no text. Many thanks.

Re: Extracting vector graphics from PDF

2012-05-07 Thread Peter Murray-Rust
l; } I notice now that this call is used in drawString so that might explain why there is no font information. Is it worth changing to 1.7.0?

Re: Extracting vector graphics from PDF

2012-05-07 Thread Peter Murray-Rust
ception) { exception.printStackTrace(); } }

Re: Extracting vector graphics from PDF

2012-05-07 Thread Peter Murray-Rust
I am quite prepared to work with the glyphs as there are some documents where, I think, only glyph information is provided so I have to do some analysis there. Peter

Re: Extracting vector graphics from PDF

2012-04-26 Thread Peter Murray-Rust
On Mon, Apr 2, 2012 at 2:58 PM, Peter Murray-Rust wrote: > > > On Mon, Apr 2, 2012 at 2:51 PM, Andrey Kuznetsov wrote: > >> Peter, you have to pass your own Graphics2D object (with some overridden >> methods) to pdfbox. >> >> I am making good progress in ca

Re: Extracting vector graphics from PDF

2012-04-02 Thread Peter Murray-Rust
> > > -Ursprüngliche Nachricht- > Von: peter.murray.r...@googlemail.com > [mailto:peter.murray.r...@googlemail.com] Im Auftrag von Peter Murray-Rust > Gesendet: Montag, 2. April 2012 15:27 > An: users@pdfbox.apache.org > Betreff: Re: Extracting vector graphics

Re: Extracting vector graphics from PDF

2012-04-02 Thread Peter Murray-Rust
which I was able to interpret, but it was disjoint from the stream. Is it possible to examine the Graphics2D in the debugger? When you say "Graphics2D" do you mean Java 2D or is there a PDFBox graphics engine? If so what is it called :-) P.

Extracting vector graphics from PDF

2012-04-02 Thread Peter Murray-Rust
are there others in the PDF-hacking community who also want to extract graphics? My own interest is scientific and technical graphs, tables, diagrams, etc.