subject:"text extraction"

[jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0

2015-11-09 Thread John Hewson (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997389#comment-14997389 ] John Hewson commented on PDFBOX-3062: - That's good news. > Text ex

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-11-09 Thread ASF subversion and git services (JIRA)

3044: - Commit 1713474 from [~lehmi] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1713474 ] PDFBOX-3044: add *txt files to rat config > Improve text extraction tests > - > > Key: PDFBOX-3044 > URL: https://issues.

[jira] [Updated] (PDFBOX-3044) Improve text extraction tests

2015-11-09 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3044: Component/s: Text extraction > Improve text extraction te

[jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0

2015-11-09 Thread Tilman Hausherr (JIRA)

yway, probably due to the recent change that the fontBBox is taken from the dictionary and not from the font. I have not made a change to use CapHeight. This is done in 1.8 only. > Text extraction and height different in 2.0 > --- > >

[jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0

2015-11-09 Thread Maruan Sahyoun (JIRA)

e noticed yesterday that many files I had marked for further review are now good. {quote} Are they good with the change to use CapHeight etc., or were they already good anyway? > Text extraction and height different in 2.0 > --- > >

[jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0

2015-11-09 Thread Tilman Hausherr (JIRA)

the result is that text that is on different lines is extracted as being on the same line. Yes, we could of course get a real BBox by going through the glyphs like I recently did for type 3 fonts. But that would make text extraction slower. At this time I'm not saying that anything shoul

[jira] [Comment Edited] (PDFBOX-3062) Text extraction and height different in 2.0

2015-11-08 Thread John Hewson (JIRA)

o get the bounds of the GeneralPath if we want visual bounds. was (Author: jahewson): Those BBox values are pretty reasonable though, certainly not implausible. Neither CapHeight nor XHeight make sense as substitutes for BBox - we know those values will always be smaller. > Text extractio
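The idea of taking visual bounds from a glyph outline can be sketched with the JDK's own `java.awt.geom.GeneralPath`. This is only an illustration under stated assumptions: the class `VisualBounds`, the method names, and the triangular outline are hypothetical and not part of PDFBox. Note that for curved outlines `getBounds2D()` is not guaranteed to return the tightest box on every Java version, so the result may be somewhat larger than the visible shape.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

public class VisualBounds {
    /** Returns the height of the bounding box of a glyph outline. */
    public static double visualHeight(GeneralPath outline) {
        Rectangle2D bounds = outline.getBounds2D();
        return bounds.getHeight();
    }

    /** Hypothetical outline: a triangle spanning y = 0 to y = 700 in glyph units. */
    public static GeneralPath sampleGlyph() {
        GeneralPath path = new GeneralPath();
        path.moveTo(100, 0);
        path.lineTo(300, 700);
        path.lineTo(500, 0);
        path.closePath();
        return path;
    }

    public static void main(String[] args) {
        // Height derived from the outline itself, not from a font-wide BBox.
        System.out.println(visualHeight(sampleGlyph()));
    }
}
```

A per-glyph box like this is usually smaller than the font-wide FontBBox, which is the trade-off discussed in the thread: tighter heights at the cost of walking every outline.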

[jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0

2015-11-08 Thread John Hewson (JIRA)

able though, certainly not implausible. Neither CapHeight nor XHeight make sense as substitutes for BBox - we know those values will always be smaller. > Text extraction and height different in 2.0 > --- > > Key: PDFBOX-3062 >

[jira] [Updated] (PDFBOX-3062) Text extraction and height different in 2.0

2015-11-08 Thread Tilman Hausherr (JIRA)

values are more realistic than the FontBBox values, which are too large in the font of this file. > Text extraction and height different in 2.0 > --- > > Key: PDFBOX-3062 > URL: https://issues.apache.org/jira/bro

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-11-07 Thread ASF subversion and git services (JIRA)

3044: - Commit 1713127 from [~tilman] in branch 'pdfbox/branches/1.8' [ https://svn.apache.org/r1713127 ] PDFBOX-3044: add *txt files to rat config > Improve text extraction tests > - > > Key: PDFBOX-3044 > URL: https://

[jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0

2015-11-07 Thread ASF subversion and git services (JIRA)

3062: - Commit 1713117 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1713117 ] PDFBOX-3062: add test files > Text extraction and height different in 2.0 > --- > > Key: PDFBOX-3062 > URL: h

[jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0

2015-11-07 Thread Tilman Hausherr (JIRA)

ably PDFBOX-3078) text extraction of PDFBOX-3062-N2MOQ7YZICIYGTPLQJAWJ4HLN6CCEMHZ-reduced.pdf is now good: {quote} Fraternity Members 480 Male Undergraduates 3495 Sorority Members 484 Female Undergraduates 4880 {quote} > Text extraction and height different

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-11-07 Thread ASF subversion and git services (JIRA)

3044: - Commit 1713114 from [~tilman] in branch 'pdfbox/branches/1.8' [ https://svn.apache.org/r1713114 ] PDFBOX-3044: change encoding to utf8, don't fail immediately > Improve text extraction tests > - > > Key: PDFBOX-30

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-11-07 Thread ASF subversion and git services (JIRA)

3044: - Commit 1713113 from [~tilman] in branch 'pdfbox/branches/1.8' [ https://svn.apache.org/r1713113 ] PDFBOX-3044: change encoding to utf8 > Improve text extraction tests > - > > Key: PDFBOX-3044 > URL: https://issues.

[jira] [Commented] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-11-01 Thread Tilman Hausherr (JIRA)

your application as far as type 3 heights are concerned, because [ https://svn.apache.org/r1711758 ] will make all type 3 heights slightly smaller. > Text extraction getting zero font height, bad widths, and ? for text in this > PDF with Ty

[jira] [Resolved] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-11-01 Thread Tilman Hausherr (JIRA)

1.8.11 This resolves your issue. There might still be files with zero type 3 height, see PDFBOX-3076. And despite resolving, I'd like to hear how you got non-zero in your initial post. > Text extraction getting zero font height, bad widths, and ? for text in this > PDF

[jira] [Comment Edited] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-11-01 Thread Tilman Hausherr (JIRA)

5996 space=6.1737375 width=6.0899963]? String[140.72041,299.28 fs=58.0 xscale=58.0 height=5.1155996 space=6.1737375 width=3.1667938]? String[522.95984,293.28 fs=58.0 xscale=58.0 height=0.9744 space=2.6529562 width=1.4616089]? {code} > Text extraction getting zero font height, bad widths, and ? f

[jira] [Commented] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-11-01 Thread Tilman Hausherr (JIRA)

9562 width=1.4616089]? {code} > Text extraction getting zero font height, bad widths, and ? for text in this > PDF with Type 3 Fonts > -- > > Key: PDFBOX-2508 >

[jira] [Commented] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-11-01 Thread ASF subversion and git services (JIRA)

2508: - Commit 1711765 from [~tilman] in branch 'pdfbox/branches/1.8' [ https://svn.apache.org/r1711765 ] PDFBOX-2508: get font height from FontBBox item if existing method fails > Text extraction getting zero font height, bad widths, and ? for text in this > P

[jira] [Commented] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-11-01 Thread ASF subversion and git services (JIRA)

2508: - Commit 1711758 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1711758 ] PDFBOX-2508: fix bug in construction of font BoundingBox from PDRectangle > Text extraction getting zero font height, bad widths, and ? for text in this > P

[jira] [Updated] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-11-01 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2508: Affects Version/s: 2.0.0 > Text extraction getting zero font height, bad widths, and ?

[jira] [Commented] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-11-01 Thread Tilman Hausherr (JIRA)

height=0.9744 space=2.6529562 width=1.4616089]? String[129.84,347.04 fs=58.0 xscale=58.0 height=5.3592 space=5.0880046 width=2.6796112]? String[211.92,356.8801 fs=58.0 xscale=58.0 height=3.654 space=3.3609185 width=1.7052002]? {code} > Text extraction getting zero font height, bad widths,

[jira] [Commented] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-10-31 Thread ASF subversion and git services (JIRA)

2508: - Commit 1711714 from [~tilman] in branch 'pdfbox/branches/1.8' [ https://svn.apache.org/r1711714 ] PDFBOX-2508: correct calculation of glyphSpaceToTextSpaceFactor, remove misleading comment > Text extraction getting zero font height, bad widths, and ? for text in this > P

[jira] [Commented] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-10-31 Thread ASF subversion and git services (JIRA)

2508: - Commit 1711701 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1711701 ] PDFBOX-2508: correct calculation of glyphSpaceToTextSpaceFactor, remove misleading comment > Text extraction getting zero font height, bad widths, and ? for text in this > P

[jira] [Commented] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-29 Thread John Hewson (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980001#comment-14980001 ] John Hewson commented on PDFBOX-3053: - Ha! :) > Text extraction fails with

[jira] [Comment Edited] (PDFBOX-3062) Text extraction and height different in 2.0

2015-10-28 Thread John Hewson (JIRA)

at! See also my reply to [this thread|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201510.mbox/%3ccaaphlv-0z+3ssvpxi8bwvbbqrf-vthkajigwxfedbb3vke_...@mail.gmail.com%3e] Enjoy! > Text extraction and height different in 2.0 > --- > >

[jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0

2015-10-28 Thread John Hewson (JIRA)

I've deprecated PDFont#getHeight(). It wasn't used anyway. > Text extraction and height different in 2.0 > --- > > Key: PDFBOX-3062 > URL: https://issues.apache.org/jira/browse/PDFBOX-3062 >

[jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0

2015-10-28 Thread ASF subversion and git services (JIRA)

3062: - Commit 1711181 from [~jahewson] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1711181 ] PDFBOX-3062: Deprecate PDFont#getHeight() > Text extraction and height different in 2.0 > --- > > Key: PDFBOX-3062 >

[jira] [Updated] (PDFBOX-3062) Text extraction and height different in 2.0

2015-10-28 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3062: Attachment: PDFBOX-3062-N2MOQ7YZICIYGTPLQJAWJ4HLN6CCEMHZ-reduced.pdf > Text extraction

[jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0

2015-10-28 Thread Tilman Hausherr (JIRA)

3062-N2MOQ7YZICIYGTPLQJAWJ4HLN6CCEMHZ-reduced.pdf has the same problem, but with the opposite effect: due to a bad height, text that is on separate lines is put together. (Which results in an incredible mess when using the sort option) > Text extraction and height different

[jira] [Commented] (PDFBOX-3042) Bad space calculation in text extraction

2015-10-28 Thread Tilman Hausherr (JIRA)

FBOX-3067 > Bad space calculation in text extraction > > > Key: PDFBOX-3042 > URL: https://issues.apache.org/jira/browse/PDFBOX-3042 > Project: PDFBox > Issue Type: Bug >

[jira] [Commented] (PDFBOX-3042) Bad space calculation in text extraction

2015-10-28 Thread ASF subversion and git services (JIRA)

3042: - Commit 1711070 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1711070 ] PDFBOX-3042: don't multiply with horizontalScalingText, as this has already been done before > Bad space calculation in text extraction > > >

[jira] [Comment Edited] (PDFBOX-3062) Text extraction and height different in 2.0

2015-10-27 Thread John Hewson (JIRA)

hlv-0z+3ssvpxi8bwvbbqrf-vthkajigwxfedbb3vke_...@mail.gmail.com%3e] Enjoy! > Text extraction and height different in 2.0 > --- > > Key: PDFBOX-3062 > URL: https://issues.apache.org/jira/browse/PDFBOX-3062 >

[jira] [Comment Edited] (PDFBOX-3062) Text extraction and height different in 2.0

2015-10-27 Thread John Hewson (JIRA)

ent font size (with the TM + CTM taken into account). Note that the textRenderingMatrix (TRM) passed to onGlyph already has all of these calculations done for you... so use that! Enjoy! > Text extraction and height different in 2.0 > --- > >

[jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0

2015-10-27 Thread John Hewson (JIRA)

(TRM) passed to onGlyph already has all of these calculations done for you... so use that! Enjoy! > Text extraction and height different in 2.0 > --- > > Key: PDFBOX-3062 > URL: https://issues.apache.org/jira/b

[jira] [Closed] (PDFBOX-2584) Text extraction reports zero character widths

2015-10-27 Thread Tilman Hausherr (JIRA)

needed. > Text extraction reports zero character widths > -- > > Key: PDFBOX-2584 > URL: https://issues.apache.org/jira/browse/PDFBOX-2584 > Project: PDFBox > Issue Type: Bug

[jira] [Updated] (PDFBOX-3062) Text extraction and height different in 2.0

2015-10-26 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3062: Attachment: 005021-reduced.pdf > Text extraction and height different in

[jira] [Created] (PDFBOX-3062) Text extraction and height different in 2.0

2015-10-26 Thread Tilman Hausherr (JIRA)

Tilman Hausherr created PDFBOX-3062: --- Summary: Text extraction and height different in 2.0 Key: PDFBOX-3062 URL: https://issues.apache.org/jira/browse/PDFBOX-3062 Project: PDFBox Issue

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-10-25 Thread ASF subversion and git services (JIRA)

3044: - Commit 1710510 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1710510 ] PDFBOX-3044: delete possible leftover diff file > Improve text extraction tests > - > > Key: PDFBOX-3044 > URL: https://

[jira] [Updated] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-25 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3053: Issue Type: Sub-task (was: Bug) Parent: PDFBOX-3058 > Text extraction fails w

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-10-25 Thread JIRA

here the cweb test came from, or what it was meant to test, it is probably from long ago. But I am trying to add only small, one page tests on any new PDFs I add. {quote} The cweb file is part of the test files from the beginning. AFAIK it's simply a test for text extraction in general an

[jira] [Resolved] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

ext extraction fails with type 3 fonts > --- > > Key: PDFBOX-3053 > URL: https://issues.apache.org/jira/browse/PDFBOX-3053 > Project: PDFBox > Issue Type: Bug > Components: Text

[jira] [Commented] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread ASF subversion and git services (JIRA)

3053: - Commit 1710376 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1710376 ] PDFBOX-3053: use Adobe glyph list, not Zapf glyph list > Text extraction fails with type 3 fonts > --- > > Key: PDFBOX-3053 >

[jira] [Commented] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread ASF subversion and git services (JIRA)

3053: - Commit 1710374 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1710374 ] PDFBOX-3053: set to text/plain > Text extraction fails with type 3 fonts > --- > > Key: PDFBOX-3053 > URL: https://

[jira] [Commented] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread ASF subversion and git services (JIRA)

3053: - Commit 1710373 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1710373 ] PDFBOX-3053: add test files > Text extraction fails with type 3 fonts > --- > > Key: PDFBOX-3053 > URL: https://

[jira] [Commented] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

ncoding() {code} glyphList = glyphList = GlyphList.getZapfDingbats(); {code} > Text extraction fails with type 3 fonts > --- > > Key: PDFBOX-3053 > URL: https://issues.apache.org/jira/browse/PDFBOX-3053 >

[jira] [Comment Edited] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

4 PM: --- From PDType3Font.readEncoding() {code} glyphList = GlyphList.getZapfDingbats(); {code} was (Author: tilman): From PDType3Font.readEncoding() {code} glyphList = glyphList = GlyphList.getZapfDingbats(); {code} > Text extraction fails with

[jira] [Updated] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3053: Attachment: PDFBOX-3053-reduced.pdf > Text extraction fails with type 3 fo

[jira] [Updated] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3053: Attachment: (was: PDFBOX-3053-reduced.pdf) > Text extraction fails with type 3 fo

[jira] [Comment Edited] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

on /w /b /j /v /comma /k /period] {code} 2959 file: {code} /Differences [32 /space 69 /E /F 72 /H /I 78 /N /O 82 /R 84 /T /U] {code} > Text extraction fails with type 3 fonts > --- > > Key: PDFBOX-3053 > URL: https

[jira] [Updated] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3053: Attachment: PDFBOX-3053-reduced.pdf > Text extraction fails with type 3 fo

[jira] [Commented] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

/i /slash /f /colon /w /b /j /v /comma /k /period] {code} 2959 file: {code} /Differences [32 /space 69 /E /F 72 /H /I 78 /N /O 82 /R 84 /T /U] {code} > Text extraction fails with type 3 fonts > --- > > Key: PDFBOX-3053 >

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-10-24 Thread Tilman Hausherr (JIRA)

st came from, or what it was meant to test, it is probably from long ago. But I am trying to add only small, one page tests on any new PDFs I add. > Improve text extraction tests > - > > Key: PDFBOX-3044 > URL:

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-10-24 Thread Ben McCann (JIRA)

ower the number that can fail in the future to prevent regressions * Make the files roughly equivalent in length. cweb.pdf is 28 pages long and all the rest are 1 page, so the test output is almost entirely dominated by whether we make this file better or worse > Improve text extractio

[jira] [Updated] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3053: Description: Text extraction fails with the attached file. It succeeds with Acrobat Reader

[jira] [Updated] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3053: Attachment: PDFBOX-2959-reduced.pdf > Text extraction fails with type 3 fo

[jira] [Updated] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3053: Attachment: PDFBOX-3053-3YQ2UXRQBBLX5TLKSLFCUZLWXWSI2Z2U.pdf > Text extraction fails w

[jira] [Created] (PDFBOX-3053) Text extraction fails with type 3 fonts

2015-10-24 Thread Tilman Hausherr (JIRA)

Tilman Hausherr created PDFBOX-3053: --- Summary: Text extraction fails with type 3 fonts Key: PDFBOX-3053 URL: https://issues.apache.org/jira/browse/PDFBOX-3053 Project: PDFBox Issue Type

[jira] [Commented] (PDFBOX-2584) Text extraction reports zero character widths

2015-10-23 Thread JIRA

e us some more details what exactly is the issue here? > Text extraction reports zero character widths > -- > > Key: PDFBOX-2584 > URL: https://issues.apache.org/jira/browse/PDFBOX-2584 >

[jira] [Comment Edited] (PDFBOX-2584) Text extraction reports zero character widths

2015-10-23 Thread JIRA

fixed in 1.8.11 as well? Pavel is complaining about 1.8.8. I'm going to check that later > Text extraction reports zero character widths > -- > > Key: PDFBOX-2584 > URL: https://is

[jira] [Comment Edited] (PDFBOX-2584) Text extraction reports zero character widths

2015-10-23 Thread JIRA

pace=2.2251122 width=5.7788696]N > Text extraction reports zero character widths > -- > > Key: PDFBOX-2584 > URL: https://issues.apache.org/jira/browse/PDFBOX-2584 > Project: P

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-10-23 Thread Tilman Hausherr (JIRA)

u as desired? > Improve text extraction tests > - > > Key: PDFBOX-3044 > URL: https://issues.apache.org/jira/browse/PDFBOX-3044 > Project: PDFBox > Issue Type: Bug >Affects Ver

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-10-23 Thread ASF subversion and git services (JIRA)

3044: - Commit 1710270 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1710270 ] PDFBOX-3044: set to text/plan > Improve text extraction tests > - > > Key: PDFBOX-3044 > URL: https://issues.apach

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-10-23 Thread ASF subversion and git services (JIRA)

3044: - Commit 1710247 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1710247 ] PDFBOX-3044: change encoding to utf8, don't fail immediately; output diff output; use diff library; update test files to utf8 > Improve

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-10-23 Thread ASF subversion and git services (JIRA)

3044: - Commit 1710250 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1710250 ] PDFBOX-3044: change encoding to utf8, don't fail immediately; output diff output; use diff library; update test files to utf8 > Improve

[jira] [Updated] (PDFBOX-3044) Improve text extraction tests

2015-10-23 Thread Tilman Hausherr (JIRA)

seem to be UTF16 encoded. I'm having a really difficult time using these files with the tools that I typically use (git, meld, etc.) Would it be possible to change the encoding to UTF8? By @Tilman Hausherr I'm expanding this as a long term issue to improve the testing of text extract

[jira] [Updated] (PDFBOX-3044) Improve text extraction tests

2015-10-23 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3044: Summary: Improve text extraction tests (was: Test files character encoding) > Impr

[jira] [Commented] (PDFBOX-3044) Improve text extraction tests

2015-10-23 Thread ASF subversion and git services (JIRA)

3044: - Commit 1710241 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1710241 ] PDFBOX-3044: add diffutils lib for test only > Improve text extraction tests > - > > Key: PDFBOX-3044 > URL: https://

[jira] [Updated] (PDFBOX-3044) Improve text extraction tests

2015-10-23 Thread Tilman Hausherr (JIRA)

> Improve text extraction tests > - > > Key: PDFBOX-3044 > URL: https://issues.apache.org/jira/browse/PDFBOX-3044 > Project: PDFBox > Issue Type: Bug > Affects Versions: 1.8.10, 1.8.11, 2.0.0 >

[jira] [Commented] (PDFBOX-3042) Bad space calculation in text extraction

2015-10-22 Thread ASF subversion and git services (JIRA)

3042: - Commit 1710057 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1710057 ] PDFBOX-3042: remove dead code > Bad space calculation in text extraction > > > Key: PDFBOX-3042 > URL: h

[jira] [Comment Edited] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-10-22 Thread Tilman Hausherr (JIRA)

nt matrix is \[0.001 0 0 0.001 0 0]. {quote} With the file from PDFBOX-2794, 1 / 0.001 = 1000. And that is multiplied with the space width 277.832, so the base value is 277832! A Tf value of 8 means that the size is now 656. > Text extraction getting zero font height, bad widths, and ? for text
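The arithmetic in this comment can be reproduced with a small sketch. It assumes only what the comment states (a font matrix x-scale of 0.001 and a space width of 277.832) plus the PDF convention that the font matrix maps glyph space to text space. The class and method names are hypothetical, and this is not claimed to be the code changed in the related commits.

```java
public class GlyphSpaceScale {
    /** Maps a glyph-space width to text space by multiplying with the font matrix x-scale. */
    public static double toTextSpace(double glyphWidth, double fontMatrixScaleX) {
        return glyphWidth * fontMatrixScaleX;
    }

    /** The inflated value from the comment: the width multiplied by 1 / scale. */
    public static double inflated(double glyphWidth, double fontMatrixScaleX) {
        return glyphWidth * (1 / fontMatrixScaleX);
    }

    public static void main(String[] args) {
        double spaceWidth = 277.832; // space width quoted in the comment
        double scaleX = 0.001;       // first entry of the font matrix [0.001 0 0 0.001 0 0]

        // Multiplying by 1 / 0.001 = 1000 gives a base value near 277832.
        System.out.println(inflated(spaceWidth, scaleX));
        // Multiplying by the font matrix scale keeps the width well below one text-space unit.
        System.out.println(toTextSpace(spaceWidth, scaleX));
    }
}
```

The two results differ by a factor of one million, which is why a wrong direction of scaling shows up as wildly oversized space widths in the extracted text positions.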

[jira] [Commented] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-10-22 Thread Tilman Hausherr (JIRA)

32! A Tf value of 8 means that the size is now 656. > Text extraction getting zero font height, bad widths, and ? for text in this > PDF with Type 3 Fonts > -- > >

[jira] [Updated] (PDFBOX-2508) Text extraction getting zero font height, bad widths, and ? for text in this PDF with Type 3 Fonts

2015-10-21 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-2508: Labels: type3 (was: ) > Text extraction getting zero font height, bad widths, and ?

[jira] [Commented] (PDFBOX-3028) Text extraction broken for jbl example

2015-10-21 Thread Ben McCann (JIRA)

ible to depend less on the averageCharWidth in PDFTextStripper since it seemed like it was at least partially a workaround for that issue > Text extraction broken for jbl example > -- > > Key: PDFBOX-3028 >

[jira] [Comment Edited] (PDFBOX-3028) Text extraction broken for jbl example

2015-10-21 Thread Tilman Hausherr (JIRA)

ring[92.585,79.52399 fs=9.3624 xscale=9.268776 height=7.961241 space=5.5056534 width=5.561264]C {code} After PDFBOX-3042. > Text extraction broken for jbl example > -- > > Key: PDFBOX-3028 > URL: https://issues.apache.org

[jira] [Commented] (PDFBOX-3028) Text extraction broken for jbl example

2015-10-21 Thread Tilman Hausherr (JIRA)

9 fs=9.3624 xscale=9.268776 height=7.961241 space=5.5056534 width=5.561264]C {code} After PDFBOX-3042. > Text extraction broken for jbl example > -- > > Key: PDFBOX-3028 > URL: https://issues.apache.org/jira

[jira] [Comment Edited] (PDFBOX-3028) Text extraction broken for jbl example

2015-10-21 Thread Ben McCann (JIRA)

399 fs=9.3624 xscale=9.268776 height=5.302922 space=51.546127 width=5.561264] {code} > Text extraction broken for jbl example > -- > > Key: PDFBOX-3028 > URL: https://issues.apache.org/jira/browse/PDFBOX-3028 >

[jira] [Resolved] (PDFBOX-3042) Bad space calculation in text extraction

2015-10-21 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved PDFBOX-3042. - Resolution: Fixed > Bad space calculation in text extract

[jira] [Commented] (PDFBOX-3042) Bad space calculation in text extraction

2015-10-21 Thread ASF subversion and git services (JIRA)

3042: - Commit 1709886 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1709886 ] PDFBOX-3042: add test files > Bad space calculation in text extraction > > > Key: PDFBOX-3042 > URL: https://

[jira] [Commented] (PDFBOX-3042) Bad space calculation in text extraction

2015-10-21 Thread ASF subversion and git services (JIRA)

3042: - Commit 1709883 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1709883 ] PDFBOX-3042: don't multiply with fontSize, as this has already been done before > Bad space calculation in text extraction > > >

[jira] [Updated] (PDFBOX-3042) Bad space calculation in text extraction

2015-10-21 Thread Tilman Hausherr (JIRA)

> Bad space calculation in text extraction > > > Key: PDFBOX-3042 > URL: https://issues.apache.org/jira/browse/PDFBOX-3042 > Project: PDFBox > Issue Type: Bug > Com

[jira] [Created] (PDFBOX-3042) Bad space calculation in text extraction

2015-10-21 Thread Tilman Hausherr (JIRA)

Tilman Hausherr created PDFBOX-3042: --- Summary: Bad space calculation in text extraction Key: PDFBOX-3042 URL: https://issues.apache.org/jira/browse/PDFBOX-3042 Project: PDFBox Issue Type

[jira] [Commented] (PDFBOX-2584) Text extraction reports zero character widths

2015-10-20 Thread JIRA

to check the 1.8.x branch yesterday. Maybe it's fixed in 1.8.11 as well? Pavel is complaining about 1.8.8. I'm going to check that later > Text extraction reports zero character widths > -- > > Key: PDFBOX-25

[jira] [Commented] (PDFBOX-2584) Text extraction reports zero character widths

2015-10-20 Thread Tilman Hausherr (JIRA)

04 height=5.326662 space=2.2251122 width=5.7788696]N String[624.0,213.18 fs=1.0 xscale=8.004 height=5.326662 space=2.2251122 width=5.7788696]N {code} So what is the problem here? > Text extraction reports zero character widths > -- > >

[jira] [Updated] (PDFBOX-2584) Text extraction reports zero character widths

2015-10-20 Thread JIRA

[ https://issues.apache.org/jira/browse/PDFBOX-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-2584: --- Affects Version/s: (was: 2.0.0) > Text extraction reports zero character wid

[jira] [Commented] (PDFBOX-2584) Text extraction reports zero character widths

2015-10-20 Thread JIRA

22 width=4.4501953]8 String[624.0,213.18 fs=1.0 xscale=8.004 height=5.324927 space=2.2251122 width=5.7788696]N > Text extraction reports zero character widths > -- > > Key: PDFBOX-2584 > URL: https://issue

[jira] [Resolved] (PDFBOX-3038) Text extraction shows glyphs with zero height

2015-10-20 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved PDFBOX-3038. - Resolution: Fixed Assignee: Tilman Hausherr Setting to resolved; - Text extraction

[jira] [Commented] (PDFBOX-3038) Text extraction shows glyphs with zero height

2015-10-20 Thread Tilman Hausherr (JIRA)

.913 space=20.25 width=2.25] {code} > Text extraction shows glyphs with zero height > - > > Key: PDFBOX-3038 > URL: https://issues.apache.org/jira/browse/PDFBOX-3038 > Project: PDFBox >

[jira] [Commented] (PDFBOX-3038) Text extraction shows glyphs with zero height

2015-10-20 Thread Tilman Hausherr (JIRA)

ight (speech by FTC head). > Text extraction shows glyphs with zero height > - > > Key: PDFBOX-3038 > URL: https://issues.apache.org/jira/browse/PDFBOX-3038 > Project: PDFBox >

[jira] [Commented] (PDFBOX-3038) Text extraction shows glyphs with zero height

2015-10-20 Thread ASF subversion and git services (JIRA)

3038: - Commit 1709647 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1709647 ] PDFBOX-3038: return BBox from font descriptor if font BBox empty > Text extraction shows glyphs with zero height > - > >

[jira] [Commented] (PDFBOX-3038) Text extraction shows glyphs with zero height

2015-10-20 Thread ASF subversion and git services (JIRA)

3038: - Commit 1709646 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1709646 ] PDFBOX-3038: add test files > Text extraction shows glyphs with zero height > - > > Key: PDFBOX-3038 > U

[jira] [Updated] (PDFBOX-3038) Text extraction shows glyphs with zero height

2015-10-20 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3038: Attachment: PDFBOX-3038-001033-p2.pdf > Text extraction shows glyphs with zero hei

[jira] [Resolved] (PDFBOX-3037) Text extraction decodes image files

2015-10-20 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved PDFBOX-3037. - Resolution: Fixed > Text extraction decodes image fi

[jira] [Commented] (PDFBOX-3037) Text extraction decodes image files

2015-10-20 Thread ASF subversion and git services (JIRA)

3037: - Commit 1709640 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1709640 ] PDFBOX-3037: check for image to avoid decoding them when doing text extraction > Text extraction decodes image files > --- > >

[jira] [Commented] (PDFBOX-3037) Text extraction decodes image files

2015-10-20 Thread ASF subversion and git services (JIRA)

3037: - Commit 1709639 from [~tilman] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1709639 ] PDFBOX-3037: add DrawObject method for content extractor engine > Text extraction decodes image files > --- > > Key: PDFBOX-3037 >

[jira] [Updated] (PDFBOX-3038) Text extraction shows glyphs with zero height

2015-10-20 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3038: Labels: regression (was: ) > Text extraction shows glyphs with zero hei

[jira] [Created] (PDFBOX-3038) Text extraction shows glyphs with zero height

2015-10-20 Thread Tilman Hausherr (JIRA)

Tilman Hausherr created PDFBOX-3038: --- Summary: Text extraction shows glyphs with zero height Key: PDFBOX-3038 URL: https://issues.apache.org/jira/browse/PDFBOX-3038 Project: PDFBox Issue

[jira] [Created] (PDFBOX-3037) Text extraction decodes image files

2015-10-20 Thread Tilman Hausherr (JIRA)

Tilman Hausherr created PDFBOX-3037: --- Summary: Text extraction decodes image files Key: PDFBOX-3037 URL: https://issues.apache.org/jira/browse/PDFBOX-3037 Project: PDFBox Issue Type: Bug

[jira] [Updated] (PDFBOX-3037) Text extraction decodes image files

2015-10-20 Thread Tilman Hausherr (JIRA)

[ https://issues.apache.org/jira/browse/PDFBOX-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-3037: Attachment: 001131.pdf > Text extraction decodes image fi

[jira] [Commented] (PDFBOX-586) Text Extraction on Android

2015-10-19 Thread Tilman Hausherr (JIRA)

rted, 1.8 supports RC4, and 2.0 supports also AES. > Text Extraction on Android > -- > > Key: PDFBOX-586 > URL: https://issues.apache.org/jira/browse/PDFBOX-586 > Project: PDFBox > Issue Type: Improvem


401 - 500 of 1060 matches
