[jira] [Updated] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.
[ https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3970:
    Attachment: wrong_space_parsed_sample.pdf

> x,y co-ordinates of the text inside the cell are not getting correctly.
> ---
>
>                 Key: PDFBOX-3970
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3970
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>         Environment: Operating system: Windows 7 (64 bit).
>            Reporter: Navnath Kumbhar
>            Priority: Major
>              Labels: how-to
>         Attachments: LegacyPDFStreamEngine.java, LegacyPDFStreamEngine.java, formula-marked-34.png, paragraphNextToTable-marked-1.png, paragraphNextToTable.pdf, simpleAnnotation.pdf, wrong_space_parsed_sample.pdf
>
>
> Hello Support Team,
> I am working on a project which parses a whole PDF document and stores the extracted text in a .txt file that can be read by another product.
> My issue concerns extracting the text inside a table cell:
> *The x,y coordinates of the text inside the cell are not reported correctly.*
> The Y value of the last text line in the cell is larger than the cell's max-Y value.
> I have attached a test file that shows this bug.
> As you can see in the test document, there is one cell with text in it and a text paragraph next to that cell.
> The x-y coordinates that I get from pdfbox for all the paths (two vertical and two horizontal lines) of the cell are (in x1,y1,x2,y2 format):
> Horizontal line 1: [100,88,220,88]
> Horizontal line 2: [100,120,220,120]
> Vertical line 1: [100,88,100,120]
> Vertical line 2: [220,88,220,120]
> (The Y values of the above paths are final values, obtained by subtracting the value given by pdfbox from the height of the page, because for paths the y-values run from bottom to top.)
> The bounding box of the last line in that cell is [102,114,59,7], and hence the max-Y of that line becomes 121 (min-Y + height).
>
> So, if we compare the max-Y value of the cell (i.e. 120) with that of the last line in the cell (i.e. 121), the line clearly extends outside the cell.
> What can be the possible reason for this?
> Thank you in advance!
> Regards,
> Navnath Kumbhar

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
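The arithmetic in the report can be reproduced with plain java.awt geometry. The following is only a minimal sketch using the numbers quoted above; the class and method names (CellContainmentSketch, flipY, fromCorners, overflowBelow) are made up for illustration and are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not a PDFBox class: reproduces the numbers from the report.
final class CellContainmentSketch {

    /** Converts a bottom-up PDF y value to a top-down y value for a page of the given height. */
    static double flipY(double pdfY, double pageHeight) {
        return pageHeight - pdfY;
    }

    /** Builds a rectangle from two opposite corners given in x1,y1,x2,y2 form. */
    static Rectangle2D fromCorners(double x1, double y1, double x2, double y2) {
        Rectangle2D r = new Rectangle2D.Double();
        r.setFrameFromDiagonal(x1, y1, x2, y2);
        return r;
    }

    /** Amount by which the text box extends past the cell's max-Y edge (0 if it fits). */
    static double overflowBelow(Rectangle2D cell, Rectangle2D text) {
        return Math.max(0, text.getMaxY() - cell.getMaxY());
    }

    public static void main(String[] args) {
        // Cell spanned by the four reported lines (already in top-down coordinates).
        Rectangle2D cell = fromCorners(100, 88, 220, 120);
        // Last text line as reported: x, y, width, height.
        Rectangle2D lastLine = new Rectangle2D.Double(102, 114, 59, 7);

        System.out.println("text max-Y = " + lastLine.getMaxY());
        System.out.println("overflow   = " + overflowBelow(cell, lastLine));
        System.out.println("contained  = " + cell.contains(lastLine));
    }
}
```

With the reported values the text box ends one unit past the cell edge, so the containment test fails even though the text looks inside the cell when rendered. flipY is included only to show the page-height subtraction the reporter describes for paths; the page height itself is not given in the report.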
[jira] [Updated] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.
[ https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3970:
    Attachment: LegacyPDFStreamEngine.java

2nd version, cleaned up a bit.
[jira] [Updated] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.
[ https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3970:
    Attachment: LegacyPDFStreamEngine.java

It looked good at first, but when I ran it on the file from PDFBOX-4000 I got some weird effects ("N" and "o" had almost the same visual bounds).

One problem in your code is that you're using the height instead of the upper Y coordinate. I then removed the translation, because we need the shapes from the baseline and don't care about the rendering position on the page (this is done later in the code).

I am attaching my current code. Make a diff against yours to see what it is about. It is a bit messy. I see you changed the type3 code at the bottom; I didn't investigate that, I'm using the existing type3 code.

I then ran the tests and got many regressions. Some can be explained (e.g. PDFBOX-3062-002207-p1.pdf), but at least one is weird: the sorted result of PDFBOX-2984-page180°.pdf.
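The difference between "the height" and "the upper Y coordinate" of a glyph's bounds can be seen with a single rectangle. This is a minimal sketch with made-up glyph numbers (a glyph with a descender, in a y-up glyph space where the baseline is y = 0); GlyphExtentSketch and its methods are hypothetical names, not PDFBox API.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical illustration, not PDFBox code.
final class GlyphExtentSketch {

    /** Full height of the bounds, descender included. */
    static double fullHeight(Rectangle2D glyphBounds) {
        return glyphBounds.getHeight();
    }

    /** Extent above the baseline only, i.e. the upper Y coordinate of the bounds. */
    static double heightAboveBaseline(Rectangle2D glyphBounds) {
        return Math.max(0, glyphBounds.getMaxY());
    }

    public static void main(String[] args) {
        // Assumed bounds of a glyph such as "g": from 200 units below the baseline to 450 above it.
        Rectangle2D g = new Rectangle2D.Double(50, -200, 400, 650);
        // Assumed bounds of a glyph such as "o": sits on the baseline, 450 units tall.
        Rectangle2D o = new Rectangle2D.Double(40, 0, 420, 450);

        System.out.println("g: height=" + fullHeight(g) + " upperY=" + heightAboveBaseline(g));
        System.out.println("o: height=" + fullHeight(o) + " upperY=" + heightAboveBaseline(o));
    }
}
```

In this sketch both glyphs reach the same upper Y, but the glyph with the descender has a larger height, so a value derived from the height over-estimates how far it rises above the baseline.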
[jira] [Updated] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.
[ https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Navnath Kumbhar updated PDFBOX-3970:
    Attachment: simpleAnnotation.pdf

Hello Tilman,
Thank you for pointing out the right code snippet. I have made some changes in LegacyPDFStreamEngine.java.
Below is my code change:
{code:java}
@Override
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode,
                         Vector displacement) throws IOException
{
    //
    // legacy calculations which were previously in PDFStreamEngine
    //
    // DO NOT USE THIS CODE UNLESS YOU ARE WORKING WITH PDFTextStripper.
    // THIS CODE IS DELIBERATELY INCORRECT
    //

    PDGraphicsState state = getGraphicsState();
    Matrix ctm = state.getCurrentTransformationMatrix();
    float fontSize = state.getTextState().getFontSize();
    float horizontalScaling = state.getTextState().getHorizontalScaling() / 100f;
    Matrix textMatrix = getTextMatrix();

    Shape glyphShape = getActualGlyphBoundingBox(textRenderingMatrix, font, code);
    BoundingBox bbox = new BoundingBox((float)glyphShape.getBounds2D().getMinX(),
                                       (float)glyphShape.getBounds2D().getMinY(),
                                       (float)glyphShape.getBounds2D().getMaxX(),
                                       (float)glyphShape.getBounds2D().getMaxY());
    if (bbox.getLowerLeftY() < Short.MIN_VALUE)
    {
        // PDFBOX-2158 and PDFBOX-3130
        // files by Salmat eSolutions / ClibPDF Library
        bbox.setLowerLeftY(- (bbox.getLowerLeftY() + 65536));
    }
    // 1/2 the bbox is used as the height todo: why?
    float glyphHeight = bbox.getHeight()/2;

    /*PDFontDescriptor fontDescriptor = font.getFontDescriptor();
    if (fontDescriptor != null)
    {
        float capHeight = fontDescriptor.getCapHeight();
        if (capHeight != 0 && (capHeight < glyphHeight || glyphHeight == 0))
        {
            glyphHeight = capHeight;
        }
    }*/

    // transformPoint from glyph space -> text space
    float height;
    if (font instanceof PDType3Font)
    {
        height = font.getFontMatrix().transformPoint(0, glyphHeight).y;
    }
    else
    {
        height = glyphHeight / 1000;
    }
    .
    .
    .
}
{code}
And here is the *getActualGlyphBoundingBox()* method.
{code:java}
private Shape getActualGlyphBoundingBox(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
{
    GeneralPath path = null;
    AffineTransform at = textRenderingMatrix.createAffineTransform();
    at.concatenate(font.getFontMatrix().createAffineTransform());
    if (font instanceof PDType3Font)
    {
        PDType3Font t3Font = (PDType3Font) font;
        PDType3CharProc charProc = t3Font.getCharProc(code);
        if (charProc != null)
        {
            PDRectangle glyphBBox = charProc.getGlyphBBox();
            if (glyphBBox != null)
            {
                path = glyphBBox.toGeneralPath();
            }
        }
    }
    else if (font instanceof PDVectorFont)
    {
        PDVectorFont vectorFont = (PDVectorFont) font;
        path = vectorFont.getPath(code);

        if (font instanceof PDTrueTypeFont)
        {
            PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
            int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
            at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
        }
        if (font instanceof PDType0Font)
        {
            PDType0Font t0font = (PDType0Font) font;
            if (t0font.getDescendantFont() instanceof PDCIDFontType2)
            {
                int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
        }
    }
    else if (font instanceof PDSimpleFont)
    {
        PDSimpleFont simpleFont = (PDSimpleFont) font;

        // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
        // which is why PDVectorFont is tried first.
        String name = simpleFont.getEncoding().getName(code);
        path = simpleFont.getPath(name);
    }
    else
    {
        // shouldn't happen, please open issue in JIRA
        System.out.println("Unknown font class: " + font.getClass());
    }
    if (path == null)
    {
        return null;
    }
    //return at.createTransformedShape(path.getBounds2D());
    return path.getBounds2D();
}
{code}
I am getting satisfactory results for text
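The 1000d / unitsPerEm scaling and the later division by 1000 in the posted code can be tried in isolation with plain java.awt classes. The sketch below is only an illustration of that unit conversion under assumed numbers (a 2048 units-per-em font and invented glyph bounds); GlyphUnitScalingSketch and its methods are hypothetical names and do not exist in PDFBox.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Hypothetical illustration of the unit conversion, not PDFBox code.
final class GlyphUnitScalingSketch {

    /** Scales bounds given in font design units to the 1000-unit glyph space. */
    static Rectangle2D toGlyphSpace(Rectangle2D boundsInFontUnits, int unitsPerEm) {
        double s = 1000d / unitsPerEm;
        AffineTransform at = AffineTransform.getScaleInstance(s, s);
        return at.createTransformedShape(boundsInFontUnits).getBounds2D();
    }

    /** Converts a glyph-space length to text space, as "glyphHeight / 1000" does above. */
    static double toTextSpace(double glyphSpaceLength) {
        return glyphSpaceLength / 1000d;
    }

    public static void main(String[] args) {
        // Assumed glyph bounds in the design units of a 2048 units-per-em TrueType font.
        Rectangle2D raw = new Rectangle2D.Double(0, 0, 1024, 1536);
        Rectangle2D scaled = toGlyphSpace(raw, 2048);

        System.out.println("glyph space: " + scaled.getWidth() + " x " + scaled.getHeight());
        System.out.println("text space height: " + toTextSpace(scaled.getHeight()));
    }
}
```

Note that in the posted method the transform at is built and scaled but never applied, because the method returns path.getBounds2D() directly; for a font whose units-per-em is not 1000, the returned bounds would then still be in design units.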
[jira] [Updated] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.
[ https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3970:
    Labels: how-to  (was: )
[jira] [Updated] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.
[ https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Navnath Kumbhar updated PDFBOX-3970:
    Attachment: formula-marked-34.png

Hello Tilman,
Thank you for your helpful feedback. I tried the suggestion that you gave, except for the font bounding box part. It worked well, but it has a limitation on some PDF pages where I am trying to extract mathematical formulas.

I am attaching the result of *DrawPrintTextLocations* for one such page. As you will see in the attachment, the *big parenthesis* and *summation* symbols get mixed up with the preceding lines. (This is as far as the red rectangles are concerned, whose coordinates are computed by Java, as you mentioned in your last comment.) Generally, I see that the red box is inside the font bounding box, but that is not the case in the attached example.

Can we just use the glyph bounds to extract text, since they look like the perfect bounds for the text position? If so, can we do it with the TextPosition class? If not, what other heuristic can we use in such cases? Why do we need the red rectangle, which is not real but only a heuristic for extracting the text?

Thank you again for your help!
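The effect described for tall symbols can be illustrated with plain rectangles, independent of what TextPosition exposes. This is a rough sketch under invented coordinates (top-down y, one tall summation-like glyph on the second line); LineBoundsSketch and its methods are hypothetical and not part of PDFBox, and the sketch does not answer whether glyph bounds are the right basis for line grouping.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not PDFBox code.
final class LineBoundsSketch {

    /** Union of all glyph bounds on one line, or null for an empty list. */
    static Rectangle2D union(List<Rectangle2D> glyphBounds) {
        Rectangle2D result = null;
        for (Rectangle2D g : glyphBounds) {
            result = (result == null) ? (Rectangle2D) g.clone() : result.createUnion(g);
        }
        return result;
    }

    /** Length of the vertical range shared by two rectangles (0 if they do not overlap). */
    static double verticalOverlap(Rectangle2D a, Rectangle2D b) {
        double top = Math.max(a.getMinY(), b.getMinY());
        double bottom = Math.min(a.getMaxY(), b.getMaxY());
        return Math.max(0, bottom - top);
    }

    public static void main(String[] args) {
        // Assumed glyph bounds (x, y, width, height) for two consecutive lines.
        List<Rectangle2D> line1 = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(10, 100, 6, 10),
                new Rectangle2D.Double(17, 102, 6, 8));
        List<Rectangle2D> line2 = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(10, 108, 12, 22),  // tall summation-like glyph
                new Rectangle2D.Double(24, 115, 6, 10));

        System.out.println("overlap = " + verticalOverlap(union(line1), union(line2)));
    }
}
```

In this made-up layout the tall glyph starts above the bottom of the first line, so the two line boxes overlap vertically even though the ordinary glyphs do not. A line-grouping heuristic based on exact glyph bounds would have to tolerate that kind of overlap.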
[jira] [Updated] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.
[ https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3970:
    Attachment: paragraphNextToTable-marked-1.png

You didn't attach any code so I don't know how you got your values. I have attached the result file of the DrawPrintTextLocations example.