[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446711#comment-16446711
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/21/18 9:15 AM:
--

that's just a warm-up and to get rid of (some) binaries in the patch.


was (Author: tilman):
that's just a warm-up and to get rid of binaries in the patch.

> Enable rendering of Indian languages, by reading and utilizing the GSUB table
> -
>
> Key: PDFBOX-4189
> URL: https://issues.apache.org/jira/browse/PDFBOX-4189
> Project: PDFBox
> Issue Type: New Feature
> Components: FontBox, PDModel
> Reporter: Palash Ray
> Priority: Major
> Attachments: Bengali-text-after.pdf, Bengali-text-before.pdf
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Implemented proper rendering of Indian languages, which need extensive Glyph 
> substitution. The GSUB table has been read and used effectively to replace 
> some compound words with their respective Glyphs. All tests are passing. I 
> have tested this for the Bengali font. Please review these changes and let me 
> know if it makes sense to incorporate these.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread Palash Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446685#comment-16446685
 ] 

Palash Ray edited comment on PDFBOX-4189 at 4/21/18 8:28 AM:
-

I know. If you ask me, it's a real shame. The reason we have abstractions and 
specifications is that we are supposed to be able to figure out pretty much all 
the rules without having to write language-specific handlers. But I think even 
the font developers are to blame. They should push these big companies who 
build these specifications to do a better job. Anyway, sorry for the rant :)


was (Author: paawak):
I know. If you ask me, it's a real shame. The reason we have abstractions and 
specifications is that we are supposed to be able to figure out pretty much the 
rules without having to write language-specific handlers. But I think even the 
font developers are to blame. They should push these big companies who build 
these specifications to do a better job. Anyway, sorry for the rant :)




[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-15 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438914#comment-16438914
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/16/18 1:45 AM:
--

{quote}
For correct text positioning using mixed language information from the 
following tables might be useful:
- GPOS: to adjust the glyph position
- BASE: baseline offsets on a script-by-script basis.
- JSTF: justification information, including whitespace and Kashida adjustments.
- BIDI Mirroring: 
https://www.unicode.org/Public/10.0.0/ucd/BidiMirroring.txt{quote}

It's probably worth noting that BASE, JSTF and BiDi are concerned with 
_paragraph-level_ layout, which happens at a higher level than the proposed 
layout() - which would be concerned with only a single script in a single 
direction (i.e. only OpenType _shaping_). BASE and BiDi are related to changes 
between different scripts, while JSTF is to aid in making good line break 
choices. So all of that functionality will happen somewhere else (this fits 
very closely with the layout code we have for forms, for example). So in layout 
we're really only going to be concerned with GPOS and GSUB features. That way 
the only options that one might want to pass to layout would be the list of 
which [feature 
flags|https://docs.microsoft.com/en-us/typography/opentype/spec/featurelist] to 
apply.

Maybe layout() should be called shapeText() to emphasize this distinction?
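
To make that boundary concrete, here is a minimal Java sketch of what a shaping entry point that takes only a list of feature tags could look like. Every name in it (FeatureSet, TextShaper, ToyShaper, the glyph ids) is invented for illustration and is not PDFBox API. The "liga" handling is a stand-in for real GSUB processing, and the toy only handles BMP characters.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: none of these types exist in PDFBox.
final class FeatureSet {
    private final Set<String> tags;

    FeatureSet(String... tags) {
        for (String tag : tags) {
            // OpenType feature tags are four characters long, e.g. "liga" or "kern".
            if (tag.length() != 4) {
                throw new IllegalArgumentException("Not a 4-character feature tag: " + tag);
            }
        }
        this.tags = Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList(tags)));
    }

    boolean isEnabled(String tag) {
        return tags.contains(tag);
    }
}

// Shaping works on one run of text: a single script in a single direction.
// Splitting a paragraph into such runs (BiDi, BASE, JSTF) is the caller's job.
interface TextShaper {
    int[] shapeText(String run, FeatureSet features);
}

// Toy shaper with made-up glyph ids: gid = UTF-16 code unit, plus one fake "fi" ligature.
final class ToyShaper implements TextShaper {
    static final int FI_LIGATURE_GID = 60000; // invented for this example

    @Override
    public int[] shapeText(String run, FeatureSet features) {
        int[] out = new int[run.length()];
        int n = 0;
        for (int i = 0; i < run.length(); i++) {
            boolean fi = features.isEnabled("liga")
                    && run.charAt(i) == 'f'
                    && i + 1 < run.length()
                    && run.charAt(i + 1) == 'i';
            if (fi) {
                out[n++] = FI_LIGATURE_GID;
                i++; // the ligature consumed two characters
            } else {
                out[n++] = run.charAt(i);
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

With an empty FeatureSet the toy shaper returns one glyph id per character. With new FeatureSet("liga") the pair "fi" collapses into a single glyph id. Nothing else about the run (direction, line breaks, baseline) is visible at this level, which is the point of the distinction.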


was (Author: jahewson):
{quote}
For correct text positioning using mixed language information from the 
following tables might be useful:
- GPOS: to adjust the glyph position
- BASE: baseline offsets on a script-by-script basis.
- JSTF: justification information, including whitespace and Kashida adjustments.
- BIDI Mirroring: 
https://www.unicode.org/Public/10.0.0/ucd/BidiMirroring.txt{quote}

It's probably worth noting that BASE, JSTF and BiDi are concerned with 
_paragraph-level_ layout, which happens at a higher level than the proposed 
layout() - which would be concerned with only a single script in a single 
direction (i.e. only OpenType _shaping_). BASE and BiDi are related to changes 
between different scripts, while JSTF is to aid in making good line break 
choices. So all of that functionality will happen somewhere else (this fits 
very closely with the layout code we have for forms, for example). So in layout 
we're really only going to be concerned with GPOS and GSUB features. That way 
the only options that one might want to pass to layout would be this list of 
which [feature 
flags|https://docs.microsoft.com/en-us/typography/opentype/spec/featurelist] to 
apply.

Maybe layout() should be called shapeText() to emphasize this distinction?




[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-15 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438914#comment-16438914
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/16/18 1:44 AM:
--

{quote}
For correct text positioning using mixed language information from the 
following tables might be useful:
- GPOS: to adjust the glyph position
- BASE: baseline offsets on a script-by-script basis.
- JSTF: justification information, including whitespace and Kashida adjustments.
- BIDI Mirroring: 
https://www.unicode.org/Public/10.0.0/ucd/BidiMirroring.txt{quote}

It's probably worth noting that BASE, JSTF and BiDi are concerned with 
_paragraph-level_ layout, which happens at a higher level than the proposed 
layout() - which would be concerned with only a single script in a single 
direction (i.e. only OpenType _shaping_). BASE and BiDi are related to changes 
between different scripts, while JSTF is to aid in making good line break 
choices. So all of that functionality will happen somewhere else (this fits 
very closely with the layout code we have for forms, for example). So in layout 
we're really only going to be concerned with GPOS and GSUB features. That way 
the only options that one might want to pass to layout would be this list of 
which [feature 
flags|https://docs.microsoft.com/en-us/typography/opentype/spec/featurelist] to 
apply.

Maybe layout() should be called shapeText() to emphasize this distinction?


was (Author: jahewson):
{quote}
For correct text positioning using mixed language information from the 
following tables might be useful:
- GPOS: to adjust the glyph position
- BASE: baseline offsets on a script-by-script basis.
- JSTF: justification information, including whitespace and Kashida adjustments.
- BIDI Mirroring: 
https://www.unicode.org/Public/10.0.0/ucd/BidiMirroring.txt{quote}

It's probably worth noting that BASE, JSTF and BiDi are concerned with 
_paragraph-level_ layout, which happens at a higher level than the proposed 
layout() - which would be concerned with only a single script in a single 
direction (i.e. only OpenType _shaping_). BASE and BiDi are related to changes 
between different scripts, while JSTF is to aid in making good line break 
choices. So all of that functionality will happen somewhere else (this fits 
very closely with the layout code for forms, for example). So in layout we're 
really only going to be concerned with GPOS and GSUB features. That way the 
only options that one might want to pass to layout would be this list of which 
[feature 
flags|https://docs.microsoft.com/en-us/typography/opentype/spec/featurelist] to 
apply.

Maybe layout() should be called shapeText() to emphasize this distinction?




[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-15 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438914#comment-16438914
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/16/18 1:41 AM:
--

{quote}
For correct text positioning using mixed language information from the 
following tables might be useful:
- GPOS: to adjust the glyph position
- BASE: baseline offsets on a script-by-script basis.
- JSTF: justification information, including whitespace and Kashida adjustments.
- BIDI Mirroring: 
https://www.unicode.org/Public/10.0.0/ucd/BidiMirroring.txt{quote}

It's probably worth noting that BASE, JSTF and BiDi are concerned with 
_paragraph-level_ layout, which happens at a higher level than the proposed 
layout() - which would be concerned with only a single script in a single 
direction (i.e. only OpenType _shaping_). BASE and BiDi are related to changes 
between different scripts, while JSTF is to aid in making good line break 
choices. So all of that functionality will happen somewhere else (this fits 
very closely with the layout code for forms, for example). So in layout we're 
really only going to be concerned with GPOS and GSUB features. That way the 
only options that one might want to pass to layout would be this list of which 
[feature 
flags|https://docs.microsoft.com/en-us/typography/opentype/spec/featurelist] to 
apply.

Maybe layout() should be called shapeText() to emphasize this distinction?


was (Author: jahewson):
{quote}
For correct text positioning using mixed language information from the 
following tables might be useful:
- GPOS: to adjust the glyph position
- BASE: baseline offsets on a script-by-script basis.
- JSTF: justification information, including whitespace and Kashida adjustments.
- BIDI Mirroring: 
https://www.unicode.org/Public/10.0.0/ucd/BidiMirroring.txt{quote}

It's probably worth noting that BASE, JSTF and BiDi are concerned with 
_paragraph-level_ layout, which happens at a higher level than the proposed 
layout() - which would be concerned with only a single script in a single 
direction (i.e. only OpenType _shaping_). BASE and BiDi are related to changes 
between different scripts, while JSTF is to aid in making good line break 
choices. So all of that functionality will happen somewhere else (this fits 
very closely with the layout code for forms, for example). So in layout we're 
really only going to be concerned with GPOS and GSUB features. That way the 
only options that one might want to pass to layout would be this list of which 
[feature 
flags|https://docs.microsoft.com/en-us/typography/opentype/spec/featurelist] to 
apply.




[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-15 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438914#comment-16438914
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/16/18 1:40 AM:
--

{quote}
For correct text positioning using mixed language information from the 
following tables might be useful:
- GPOS: to adjust the glyph position
- BASE: baseline offsets on a script-by-script basis.
- JSTF: justification information, including whitespace and Kashida adjustments.
- BIDI Mirroring: 
https://www.unicode.org/Public/10.0.0/ucd/BidiMirroring.txt{quote}

It's probably worth noting that BASE, JSTF and BiDi are concerned with 
_paragraph-level_ layout, which happens at a higher level than the proposed 
layout() - which would be concerned with only a single script in a single 
direction (i.e. only OpenType _shaping_). BASE and BiDi are related to changes 
between different scripts, while JSTF is to aid in making good line break 
choices. So all of that functionality will happen somewhere else (this fits 
very closely with the layout code for forms, for example). So in layout we're 
really only going to be concerned with GPOS and GSUB features. That way the 
only options that one might want to pass to layout would be this list of which 
[feature 
flags|https://docs.microsoft.com/en-us/typography/opentype/spec/featurelist] to 
apply.


was (Author: jahewson):
For correct text positioning using mixed language information from the 
following tables might be useful:
- GPOS: to adjust the glyph position
- BASE: baseline offsets on a script-by-script basis.
- JSTF: justification information, including whitespace and Kashida adjustments.
- BIDI Mirroring: https://www.unicode.org/Public/10.0.0/ucd/BidiMirroring.txt

bq. here

BASE, JSTF and BiDi are concerned with _paragraph-level_ layout, which happens 
at a higher level than the proposed layout() - which would be concerned with 
only a single script in a single direction (i.e. only OpenType _shaping_). BASE 
and BiDi are related to changes between different scripts, while JSTF is to aid 
in making good line break choices. So all of that functionality will happen 
somewhere else (this fits very closely with the layout code for forms, for 
example). So in layout we're really only going to be concerned with GPOS and 
GSUB features. That way the only options that one might want to pass to layout 
would be this list of which [feature 
flags|https://docs.microsoft.com/en-us/typography/opentype/spec/featurelist] to 
apply.




[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-15 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438609#comment-16438609
 ] 

Maruan Sahyoun edited comment on PDFBOX-4189 at 4/15/18 8:34 AM:
-

The patch is a great and - given several questions we had in the past - 
important addition to PDFBox.

In the longer run I see some additions we might already think about 
conceptually and/or start introducing in the public API. As I haven't reviewed 
the patch, the list below is meant as a hint for possible additions; they may 
already be included.

For correct text positioning with mixed languages, information from the 
following tables might be useful:
- GPOS: to adjust the glyph position
- BASE: baseline offsets on a script-by-script basis.
- JSTF: justification information, including whitespace and Kashida adjustments.
- BIDI Mirroring: https://www.unicode.org/Public/10.0.0/ucd/BidiMirroring.txt

To allow the user to override the language system identified by the script 
being used, we might want to add {{setLanguage/getLanguage}}, which can be 
called prior to {{showText}} if an override is needed.

Putting that into an internal {{layout}} method as John suggested would also 
allow us to put it behind a feature flag where one could enable/disable the 
processing. We might also mark that feature as **experimental** and specify 
which languages it has been tested with (to some extent).
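
As a rough illustration of the two ideas above (a language override set before showText, and an on/off switch around the layout step), a minimal Java sketch could look like this. ShapingSettings and ToyContentStream are hypothetical names, not existing PDFBox classes, and the toy showText only reports which path it would take instead of writing anything to a content stream. Defaulting the switch to off is an assumption made here because the feature would be experimental.

```java
// Hypothetical settings object; PDFBox has no such class.
final class ShapingSettings {
    private boolean shapingEnabled; // experimental feature, off unless enabled (assumption)
    private String language;        // null means "use the language derived from the script"

    void setShapingEnabled(boolean enabled) {
        this.shapingEnabled = enabled;
    }

    boolean isShapingEnabled() {
        return shapingEnabled;
    }

    void setLanguage(String language) {
        this.language = language;
    }

    String getLanguage() {
        return language;
    }

    // The override wins; otherwise fall back to what the script detection produced.
    String effectiveLanguage(String derivedFromScript) {
        return language != null ? language : derivedFromScript;
    }
}

// Stand-in for a content stream, only to show where the switch would sit.
final class ToyContentStream {
    private final ShapingSettings settings = new ShapingSettings();

    ShapingSettings settings() {
        return settings;
    }

    // Returns a description of the path taken instead of writing PDF operators.
    String showText(String text, String languageDerivedFromScript) {
        if (!settings.isShapingEnabled()) {
            return "encode(" + text + ")";
        }
        String language = settings.effectiveLanguage(languageDerivedFromScript);
        return "layout(" + text + ", language=" + language + ")";
    }
}
```

A caller would enable the switch and optionally call setLanguage before showText; with the switch off, behaviour stays exactly as it is today.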

This is mainly meant to understand which capabilities belong where as I'm 
looking to add the processing to layout of interactive form field values.






[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/15/18 1:04 AM:
--

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (but we can 
relax this a bit, as I explain below).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you want to 
put your glyph substitution code inside PDPageContentStream#showText, or more 
precisely inside 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for subsetting, which is not possible in 
encode().
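
For readers less familiar with GSUB, the substitution half can be pictured as a rewrite over a sequence of glyph ids. The Java sketch below is a deliberately simplified model of a many-to-one (ligature-style) substitution. ToyLigatureSubstitution and all glyph ids are made up, and a real GSUB table is organised quite differently (scripts, features, lookups, coverage tables), so this is only meant to show the shape of the operation, not how the patch or any font implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a many-to-one glyph substitution; not a GSUB table parser.
final class ToyLigatureSubstitution {
    // component glyph ids -> replacement glyph id
    private final Map<List<Integer>, Integer> ligatures = new LinkedHashMap<>();
    private int longest = 0;

    void addLigature(int replacementGid, Integer... componentGids) {
        ligatures.put(Arrays.asList(componentGids), replacementGid);
        longest = Math.max(longest, componentGids.length);
    }

    // Scans left to right and replaces the longest matching component sequence.
    List<Integer> apply(List<Integer> gids) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < gids.size()) {
            int consumed = 0;
            for (int len = Math.min(longest, gids.size() - i); len >= 2; len--) {
                Integer replacement = ligatures.get(gids.subList(i, i + len));
                if (replacement != null) {
                    out.add(replacement);
                    consumed = len;
                    break;
                }
            }
            if (consumed == 0) {
                out.add(gids.get(i)); // no substitution applies, keep the glyph
                consumed = 1;
            }
            i += consumed;
        }
        return out;
    }
}
```

Note that the replacement glyph ids in the output have no character behind them, which is exactly why the result cannot be expressed through a String-based encode() and why subsetting needs to learn about glyph ids.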

*Subsetting*: We currently track which glyphs need to be included in a subset 
by using their Unicode code point, but with GSUB enabled we will have to keep 
track of some substituted glyphs via their glyph id (GID), because the glyphs 
which result from a substitution don't necessarily have their own code points 
(no entry in the cmap table). This should be easy to add to TTFSubsetter as it 
already tracks glyph ids internally; we just need the ability to pass them in 
too, e.g. addGlyphId(integer). Then PDPageContentStream#showText will be 
responsible for passing the glyph ids. But now we need showText to know about 
those glyph ids, which leads me to the next point.
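
A small Java sketch of that idea, under the assumption that the subsetter keeps one set of glyph ids and offers two ways in: by code point (resolved through the cmap) and directly by glyph id. ToySubsetter is a stand-in written for this example and does not reproduce the real TTFSubsetter; the cmap is modelled as a plain map and the glyph ids are made up.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Stand-in for a font subsetter; only the glyph bookkeeping is sketched.
final class ToySubsetter {
    private final Map<Integer, Integer> cmap; // code point -> glyph id
    private final SortedSet<Integer> glyphIds = new TreeSet<>();

    ToySubsetter(Map<Integer, Integer> cmap) {
        this.cmap = new HashMap<>(cmap);
        glyphIds.add(0); // glyph 0 is .notdef and is always kept
    }

    // Existing style of entry point: the glyph is found via its code point.
    void add(int codePoint) {
        Integer gid = cmap.get(codePoint);
        if (gid != null) {
            glyphIds.add(gid);
        }
    }

    // Proposed addition: glyphs produced by GSUB may have no code point at all,
    // so they have to be registered by glyph id.
    void addGlyphId(int gid) {
        glyphIds.add(gid);
    }

    SortedSet<Integer> glyphIds() {
        return Collections.unmodifiableSortedSet(glyphIds);
    }
}
```

In this model showText would call add(codePoint) for ordinary characters and addGlyphId(gid) for every glyph that came out of a substitution.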

*Glyph IDs:* The JDK represents text which has been through OpenType layout as 
a 
[GlyphVector|https://docs.oracle.com/javase/7/docs/api/java/awt/font/GlyphVector.html]
 which encapsulates substitutions via GID and positioning via a transform 
associated with each glyph. PDFBox might want to do something similar. I think 
it would even be ok to add this to PDType0Font (because I'm suggesting a 
specific OpenType API, it doesn't interfere with PDType0Font's non-OpenType 
assumption) in the form of a method such as {{final PDFGlyphVector 
layout(String text)}}, which is called from PDPageContentStream#showText 
instead of encode(text). I also think it would be fine to use instanceof to 
detect this case, because only PDType0Font need have this capability. I'm 
assuming PDFGlyphVector is our own very simple version of the JDK's 
GlyphVector, which is effectively just a vector of (gid, dx, dy) tuples. Then 
all that PDPageContentStream#showText needs to know how to do is to draw a 
PDFGlyphVector on the page, by converting it into the equivalent text drawing 
operations (Tj and the like). Because this patch is just for GSUB, all of those 
positioning values can just be zero, and we need not implement any actual glyph 
positioning in showText() yet :). Thus the glyph vector will serve simply as an 
array of GIDs.
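
To give a feel for how small such a type could be, here is a hedged Java sketch of a (gid, dx, dy) vector and its conversion to a Tj operator. The class shapes are guesses, not a proposal for the final API. The conversion assumes the font is embedded with the Identity-H encoding and an identity CID-to-GID mapping, so that each glyph id can be written as a two-byte big-endian code; non-zero offsets are rejected because positioning is out of scope here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// One shaped glyph: glyph id plus a positioning offset (always zero in this sketch).
final class PDFGlyph {
    final int gid;
    final float dx;
    final float dy;

    PDFGlyph(int gid, float dx, float dy) {
        this.gid = gid;
        this.dx = dx;
        this.dy = dy;
    }
}

// Hypothetical, very small counterpart to java.awt.font.GlyphVector.
final class PDFGlyphVector {
    private final List<PDFGlyph> glyphs;

    PDFGlyphVector(List<PDFGlyph> glyphs) {
        this.glyphs = Collections.unmodifiableList(new ArrayList<>(glyphs));
    }

    // GSUB-only case: every offset is zero.
    static PDFGlyphVector ofGlyphIds(int... gids) {
        List<PDFGlyph> list = new ArrayList<>();
        for (int gid : gids) {
            list.add(new PDFGlyph(gid, 0f, 0f));
        }
        return new PDFGlyphVector(list);
    }

    List<PDFGlyph> glyphs() {
        return glyphs;
    }

    // Assumes Identity-H with an identity CID-to-GID map: code == gid, two bytes each.
    String toShowTextOperator() {
        StringBuilder sb = new StringBuilder("<");
        for (PDFGlyph g : glyphs) {
            if (g.dx != 0f || g.dy != 0f) {
                throw new UnsupportedOperationException("glyph positioning is not sketched here");
            }
            if (g.gid < 0 || g.gid > 0xFFFF) {
                throw new IllegalArgumentException("gid out of two-byte range: " + g.gid);
            }
            sb.append(String.format("%04X", g.gid));
        }
        return sb.append("> Tj").toString();
    }
}
```

For example, PDFGlyphVector.ofGlyphIds(50, 900).toShowTextOperator() yields "<00320384> Tj". Once GPOS support is added, the same vector could be turned into a TJ array with per-glyph adjustments instead.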

Phew! That was a lot of information. Just to be clear, the current patch is not 
compatible with subsetting without making some changes. P.S. Make sure any new 
APIs are {{final}}. All of the suggestions above consist of adding only 
non-breaking APIs, which is nice.

Thanks!


[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:57 AM:
---

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (but we can 
relax this a bit, as I explain below).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you want to 
put your glyph substitution code inside PDPageContentStream#showText, or more 
precisely inside 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for subsetting, which is not possible in 
encode().

*Subsetting*: We currently track which glyphs need to be included in a subset 
by using their Unicode code point, but with GSUB enabled we will have to keep 
track of some substituted glyphs via their glyph id (GID), because the glyphs 
which result from a substitution don't necessarily have their own code points 
(no entry in the cmap table). This should be easy to add to TTFSubsetter as it 
already tracks glyph ids internally; we just need the ability to pass them in 
too, e.g. addGlyphId(integer). Then PDPageContentStream#showText will be 
responsible for passing the glyph ids. But now we need showText to know about 
those glyph ids, which leads me to the next point.

*Glyph IDs:* The JDK represents text which has been through OpenType layout as 
a 
[GlyphVector|https://docs.oracle.com/javase/7/docs/api/java/awt/font/GlyphVector.html]
 which encapsulates substitutions via GID and positioning via a transform 
associated with each glyph. PDFBox might want to do something similar. I think 
it would even be ok to add this to PDType0Font (because I'm suggesting a 
specific OpenType API, it doesn't interfere with PDType0Font's non-OpenType 
assumption) in the form of a method such as {{final PDFGlyphVector 
layout(String text)}}, which is called from PDPageContentStream#showText 
instead of encode(text). I also think it would be fine to use instanceof to 
detect this case, because only PDType0Font need have this capability. I'm 
assuming PDFGlyphVector is our own very simple version of the JDK's 
GlyphVector, which is effectively just a vector of (gid, dx, dy) tuples. Then 
all that PDPageContentStream#showText needs to know how to do is to draw a 
PDFGlyphVector on the page, by converting it into the equivalent text drawing 
operations (Tj and the like). Because this patch is just for GSUB, all of those 
positioning values can just be zero, and we need not implement any actual glyph 
positioning in showText() yet :). Thus the glyph vector will serve simply as an 
array of GIDs.

Phew! That was a lot of information. Just to be clear, the current patch is not 
compatible with subsetting without making some changes. P.S. Make sure any new 
APIs are {{final}}.


was (Author: jahewson):
Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (but we can 
relax this a bit, as I explain below).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 

[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:52 AM:
---

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (but we can 
relax this a bit, as I explain below).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for subsetting, which is not possible in 
encode().
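
As an illustration of why substitution operates on glyph ids rather than on encoded bytes, here is a rough sketch of one kind of GSUB substitution, a ligature lookup, applied to a list of glyph ids. It is not the code from the patch and not a FontBox API: the class LigatureSubstituter, its methods and the glyph ids are all made up, and a real GSUB implementation also has to deal with scripts, features, lookup order and other lookup types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical ligature-style substitution: a sequence of glyph ids maps to one replacement id. */
final class LigatureSubstituter {
    private final Map<List<Integer>, Integer> ligatures = new HashMap<>();
    private int maxLength = 1;

    void addLigature(int replacementGid, Integer... sequence) {
        // Copy the sequence so later changes to the caller's array cannot alter the map key.
        ligatures.put(new ArrayList<>(Arrays.asList(sequence)), replacementGid);
        maxLength = Math.max(maxLength, sequence.length);
    }

    /** Greedy left-to-right pass that prefers the longest matching sequence. */
    List<Integer> apply(List<Integer> gids) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < gids.size()) {
            boolean replaced = false;
            for (int len = Math.min(maxLength, gids.size() - i); len >= 2 && !replaced; len--) {
                Integer ligature = ligatures.get(gids.subList(i, i + len));
                if (ligature != null) {
                    out.add(ligature);
                    i += len;
                    replaced = true;
                }
            }
            if (!replaced) {
                out.add(gids.get(i));
                i++;
            }
        }
        return out;
    }
}

class LigatureDemo {
    public static void main(String[] args) {
        LigatureSubstituter gsub = new LigatureSubstituter();
        gsub.addLigature(500, 10, 11); // glyphs 10 + 11 become glyph 500 (made-up ids)
        System.out.println(gsub.apply(Arrays.asList(10, 11, 12)));
    }
}
```

The replacement glyph (500 in the demo) typically has no Unicode code point of its own, which is the reason the result has to be carried around as glyph ids.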

*Subsetting*: We currently track which glyphs need to be included in a subset 
by using their Unicode code point, but with GSUB enabled we will have to keep 
track of some substituted glyphs via their glyph id (GID), because the glyphs 
which result from a substitution don't necessarily have their own code points 
(no entry in the cmap table). This should be easy to add to TTFSubsetter as it 
already tracks glyph ids internally, we just need the ability to pass them in 
too, e.g. addGlyphId(integer). Then PDPageContentStream#showText will be 
responsible for passing the glyph ids. But now we need showText to know about 
those glyph ids, which leads me to

*Glyph IDs:* The JDK represents text which has been through OpenType layout as 
a 
[GlyphVector|https://docs.oracle.com/javase/7/docs/api/java/awt/font/GlyphVector.html]
 which encapsulates substitutions via GID and positioning via a transform 
associated with each glyph. PDFBox might want to do something similar, I think 
it would even be ok to add this to PDType0Font (because I'm suggesting a 
specific OpenType API so it doesn't interfere with our PDType0Font's 
non-OpenType assumption) in the form of a method such as: {{PDFGlyphVector 
layout(String text)}} which is called from PDPageContentStream#showText instead 
of encode(text). I also think it would be fine to use instanceof to detect this 
case, because only PDType0Font need have this capability. I'm assuming 
PDFGlyphVector is our own very simple version of the JDK's GlyphVector, which 
is effectively just a vector of (gid, dx, dy) tuples. Then all that 
PDPageContentStream#showText needs to know how to do is to draw a 
PDFGlyphVector on the page, by converting it into the equivalent text drawing 
operations (Tj and the like).
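
A small sketch of that last step, under some loud assumptions: the font is written with Identity-H encoding and an identity CID-to-GID mapping, so the two-byte code in the content stream equals the glyph id, and all positioning offsets are zero. The class GlyphRunWriter and its methods are hypothetical and only illustrate how drawing and subset tracking could share the same glyph ids.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: turns glyph ids into a Tj operation and records them for subsetting. */
final class GlyphRunWriter {
    private final Set<Integer> usedGlyphIds = new TreeSet<>();

    /** Builds a hex string operand with one two-byte code per glyph id, followed by Tj. */
    String showGlyphs(int[] gids) {
        StringBuilder sb = new StringBuilder("<");
        for (int gid : gids) {
            if (gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("glyph id out of two-byte range: " + gid);
            }
            sb.append(String.format("%04X", gid));
            // Remember the glyph so that a subsetter could be told to keep it.
            usedGlyphIds.add(gid);
        }
        return sb.append("> Tj").toString();
    }

    Set<Integer> getUsedGlyphIds() {
        return usedGlyphIds;
    }
}

class GlyphRunDemo {
    public static void main(String[] args) {
        GlyphRunWriter writer = new GlyphRunWriter();
        System.out.println(writer.showGlyphs(new int[] {57, 123}));
        System.out.println(writer.getUsedGlyphIds());
    }
}
```

The recorded set is what something like the suggested addGlyphId on TTFSubsetter would consume, once per glyph id.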

Phew! That was a lot of information. Just to be clear, the current patch is not 
compatible with subsetting without making some changes.


was (Author: jahewson):
Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (but we can 
relax this a bit, as I explain below).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for 

[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:51 AM:
---

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (but we can 
relax this a bit, as I explain below).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for subsetting, which is not possible in 
encode().

*Subsetting*: We currently track which glyphs need to be included in a subset 
by using their Unicode code point, but with GSUB enabled we will have to keep 
track of some substituted glyphs via their glyph id (GID), because the glyphs 
which result from a substitution don't necessarily have their own code points 
(no entry in the cmap table). This should be easy to add to TTFSubsetter as it 
already tracks glyph ids internally, we just need the ability to pass them in 
too, e.g. addGlyphId(integer). Then PDPageContentStream#showText will be 
responsible for passing the glyph ids. But now we need showText to know about 
those glyph ids, which leads me to

*Glyph IDs:* The JDK represents text which has been through OpenType layout as 
a 
[GlyphVector|https://docs.oracle.com/javase/7/docs/api/java/awt/font/GlyphVector.html]
 which encapsulates substitutions via GID and positioning via a transform 
associated with each glyph. PDFBox might want to do something similar, I think 
it would even be ok to add this to PDType0Font (because I'm suggesting a 
specific OpenType API so it doesn't interfere with our PDType0Font's 
non-OpenType assumption) in the form of a method such as: {{PDFGlyphVector 
layout(String text)}} which is called from PDPageContentStream#showText. I also 
think it would be fine to use instanceof to detect this case, because only 
PDType0Font need have this capability. I'm assuming PDFGlyphVector is our own 
very simple version of the JDK's GlyphVector, which is effectively just a 
vector of (gid, dx, dy) tuples. Then all that PDPageContentStream#showText 
needs to know how to do is to draw a PDFGlyphVector on the page, by converting 
it into the equivalent text drawing operations (Tj and the like).

Phew! That was a lot of information. Just to be clear, the current patch is not 
compatible with subsetting without making some changes.


was (Author: jahewson):
Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (but we can 
relax this a bit, as I explain below).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for subsetting, which is not 

[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:49 AM:
---

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (but we can 
relax this a bit, as I explain below).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for subsetting, which is not possible in 
encode().

*Subsetting*: We currently track which glyphs need to be included in a subset 
by using their Unicode code point, but with GSUB enabled we will have to keep 
track of some substituted glyphs via their glyph id (GID), because the glyphs 
which result from a substitution don't necessarily have their own code points 
(no entry in the cmap table). This should be easy to add to TTFSubsetter as it 
already tracks glyph ids internally, we just need the ability to pass them in 
too. Then PDPageContentStream#showText will be responsible for passing the 
glyph ids. But now we need showText to know about those glyph ids, which leads 
me to

*Glyph IDs:* The JDK represents text which has been through OpenType layout as 
a 
[GlyphVector|https://docs.oracle.com/javase/7/docs/api/java/awt/font/GlyphVector.html]
 which encapsulates substitutions via GID and positioning via a transform 
associated with each glyph. PDFBox might want to do something similar, I think 
it would even be ok to add this to PDType0Font (because I'm suggesting a 
specific OpenType API so it doesn't interfere with our PDType0Font's 
non-OpenType assumption) in the form of a method such as: {{PDFGlyphVector 
layout(String text)}} which is called from PDPageContentStream#showText. I also 
think it would be fine to use instanceof to detect this case, because only 
PDType0Font need have this capability. I'm assuming PDFGlyphVector is our own 
very simple version of the JDK's GlyphVector, which is effectively just a 
vector of (gid, dx, dy) tuples. Then all that PDPageContentStream#showText 
needs to know how to do is to draw a PDFGlyphVector on the page, by converting 
it into the equivalent text drawing operations (Tj and the like).

Phew! That was a lot of information. Just to be clear, the current patch is not 
compatible with subsetting without making some changes.


was (Author: jahewson):
Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (but we can 
relax this a bit, as I explain below).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for subsetting, which is not possible in 
encode().


[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:48 AM:
---

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (but we can 
relax this a bit, as I explain below).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for subsetting, which is not possible in 
encode().

*Subsetting*: We currently track which glyphs need to be included in a subset 
by using their Unicode code point, but with GSUB enabled we will have to keep 
track of some substituted glyphs via their glyph id (GID), because the glyphs 
which result from a substitution don't necessarily have their own code points 
(and so have no entry in the cmap table). This should be easy to add to 
TTFSubsetter as it already tracks glyph ids internally, we just need the 
ability to pass them in too. Then PDPageContentStream#showText will be 
responsible for passing the glyph ids. But now we need showText to know about 
those glyph ids, which leads me to

*Glyph IDs:* The JDK represents text which has been through OpenType layout as 
a 
[GlyphVector|https://docs.oracle.com/javase/7/docs/api/java/awt/font/GlyphVector.html]
 which encapsulates substitutions via GID and positioning via a transform 
associated with each glyph. PDFBox might want to do something similar, I think 
it would even be ok to add this to PDType0Font (because I'm suggesting a 
specific OpenType API so it doesn't interfere with our PDType0Font's 
non-OpenType assumption) in the form of a method such as: {{PDFGlyphVector 
layout(String text)}} which is called from PDPageContentStream#showText. I also 
think it would be fine to use instanceof to detect this case, because only 
PDType0Font need have this capability. I'm assuming PDFGlyphVector is our own 
very simple version of the JDK's GlyphVector, which is effectively just a 
vector of (gid, dx, dy) tuples. Then all that PDPageContentStream#showText 
needs to know how to do is to draw a PDFGlyphVector on the page, by converting 
it into the equivalent text drawing operations (Tj and the like).

Phew! That was a lot of information. Just to be clear, the current patch is not 
compatible with subsetting without making some changes.


was (Author: jahewson):
Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType.

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for subsetting, which is not possible in 
encode().

*Subsetting*: We currently track which 

[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:47 AM:
---

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType.

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*OpenType*: In general, OpenType layouts consist of glyph _substitutions_ (via 
GSUB) and _positionings_ (via GPOS). Obviously it's not possible to handle 
positionings in PDFont#encode(), so that helps explain why showText() is the 
right place for OpenType, as showText performs both positioning and encoding. 
We also need to keep track of glyphs for subsetting, which is not possible in 
encode().

*Subsetting*: We currently track which glyphs need to be included in a subset 
by using their Unicode code point, but with GSUB enabled we will have to keep 
track of some substituted glyphs via their glyph id (GID), because the glyphs 
which result from a substitution don't necessarily have their own code points 
(and so have no entry in the cmap table). This should be easy to add to 
TTFSubsetter as it already tracks glyph ids internally, we just need the 
ability to pass them in too. Then PDPageContentStream#showText will be 
responsible for passing the glyph ids. But now we need showText to know about 
those glyph ids, which leads me to

*Glyph IDs:* The JDK represents text which has been through OpenType layout as 
a 
[GlyphVector|https://docs.oracle.com/javase/7/docs/api/java/awt/font/GlyphVector.html]
 which encapsulates substitutions via GID and positioning via a transform 
associated with each glyph. PDFBox might want to do something similar, I think 
it would even be ok to add this to PDType0Font (because I'm suggesting a 
specific OpenType API so it doesn't interfere with our PDType0Font's 
non-OpenType assumption) in the form of a method such as: {{PDFGlyphVector 
layout(String text)}} which is called from PDPageContentStream#showText. I also 
think it would be fine to use instanceof to detect this case, because only 
PDType0Font need have this capability. I'm assuming PDFGlyphVector is our own 
very simple version of the JDK's GlyphVector, which is effectively just a 
vector of (gid, dx, dy) tuples. Then all that PDPageContentStream#showText 
needs to know how to do is to draw a PDFGlyphVector on the page, by converting 
it into the equivalent text drawing operations (Tj and the like).

Phew! That was a lot of information. Just to be clear, the current patch is not 
compatible with subsetting without making some changes.


was (Author: jahewson):
Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType.

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*Technical Background*: In general, OpenType layouts consist of glyph 
_substitutions_ (via GSUB) and _positionings_ (via GPOS). Obviously it's not 
possible to handle positionings in PDFont#encode(), so that helps explain why 
showText() is the right place for OpenType, as showText performs both 
positioning and encoding.


[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:17 AM:
---

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType.

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*Technical Background*: In general, OpenType layouts consist of 
glyph _substitutions_ (via GSUB) and _positionings_ (via GPOS). Obviously it's 
not possible to handle positionings in PDFont#encode(), so that helps explain 
why showText() is the right place for OpenType, as showText performs both 
positioning and encoding.


was (Author: jahewson):
Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType.

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

In general, OpenType layouts consist of glyph _substitutions_ (via GSUB) and 
_positionings_ (via GPOS). Obviously it's not possible to handle positionings 
in PDFont#encode(), so that helps explain why showText() is the right place for 
OpenType, as showText performs both positioning and encoding.





[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:17 AM:
---

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType.

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*Technical Background*: In general, OpenType layouts consist of glyph 
_substitutions_ (via GSUB) and _positionings_ (via GPOS). Obviously it's not 
possible to handle positionings in PDFont#encode(), so that helps explain why 
showText() is the right place for OpenType, as showText performs both 
positioning and encoding.


was (Author: jahewson):
Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType.

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)

*Technical Background*: In general, OpenType layouts consist of 
glyph_substitutions_ (via GSUB) and _positionings_ (via GPOS). Obviously it's 
not possible to handle positionings in PDFont#encode(), so that helps explain 
why showText() is the right place for OpenType, as showText performs both 
positioning and encoding.




[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:16 AM:
---

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType.

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay final :)

In general, OpenType layouts consist of glyph _substitutions_ (via GSUB) and 
_positionings_ (via GPOS). Obviously it's not possible to handle positionings 
in PDFont#encode(), so that helps explain why showText() is the right place for 
OpenType, as showText performs both positioning and encoding.


was (Author: jahewson):
Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType.

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)




[jira] [Comment Edited] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson edited comment on PDFBOX-4189 at 4/15/18 12:11 AM:
---

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType.

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay final :)


was (Author: jahewson):
Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (by design).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
actually you want 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay non-final :)
