Re: [HarfBuzz] Why harfbuzz isn't/couldn't/shouldn't provide separate [optional] API for glyph/positioning?

2018-03-07 Thread Behdad Esfahbod
On Sun, Feb 25, 2018 at 10:46 PM, Nikolay Sivov 
wrote:

> On 2/26/2018 5:28 AM, Behdad Esfahbod wrote:
> >
> > Two things stand out:
> >
> >   - There's a lot of duplicate info going into both calls,
> >
> >   - There's also a lot of data coming out of the first call just to go
> > directly into the second; namely pCharProps and pGlyphProps.
> >
> > Those two very strongly suggest that the two calls are part of the same
> > larger operation and rather forcefully separated.
>
> One example of such a larger operation is ScriptStringAnalyse(), except
> that it's pre-*OpenType() and thus does not support feature ranges.
>
> To understand this separation better, if not to justify it: would it
> make sense if the idea was to allow changing the font size, or toggling
> GPOS features, without reprocessing the whole input text buffer? The
> resulting glyph array won't change at that point anyway.
>

Changing font size initially sounds compelling. I have had that in mind for
HarfBuzz too. But in reality, no system is going to use that. It's hard
enough to keep track of input and shaped glyphstrings already. Many systems
throw that away and reshape as needed.  It's just not worth it.


> The DirectWrite call is cleaner in that sense, because
> GetGlyphPlacements() takes a separate size argument, as opposed to just
> using the current font in the HDC (or cache).
>
> ...
>
> >
> > Separating the calls also means that some things, like which OpenType
> > feature applies to what range, need to be recalculated. Guess that's
> > not a huge deal. The biggest problem with separating the calls in a way
> > that is useful for Wine implementing the Uniscribe API on top is that we
> > have to expose the buffer-internal bit allocations. And we don't want to
> > do that, because that is an implementation detail and changes over time.
>
> Actually, I looked again last year at using hb_buffer for DirectWrite
> in Wine, and since I couldn't find any way to fill the buffer with
> resulting glyphs as opposed to text, I realized that it won't be easy,
> if possible at all.
>

It definitely *is* possible to split the hb_shape() call into two. There
are some minor complexities, but those can be resolved. Channeling the
entirety of hb_glyph_info_t through the Uniscribe / DirectWrite GlyphProps
API might be harder, though.  I haven't fully checked the DirectWrite API.
If I split hb_shape() and write ScriptShapeOpenType / ScriptPlaceOpenType
around the two halves, would that be enough to get you going? It might be
harder with ScriptShape / ScriptPlace, which have fewer slots to carry
info, but then again they don't have OpenType features, so less data needs
to be channeled through as well.  It might be doable after all.


> P.S. Behdad, how do you test things? Do you have a large set of texts +
> fonts you run against, beyond what's in /test of the hb tree, I mean?
> Since hb-shape can also use Uniscribe or DirectWrite, it would be
> helpful to have such data to test Wine on.
>

Check out my writeup and talk:
https://goo.gl/9eWCLy
https://www.youtube.com/watch?v=sMkO4gF4-3U

The input data is at:
https://github.com/harfbuzz/harfbuzz-testing-wikipedia

I have a few local scripts that run this and diff against pre-recorded
output of Uniscribe, for a set of fonts. Mine is just the default MS font
for each Indic script. That's what the numbers we put in the commits are
about:

BENGALI: 353725 out of 354188 tests passed. 463 failed (0.130722%)
DEVANAGARI: 707307 out of 707394 tests passed. 87 failed (0.0122987%)
GUJARATI: 366355 out of 366457 tests passed. 102 failed (0.0278341%)
GURMUKHI: 60729 out of 60747 tests passed. 18 failed (0.0296311%)
KANNADA: 951300 out of 951913 tests passed. 613 failed (0.0643966%)
KHMER: 299071 out of 299124 tests passed. 53 failed (0.0177184%)
MALAYALAM: 1048136 out of 1048334 tests passed. 198 failed (0.0188871%)
ORIYA: 42320 out of 42329 tests passed. 9 failed (0.021262%)
SINHALA: 271662 out of 271847 tests passed. 185 failed (0.068053%)
TAMIL: 1091754 out of 1091754 tests passed. 0 failed (0%)
TELUGU: 970555 out of 970573 tests passed. 18 failed (0.00185457%)

I should make it possible for others to reproduce these.

Jonathan Kew also built a portal running on Amazon AWS that compared
Uniscribe and HarfBuzz outputs on the fly and generated a browsable
dashboard of the diffs. It wasn't fully productionized. It's worth picking
up again.

The main problem is that the output generated from these test suites is
massive. Just storing it takes a lot of resources. So it's most feasible
to run the two backends side-by-side and only print out the diffs.
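
A minimal sketch of that "only print out the diffs" approach, in C. It is
not the actual local scripts mentioned above; diff_outputs() and
format_summary() are hypothetical helpers, and the shaper output is assumed
to be one string per test. The summary line merely imitates the format of
the figures quoted earlier.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: compares n lines of pre-recorded reference output
 * against n lines of freshly generated output and prints only the
 * mismatches. Returns the number of failed tests. */
static long diff_outputs(const char *const *expected,
                         const char *const *actual,
                         long n, FILE *out)
{
    long failed = 0;
    for (long i = 0; i < n; i++) {
        if (strcmp(expected[i], actual[i]) != 0) {
            fprintf(out, "test %ld:\n- %s\n+ %s\n", i, expected[i], actual[i]);
            failed++;
        }
    }
    return failed;
}

/* Hypothetical helper: formats a per-script summary line in the style of
 * the figures above. The percentage is failed / total * 100. */
static int format_summary(char *buf, size_t size, const char *script,
                          long total, long failed)
{
    double percent = total > 0 ? 100.0 * (double)failed / (double)total : 0.0;
    return snprintf(buf, size,
                    "%s: %ld out of %ld tests passed. %ld failed (%g%%)",
                    script, total - failed, total, failed, percent);
}
```

Calling diff_outputs() on the two backends' outputs and then passing the
returned count to format_summary() keeps storage needs proportional to the
number of mismatches rather than to the size of the corpus.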

-- 
behdad
http://behdad.org/
___
HarfBuzz mailing list
HarfBuzz@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/harfbuzz


Re: [HarfBuzz] Why harfbuzz isn't/couldn't/shouldn't provide separate [optional] API for glyph/positioning?

2018-02-25 Thread Nikolay Sivov
On 2/26/2018 5:28 AM, Behdad Esfahbod wrote:
> 
> Two things stand out:
> 
>   - There's a lot of duplicate info going into both calls,
> 
>   - There's also a lot of data coming out of the first call just to go
> directly into the second; namely pCharProps and pGlyphProps.
> 
> Those two very strongly suggest that the two calls are part of the same
> larger operation and rather forcefully separated.

One example of such a larger operation is ScriptStringAnalyse(), except
that it's pre-*OpenType() and thus does not support feature ranges.

To understand this separation better, if not to justify it: would it
make sense if the idea was to allow changing the font size, or toggling
GPOS features, without reprocessing the whole input text buffer? The
resulting glyph array won't change at that point anyway.

The DirectWrite call is cleaner in that sense, because
GetGlyphPlacements() takes a separate size argument, as opposed to just
using the current font in the HDC (or cache).

...

> 
> Separating the calls also means that some things, like which OpenType
> feature applies to what range, need to be recalculated. Guess that's
> not a huge deal. The biggest problem with separating the calls in a way
> that is useful for Wine implementing the Uniscribe API on top is that we
> have to expose the buffer-internal bit allocations. And we don't want to
> do that, because that is an implementation detail and changes over time.

Actually, I looked again last year at using hb_buffer for DirectWrite
in Wine, and since I couldn't find any way to fill the buffer with
resulting glyphs as opposed to text, I realized that it won't be easy,
if possible at all.

P.S. Behdad, how do you test things? Do you have a large set of texts +
fonts you run against, beyond what's in /test of the hb tree, I mean?
Since hb-shape can also use Uniscribe or DirectWrite, it would be
helpful to have such data to test Wine on.


Re: [HarfBuzz] Why harfbuzz isn't/couldn't/shouldn't provide separate [optional] API for glyph/positioning?

2018-02-25 Thread Behdad Esfahbod
Hi Ebrahim,

On Sat, Feb 24, 2018 at 11:33 AM, Ebrahim Byagowi 
wrote:

> About why "isn't", I guess harfbuzz was developed before DirectWrite,
>

That's not the reason. Uniscribe API also had the separation. Initially I
had wanted to allow it, but eventually didn't. Read on.



> but I'd like to know if a separate API for substitution and positioning
> is a possibility? Or is accepting glyphs as input instead [and later, as
> an optimization, hb_shape without positioning] a possibility? Have a look
> at hb-directwrite's GetGlyphs and GetGlyphPlacements.
>

If you look at the Uniscribe APIs for Shape & Place:

HRESULT ScriptShapeOpenType(
  _In_opt_       HDC                  hdc,
  _Inout_        SCRIPT_CACHE         *psc,
  _Inout_        SCRIPT_ANALYSIS      *psa,
  _In_           OPENTYPE_TAG         tagScript,
  _In_           OPENTYPE_TAG         tagLangSys,
  _In_opt_       int                  *rcRangeChars,
  _In_opt_       TEXTRANGE_PROPERTIES **rpRangeProperties,
  _In_           int                  cRanges,
  _In_     const WCHAR                *pwcChars,
  _In_           int                  cChars,
  _In_           int                  cMaxGlyphs,
  _Out_          WORD                 *pwLogClust,
  _Out_          SCRIPT_CHARPROP      *pCharProps,
  _Out_          WORD                 *pwOutGlyphs,
  _Out_          SCRIPT_GLYPHPROP     *pOutGlyphProps,
  _Out_          int                  *pcGlyphs
);


HRESULT ScriptPlaceOpenType(
  _In_opt_       HDC                  hdc,
  _Inout_        SCRIPT_CACHE         *psc,
  _Inout_        SCRIPT_ANALYSIS      *psa,
  _In_           OPENTYPE_TAG         tagScript,
  _In_           OPENTYPE_TAG         tagLangSys,
  _In_opt_       int                  *rcRangeChars,
  _In_opt_       TEXTRANGE_PROPERTIES **rpRangeProperties,
  _In_           int                  cRanges,
  _In_     const WCHAR                *pwcChars,
  _In_           WORD                 *pwLogClust,
  _In_           SCRIPT_CHARPROP      *pCharProps,
  _In_           int                  cChars,
  _In_     const WORD                 *pwGlyphs,
  _In_     const SCRIPT_GLYPHPROP     *pGlyphProps,
  _In_           int                  cGlyphs,
  _Out_          int                  *piAdvance,
  _Out_          GOFFSET              *pGoffset,
  _Out_opt_      ABC                  *pABC
);

Two things stand out:

  - There's a lot of duplicate info going into both calls,

  - There's also a lot of data coming out of the first call just to go
directly into the second; namely pCharProps and pGlyphProps.

Those two very strongly suggest that the two calls are part of the
same larger operation and rather forcefully separated.

We can do the same separation in HarfBuzz. We also have lots of data
that should come out of the first call and go into the second call to
make that possible.  Some of that even matches the data Uniscribe is
passing.  In our case, to reconstruct the buffer in the second call we
need the following buffer-internal info:

/* buffer var allocations, used during the entire shaping process */
#define unicode_props()     var2.u16[0]

/* buffer var allocations, used during the GSUB/GPOS processing */
#define glyph_props()       var1.u16[0] /* GDEF glyph properties */
#define lig_props()         var1.u8[2] /* GSUB/GPOS ligature tracking */
#define syllable()          var1.u8[3] /* GSUB/GPOS shaping boundaries */


The syllable() is only used during shaping; so that's not needed for
positioning.  The lig_props is needed to correctly attach marks to
their ligature components. Uniscribe should be hiding that info
somewhere in those Reserved bits it passes.  Looks like we need 40
bits per glyph to be passed between the two calls to make this
possible without significant restructuring.
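
As a rough illustration of where the 40 bits presumably come from: the
three fields still needed for positioning are 16 + 16 + 8 bits wide
(unicode_props, glyph_props, lig_props), with syllable() left out. The
sketch below packs them into one integer and back. glyph_state_t and both
functions are hypothetical names for this example, not HarfBuzz API.

```c
#include <stdint.h>

/* Hypothetical carrier for the per-glyph state named above. The field
 * widths follow the var1/var2 allocations (u16, u16, u8). */
typedef struct {
    uint16_t unicode_props; /* var2.u16[0] */
    uint16_t glyph_props;   /* var1.u16[0], GDEF glyph properties */
    uint8_t  lig_props;     /* var1.u8[2], ligature tracking */
} glyph_state_t;

enum { GLYPH_STATE_BITS = 16 + 16 + 8 }; /* 40 bits per glyph */

/* Packs the three fields into the low 40 bits of a 64-bit integer. */
static uint64_t glyph_state_pack(glyph_state_t s)
{
    return (uint64_t)s.unicode_props
         | ((uint64_t)s.glyph_props << 16)
         | ((uint64_t)s.lig_props << 32);
}

/* Recovers the three fields from a value made by glyph_state_pack(). */
static glyph_state_t glyph_state_unpack(uint64_t bits)
{
    glyph_state_t s;
    s.unicode_props = (uint16_t)(bits & 0xFFFF);
    s.glyph_props   = (uint16_t)((bits >> 16) & 0xFFFF);
    s.lig_props     = (uint8_t)((bits >> 32) & 0xFF);
    return s;
}
```

Any transport between the two calls would have to carry at least this
much per glyph, whatever the actual encoding ends up being.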

I mean, sure, I can split hb_ot_shape() into two calls as long as you
take the buffer from the first and pass it straight to the second. But to
funnel that buffer through the Uniscribe API boundary, we need to pass
those 40 bits somewhere in the Uniscribe structs:

typedef struct script_charprop {
  WORD fCanGlyphAlone  :1;
  WORD reserved  :15;
} SCRIPT_CHARPROP;

This one is per character, while we work mostly per glyph. So it might or
might not be useful. It's interesting how fCanGlyphAlone is similar to our
unsafe_to_break, but modeled differently.

typedef struct script_glyphprop {
  SCRIPT_VISATTR sva;
  WORD   reserved;
} SCRIPT_GLYPHPROP;
typedef struct tag_SCRIPT_VISATTR {
  WORD uJustification  :4;
  WORD fClusterStart  :1;
  WORD fDiacritic  :1;
  WORD fZeroWidth  :1;
  WORD fReserved  :1;
  WORD fShapeReserved  :8;
} SCRIPT_VISATTR;
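
For a rough sense of the budget, the reserved fields in the structs above
can simply be counted. This is only an upper bound on the spare room,
under the assumption that WORD is 16 bits (as on Windows) and that every
reserved bit were actually free, which Uniscribe does not promise. The
constant and function names are hypothetical.

```c
/* Bit widths copied from the struct definitions above. */
enum {
    CHARPROP_RESERVED_BITS     = 15, /* SCRIPT_CHARPROP.reserved, per char */
    VISATTR_FRESERVED_BITS     = 1,  /* SCRIPT_VISATTR.fReserved */
    VISATTR_SHAPERESERVED_BITS = 8,  /* SCRIPT_VISATTR.fShapeReserved */
    GLYPHPROP_RESERVED_BITS    = 16, /* SCRIPT_GLYPHPROP.reserved (WORD) */
    NEEDED_BITS_PER_GLYPH      = 40  /* figure quoted in the message */
};

/* Upper bound on reserved bits carried per glyph in SCRIPT_GLYPHPROP. */
static int glyph_reserved_bits(void)
{
    return VISATTR_FRESERVED_BITS
         + VISATTR_SHAPERESERVED_BITS
         + GLYPHPROP_RESERVED_BITS;
}

/* How many of the needed bits would not fit per glyph, if any. */
static int glyph_bits_shortfall(void)
{
    int spare = glyph_reserved_bits();
    return NEEDED_BITS_PER_GLYPH > spare ? NEEDED_BITS_PER_GLYPH - spare : 0;
}
```

Under those assumptions SCRIPT_GLYPHPROP offers at most 25 reserved bits
per glyph, 15 short of 40. SCRIPT_CHARPROP has 15 more, but those are per
character rather than per glyph.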

The SCRIPT_GLYPHPROP is unique to the OpenType() flavor of the
Uniscribe calls. The