Re: [whatwg] Google Feedback on the HTML5 media a11y specifications

Philip Jägenstedt Tue, 15 Feb 2011 02:09:29 -0800

On Tue, 15 Feb 2011 04:28:36 +0100, Silvia Pfeiffer <[email protected]> wrote:

Hi Philip,
On Tue, Feb 15, 2011 at 3:27 AM, Philip Jägenstedt <[email protected]> wrote:
On Wed, 09 Feb 2011 03:57:37 +0100, Silvia Pfeiffer
<[email protected]> wrote:
A. Feedback on the WebVTT format
1. Introduce file-wide metadata

WebVTT requires a structure to add header-style metadata. We are here
talking about lists of name-value pairs as typically in use for header
information. The metadata can be optional, but we need a defined means
of adding them.
Required attributes in WebVTT files should be the main language in use
and the kind of data found in the WebVTT file - information that is
currently provided in the <track> element by the @srclang and @kind
attributes. These are necessary to allow the files to be interpreted
correctly by non-browser applications, for transcoding or to determine
if a file was created as a caption file or something else, in
particular the @kind=metadata. @srclang also sets the base
directionality for BiDi calculations.
Are there non-browsers that use the language for font-selection or bidi?
Is auto-detection not likely to give a better user experience? Are there
any other use cases for knowing the language of the captions *after*
they've been opened?
I can't see a different way to let non-browser applications know what
font to choose, even how to provide the user with a menu of available
caption tracks for a video, or to set the base directionality for
BiDi. Also, language auto-detection is a huge burden to put onto
non-browser applications. Having a readable language tag at the
beginning of the file is useful to quickly figure it all out.

The language set in <track> would certainly overrule what is in the
file. Also, the last language attribute in the header would probably
win.

I guess it would also be ok to have language and kind optional -
different applications may then default to interpreting WebVTT files
differently, such as by default English and Captions - or English and
Descriptions, but that's probably acceptable from context.
Given that most existing subtitle formats don't have any language
metadata, I'm a bit skeptical. However, if implementors of non-browser
players want to implement WebVTT and ask for this I won't stand in the
way (not that I could if I wanted to). For simplicity, I'd prefer the
language metadata from the file to not have any effect on browsers
though, even if no language is given on <track>.
There is also the Content-Language response header of HTTP, which
could have an influence on the browser, too. I'm not sure about the
best way to deal with all this overlapping information, but I'm sure
it can be sorted out.

My preference is ignoring everything except what is given in <track>.
In particular language can't be given in the resource or its headers,
because then one has to fetch all the tracks in order to provide a track
selection menu with language information or to automatically activate
the suitable tracks.

Why do non-browser players need to know the kind? All kinds are
processed in the same way except metadata, and there's no reason to use
metadata tracks with external players.


Maybe I have a different view of what applications will make use of
WebVTT files than most. My thinking is that there will also be uses
for metadata tracks in external applications. Aside from this, there
will be authoring applications and players, yes, but there will also
be automated processing tools. So, to know what type of content is
inside a file without having to look at more than the file's headers
is really important.

For both of these cases, putting some magic strings inside comments that
are ignored by browsers sounds like it would be sufficient. Name-value
metadata that is ignored by browsers would be fine as well.


I'm for the second option: name-value metadata that is ignored by the
browser. I think in fact the browser should in general ignore all
name-value metadata with the exception of file-wide cue settings.

I agree, browsers should ignore in-file metadata. (That's one reason I
think using comments for it is quite fine most of the time.)

Further metadata fields that are typically used by authors to keep
specific authoring information or usage hints are necessary, too. As
examples of current use see the format of MPlayer mpsub’s header
metadata [2], EBU STL’s General Subtitle Information block [3], and
even CEA-608’s Extended Data Service with its StartDate, Station,
Program, Category and TVRating information [4]. Rather than specifying
a specific subset of potential fields we recommend to just have the
means to provide name-value pairs and leave it to the negotiation
between the author and the publisher which fields they expect of each
other.

This approach has worked very well with Vorbis Comments, probably mostly
because all interesting fields have been pre-defined in
http://www.xiph.org/vorbis/doc/v-comment.html

For a web format though, wouldn't some kind of wiki registry be good to
avoid total mayhem, especially if there are some predefined fields? (Not
having file-wide metadata would also avoid such mayhem.)


It might be good to define a base set - the Vorbis Comments one or the
ID3 ones could be appropriate. Even the old Dublin Core set (the first
ones, not the current chaos) could be good. I could also analyse the
sets used in current typical caption formats and propose a superset of
those.

While I think you're right with suggesting a predefined set of fields,
I am mostly keen right now to agree on the general format of the
fields and how we need to parse them rather than what they actually
are.

So, I would suggest we allow lines of "name=value" under the WEBVTT
magic string. A blank line defines the end of the header section and
the beginning of the cues. Would be simple enough to parse, right?


Sure, it's already handled by the current parsing spec, since it ignores
everything up to the first blank line.


That's not quite how I'm reading the spec.

http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#webvtt-0
allows
"Optionally, either a U+0020 SPACE character or a U+0009 CHARACTER
TABULATION (tab) character followed by any number of characters that
are not U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR)
characters."
after the "WEBVTT FILE" magic.
To me that reads like all of the extra stuff has to be on the same line.
I'd prefer if this read "any character except for two WebVTT line
terminators", then it would all be ready for such header-style
metadata.

See steps 12-17 of
<http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#parsing-0>,
it just skips all lines up to the first blank line. Syntax and parsing
are different :)
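
A minimal sketch of the header idea discussed above, assuming "name=value" lines directly under the WEBVTT magic string and a blank line ending the header. The function name parse_header and the "last occurrence wins" rule are illustrative assumptions taken from the suggestions in this thread, not anything the parsing spec defines; a browser following the current spec would simply skip these lines.

```python
def parse_header(text):
    """Split a WebVTT-like file into (metadata, body).

    Assumption: header lines look like "name=value" and the first blank
    line ends the header, as proposed in this thread.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("missing WEBVTT magic string")

    metadata = {}
    index = 1
    # Consume header lines up to (not including) the first blank line.
    while index < len(lines) and lines[index].strip() != "":
        name, separator, value = lines[index].partition("=")
        if separator:
            # A repeated name overwrites the earlier value (last one wins).
            metadata[name.strip()] = value.strip()
        # Lines without "=" are ignored, like any other unknown header text.
        index += 1

    # Everything after the blank line is the cue section.
    body = "\n".join(lines[index + 1:])
    return metadata, body


sample = "WEBVTT\nLanguage=en-US\nKind=Captions\n\ncue1\n0.000-5.000\nblab blah\n"
metadata, body = parse_header(sample)
print(metadata)
print(body)
```

A non-browser tool could read only the header this way to learn the language and kind without looking at any cue.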

4. Cue formatting requirements
In analysing the available cue formatting functionality, we have found
that some features are missing. Most of these features can be added
through using CSS on cues that have received a <b>, <i>, <c> or <v>
marker. The following features are core to traditional TV and exist in
EBU STL and CEA-608/708 captions. Support of these will be a core
requirement for browsers as well as non-browser applications and it
makes sense to add these to WebVTT rather than relying on external CSS
which cannot be used for non-browser captions:
The unstated requirement here seems to be that WebVTT needs to work as
an interchange format for various TV captioning formats even in user
agents without any support for CSS (or JavaScript). I'm trying to not
make a straw man argument, but if we want an interchange format, we
should pick TTML, which is explicitly designed to be just that and
doesn't depend on CSS.
Is it not enough that a lossy conversion can be made from various
formats into WebVTT+CSS(+JavaScript)? If not, the "Web" in "WebVTT" is
highly misleading...
We're trying to avoid the need for multiple transcodings and are
trying to achieve something like the following pipeline:
broadcast captions -> transcode to WebVTT -> show in browser ->
transcode to broadcast devices -> show

If we have to plug TTML into this pipeline, too, it will be much
slower and we would need to additionally define a mapping from TTML to
WebVTT and back.

I'm sure with SMPTE-TT around we will end up seeing things like
broadcast->TTML->WebVTT->browser, but even then we don't want WebVTT
to be a lossy format.
I can only disagree. Trying to make WebVTT into an interchange format
will inevitably turn it into a highly presentational format with lots of
legacy baggage. I can certainly see the use cases for an interchange
format, but I don't think it's worth the added complexity. I'd prefer an
approach where any format quirks that can't be mapped to WebVTT are
expressed using <c.foo> and if it turns out lots of people want the
feature, we can add it to a future revision.
I wouldn't go as far as to say it needs to become an interchange
format. But I can see us specifying what the browser parses, while
given options such as the header-metadata and span classes that allow
with some extra information to fully recover the broadcast
functionality. I actually think that is almost possible already.

After this thread has run for a while, it'd be nice to hear where you
think <c.foo> isn't enough and new markup is needed, if anything.

* underline: EBU STL, CEA-608 and CEA-708 support underlining of
characters. The underline character is also particularly important for
some Asian languages. Please make it possible to provide text
underlines without the use of CSS in WebVTT.
Which Asian languages? If it's just the Chinese
<http://en.wikipedia.org/wiki/Proper_name_mark>, then I don't think that
needs <u> or similar. In my experience, use of the Chinese proper name
mark is in fact extremely rare in Chinese captions, at least in movies
and TV series from the mainland and Taiwan. It would be best to use e.g.
我來自<c.pnm>中國</c> to make it easy to change the style between
single/double/wavy/no underline.


OK. So if we need underlined text, it will need to be
<c.underline>..</c> and CSS underline? I guess in a Web context
underline text is usually a hyperlink so it makes sense to discourage
<u> for the Web. But is that also an argument for
captions/subtitles/descriptions? What is the argument against using
<u> in captions?


I don't really have an argument against it, I just questioned that it is
important for Asian languages in particular. Adding <u> would be really
simple, it's just a question of why. I've seldom seen underlining in
captions, so it's not clear to me how it's usually used.


I'm told <u> is fairly common in traditional captions. We don't do
<c.italics> either for such common stuff.
But if we really don't want this, I guess <c.u> would work, too and is
not that much longer.

I can't see any underlining when scanning through the samples at
<http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA>.
If it is in fact common in some contexts, it'd be great to have samples
added to the wiki, I'm sure we could learn something from it. If <u> is
actually useful for something, then we should just add it.

With "-" you are referring to replacing "-->" with "-" to arrive at
things like:
15.000-17.950
At the left we can see...

as compared to:
15.000+2.950
At the left we can see...


Yes, that's what I meant.

I actually think they read fairly well given that people are used to the
double meaning of "-": to mean both "from ... to" and "minus".
But we could use a different character for "absolute time" if you
prefer, e.g. "/".
15.000/17.950
At the left we can see...

I find this fairly readable, too.


Either would work for me. As I mentioned, the room for improvement here
isn't only the syntax of the timing line, but also to make it obvious
that cue timestamps like <00:01.000> are relative. Using + for relative
timestamps is potentially confusing too, as one might think that many
consecutive <+00:01.000> are cumulative, rather than all being 1 second
from the start time of the cue.


That's true and in fact the way in which I have authored my examples,
now that I look back at them. It makes the timings smaller and I think
it's a bit more logical. But really we just have to decide on one
meaning:

5-10
This <+1>is <+1>a <+1>simple <+1>example.

I find I actually prefer this over

5-10
This <+1>is <+2>a <+3>simple <+4>example.

Right, we just have to pick something. I'd like to get the basic
structure down soon, though, as changing the timestamp parsing will be
very difficult once there are implementations.
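
To make the two readings of the <+N> timestamps concrete, here is a small sketch that resolves them to absolute times. The <+N> syntax with plain seconds is the hypothetical notation from the examples in this thread, and the function name resolve_timestamps is made up for illustration.

```python
import re

# Matches the hypothetical "<+N>" inline timestamp, N in seconds.
TIMESTAMP = re.compile(r"<\+(\d+(?:\.\d+)?)>")


def resolve_timestamps(cue_start, text, cumulative):
    """Return the absolute time of every <+N> marker in a cue.

    cumulative=True:  each offset is added to the previous marker's time.
    cumulative=False: each offset is measured from the cue start time.
    """
    absolute = []
    current = cue_start
    for match in TIMESTAMP.finditer(text):
        offset = float(match.group(1))
        if cumulative:
            current += offset
            absolute.append(current)
        else:
            absolute.append(cue_start + offset)
    return absolute


# Both cues start at 5 seconds and describe the same word timings.
print(resolve_timestamps(5.0, "This <+1>is <+1>a <+1>simple <+1>example.", True))
print(resolve_timestamps(5.0, "This <+1>is <+2>a <+3>simple <+4>example.", False))
```

Under these assumptions both calls yield the same absolute times, which is why only one meaning needs to be picked; reading the first cue with the non-cumulative rule would instead place every marker at 6 seconds.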

7. Comments

we recommend the introduction of comments.


I agree and think it needs to happen before WebVTT starts to get
implemented
and used on the web. In other words: now.


Agreed. I'm happy for the previously suggested "//" at the line start
to be comments, or, for that matter, "#" or ";" or any other special
character. I would prefer not to use "/*" since it implies a "*/" is
required to end the comment. Similarly we should avoid "<!--" and
"-->" or anything else that requires a special comment end mark and
more than one or two characters.


I'd quite like to have block comments, so I think the best options are:

1. // and /* */ like JavaScript
2. <!-- --> like HTML/XML


If the main use case for the comments is to comment out a line,
something at the line start alone would be sufficient. If we have to
have both, I would prefer the shorter first option.

I think that the main difficulty is actually not picking a syntax, but
deciding how it works in the parser. Unlike HTML, I don't think we want
the comments to show up in the "DOM", since that would only work for
intra-cue comments. Ideally it would be preprocessor-ish, but yet the
magic bytes ("WEBVTT FILE") should be checked first as otherwise
identifying WebVTT would require implementing its preprocessor steps :/


As I would not want the comments to be handed into the DOM or to
JavaScript, it doesn't matter if they are not like HTML. I would
regard them more as pre-processor style comments.

For simplicity, perhaps it would be better to have line-comments only.
On my wishlist I have a less convoluted parser definition which operates
on lines instead of sprinkling CR/LF all over, and it'd be easy to add
line-comments to such a parser. Wish-list item requested at
<http://www.w3.org/Bugs/Public/show_bug.cgi?id=12076>.
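
A rough sketch of what line-comments in a line-based, preprocessor-style pass could look like. It assumes "//" at the start of a line as the comment marker (one of the candidates mentioned above) and checks the magic string before stripping anything; strip_line_comments is a hypothetical name, not part of any spec.

```python
def strip_line_comments(text, marker="//"):
    """Drop every line that starts with the comment marker.

    The magic string is checked first, so identifying a WebVTT file never
    depends on the comment-stripping step.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("missing WEBVTT magic string")

    # Keep the magic line untouched, then filter the remaining lines.
    kept = [lines[0]]
    for line in lines[1:]:
        if not line.startswith(marker):
            kept.append(line)
    return "\n".join(kept)


sample = (
    "WEBVTT\n"
    "// translated from the German track\n"
    "\n"
    "cue1\n"
    "0.000-5.000\n"
    "blab blah\n"
    "// blab blah blah (older wording)\n"
)
print(strip_line_comments(sample))
```

Because comments are removed before cue parsing, they never reach the "DOM" or script, which matches the preprocessor-style behaviour both sides seem to want.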

8. Line wrapping

CEA-708 captions support automatic line wrapping in a more
sophisticated way than WebVTT -- see
http://en.wikipedia.org/wiki/CEA-708#Word_wrap.

In our experience with YouTube we have found that in certain
situations this type of automatic line wrapping is very useful.
Captions that were authored for display in a full-screen video may
contain too many words to be displayed fully within the actual video
presentation (note that mobile / desktop / internet TV devices may
each have a different amount of space available, and embedded videos
may be of arbitrary sizes). Furthermore, user-selected fonts or font
sizes may be larger than expected, especially for viewers who need
larger print.

WebVTT as currently specified wraps text at the edge of their
containing blocks, regardless of the value of the 'white-space'
property, even if doing so requires splitting a word where there is no
line breaking opportunity. This will tend to create poor quality
captions.  For languages where it makes sense, line wrapping should
only be possible at carriage return, space, or hyphen characters, but
not on &nbsp; characters.  (Note that CEA-708 also contains
non-breaking space and non-breaking transparent space characters to
help control wrapping.) However, this algorithm will not necessarily
work for all languages.

We therefore suggest that a better solution for line wrapping would be
to use the existing line wrapping algorithms of browsers, which are
presumably already language-sensitive.

[Note: the YouTube line wrapping algorithm goes even further by
splitting single caption cues into multiple cues if there is too much
text to reasonably fit within the area. YouTube then adjusts the times
of these caption cues so they appear sequentially. Perhaps this could
be mentioned as another option for server-side tools.]


Yeah, with SRT people are manually line-wrapping when authoring the
captions
and often enough the end result is that you get something rendered:

- Who could have guessed that not all fonts are the same
size?
- That's news to me, so I get four lines of text where I
wanted two!

I'm inclined to say that we should normalize all whitespace during
parsing and not have explicit line breaks at all. If people really want
two lines, they should use two cues. In practice, I don't know how well
that would fare, though. What other solutions are there?


I don't think I would go that far. The concern has mostly been with
the line wrapping of lines that are too long and the possibility of
splitting words that way. The particular concern was with this
paragraph:

"Text runs must be wrapped at the edge of their containing blocks,
regardless of the value of the 'white-space' property, even if doing
so requires splitting a word where there is no line breaking
opportunity."
see
http://www.whatwg.org/specs/web-apps/current-work/multipage/rendering.html#timed-text-tracks-0

So we want to avoid splitting mid-word and we suggest introducing the
ability to have non-breaking spaces.

I think splitting in the middle of words would only happen for words
that are longer than the whole line.


Ah ok - I guess you can interpret the sentence above in this way as
in "splitting a word ONLY where there is no line breaking opportunity".
Then it's probably ok. It would still make sense to accept
non-breaking spaces.


Perhaps Hixie would like to clarify in the spec precisely what is meant?

There's already a non-breaking space in Unicode: NO-BREAK SPACE (U+00A0)

There's still plenty of room for improvements in line wrapping, though.
It seems to me that the main reason that people line wrap captions
manually is to avoid getting two lines of very different length, as that
looks quite unbalanced. There's no way to make that happen with CSS, and
AFAIK it's not done by the WebVTT rendering spec either.


People split manually when they want quality captions and can visually
test what it will look like.

This endeavor has one big problem: when you change the video size,
e.g. go to full screen, your optimisation for the previous size is
likely to not be optimal for the new size any more. There, an
automatic line balancing that makes use of commas and "and"s for
choosing likely good line break positions would be nice.

A completely different situation appears when the captions are not
manually created, as is the case in YouTube. Even when you submit a
perfect transcript and time-align it through speech recognition, you
will only do the line breaks as you have to render cues. To achieve a
better quality there, a better line-break algorithm would help
massively.

So, I agree with you about improving the line wrapping. I also think
it is likely something that we have to leave to the browsers - at
least for now.

Right, some experimentation here would be great, as I haven't seen any
feature like this in any media players. In the hope of inspiring
someone, perhaps myself, here's how I tentatively would like things to
work:


1. Authors are encouraged to not manually line-break

2. UAs render the text at whatever width the <video> container allows,
with margins and all

3. The text will have been rendered on n lines.

4. Decrease the width on the container as much as possible while having
n lines.

5. Use that line-breaking and then do whatever left/center/right-alignment
relative to the original width.

I really should get around to reading the rendering section for WebVTT
to see what it actually does, perhaps it's already clever...
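
The five steps above can be approximated in a few lines. This is only a sketch under a strong simplifying assumption: width is counted in characters and Python's textwrap stands in for a real UA's font-aware line breaking. The name balanced_wrap is hypothetical.

```python
import textwrap


def balanced_wrap(text, max_width):
    """Wrap text, then shrink the width as far as the line count allows.

    Steps 2-3: wrap at the full width and note the number of lines n.
    Step 4:    find the narrowest width that still yields n lines.
    Step 5:    return that line-breaking; alignment is left to the caller.
    """
    target = len(textwrap.wrap(text, width=max_width))
    best = textwrap.wrap(text, width=max_width)
    for width in range(max_width, 0, -1):
        candidate = textwrap.wrap(text, width=width)
        if len(candidate) == target:
            # Narrower than any earlier match, same number of lines.
            best = candidate
    return best


caption = "Who could have guessed that not all fonts are the same size?"
print(textwrap.wrap(caption, width=50))
print(balanced_wrap(caption, 50))
```

With a 50-character limit the plain wrap leaves "same size?" alone on the second line, while the balanced version breaks after "not" so the two lines come out at similar lengths. A real implementation would measure rendered glyph widths rather than character counts.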

4. Addressing individual cues through CSS

As far as we understand, you can currently address all cues through
::cue and you can address a cue part through ::cue-part(<voice> ||
<part> || <position> || <future-compatibility>). However, if we
understand correctly, it doesn’t seem to be possible to address an
individual cue through CSS, even though cues have individual
identifiers. This is either an oversight or a misunderstanding on our
parts. Can you please clarify how it is possible to address an
individual cue through CSS?

Since I've been arguing against the id's in WebVTT, I'm curious about
the use case here. Isn't using a unique class good enough?


This links in with the discussion above on CSS styling and classes.
Rather than define classes of cue settings and reference them from the
cues, this allows them to be applied to individual cues in style
sheets. I thought the whole reason of cue identifiers was to have this
addressing functionality, so this would just close the loop.

For example:

Style sheet of the Web page:
<style>
video track#t1 ::cue(cue10) {
 text-decoration: blink;
}
</style>

The Web page (extract):
<video src="video.webm" controls>
 <track id="t1" label="captions" kind="captions" srclang="en-US"
src="cap1.vtt"/>
</video>

The caption file cap1.vtt:
WEBVTT
Language=en-US
Kind=Captions

cue1
0.000-5.000
blab blah

cue10
40.000-60.000
ALERT: Your basement is flooding - evacuate!


Cue10 is addressed through CSS and turned into a blinking text without
a need to change the markup at all.


My point was that you could just as well do this:

0.000-5.000
<c.cue1>blab blah</c>

In my view of things, id's in HTML are primarily for addressing via
#fragments and as hooks for scripts, for styling class is quite
sufficient, so I'm thinking it would be for WebVTT as well.


I quite like the idea of using the identifiers for named media
fragment URIs: e.g. http://example.org/video.webm#cue10 . We need
identifiers for this. Also, I find them less intrusive in the text
than <c.cue1> which defines a class that is only ever used on this
single cue.

Hmm, isn't that what we have chapters for? Or do you want to use id's
for some kind of inline chapters?

5. Ability to move captions out of the way

Our experience with automated caption creation and positioning on
YouTube indicates that it is almost impossible to always place the
captions out of the way of where a user may be interested to look at.
We therefore allow users to dynamically move the caption rendering
area to a different viewport position to reveal what is underneath. We
recommend such drag-and-drop functionality also be made available for
TimedTrack captions on the Web, especially when no specific
positioning information is provided.


This would indeed be rather nice, but wouldn't it interfere with text
selection? Detaching the captions into a floating, draggable window via
the context menu would be a theoretically possible solution, but that's
getting rather far ahead of ourselves before we have basic captioning
support.


On YouTube you can only move them within the video viewport. You
should try it - it's really awesome actually.

When you say "interfere with text selection" are you suggesting that
the text of captions/subtitles should be able to be cut and pasted? I
wonder what copyright holders think about that.

Being able to select the captions just like any other text is a great
thing that I wouldn't want to disable. It's very useful if you want to
pause and look up the definition of a word or to report a typo in the
captions without having to retype the whole text.


I guess you can have all of that as you can have it on Web pages, too.
If you click and hold, it will be grabbing for moving. If you double
click it is text selection for cut and paste. So, I don't think there
would be a problem.

That would work, but I have to admit I've never seen a web page/browser
combination that does what you suggest. Just single clicking and
dragging is certainly the most discoverable form of text selection.

Premium Captions can be protected using the same tricks that are used to
prevent Premium DOM Text Nodes from being copied.


Agreed.


--
Philip Jägenstedt
Core Developer
Opera Software
