On Sat, 07 Aug 2010 09:57:39 +0200, Silvia Pfeiffer <[email protected]> wrote:

Hi Philip,

On Sat, Aug 7, 2010 at 1:50 AM, Philip Jägenstedt <[email protected]> wrote:

* there is a possibility to provide script that just affects the
time-synchronized text resource


I agree that some metadata would be useful, more on that below. I'm not
sure why we would want to run scripts inside the text document, though, when
that can be accomplished by using the TimedTrack API from the containing
page.



Scripts inside a timed text document would only be useful for applications
that use the track outside the context of a Web page.

Do you mean that media players could include a JavaScript engine just for supporting scripts in WebSRT? Not to say that it can't happen, but it seems a bit unlikely.

2. There is a natural mapping of WebSRT into in-band text tracks.
Each cue naturally maps into an encoding page (just like a WMML cue does,
too). But in WebSRT, because the setup information is not carried in a
too). But in WebSRT, because the setup information is not brought in a
hierarchical element surrounding all cues, it is easier to just chuck
anything that comes before the first cue into an encoding header page. For
WMML, this problem can be solved, but it is less natural.


I really like the idea of letting everything before the first timestamp in
WebSRT be interpreted as the header. I'd want to use it like this:

# author: Fan Subber
# voices: <1> Boy
#         <2> Girl

01:23:45.678 --> 01:23:46.789
<1> Hello

01:23:48.910 --> 01:23:49.101
<2> Hello

It's not critical that the format of the header be machine-readable, but we
could of course make up a key-value syntax, use JSON, or something else.
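For what it's worth, the timing lines in the example above are already
trivially machine-readable; a sketch of parsing them (function names and
error handling are mine, this is not the spec's parsing algorithm) could
look like:

```javascript
// Parse "01:23:45.678" into milliseconds, and a timing line such as
// "01:23:45.678 --> 01:23:46.789" into { start, end }.
// Sketch only; not the spec's cue timing parser.
function parseTimestamp(s) {
  var m = /^(\d{2}):(\d{2}):(\d{2})\.(\d{3})$/.exec(s);
  if (!m) return null;
  return ((+m[1] * 60 + +m[2]) * 60 + +m[3]) * 1000 + +m[4];
}

function parseTimingLine(line) {
  var parts = line.split(' --> ');
  if (parts.length !== 2) return null;
  var start = parseTimestamp(parts[0]);
  var end = parseTimestamp(parts[1]);
  return (start === null || end === null) ? null : { start: start, end: end };
}
```

For the first cue above this would give a start time of 5025678 ms.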



I disagree. I think it's absolutely necessary that the format of the header
be machine-readable, just as EXIF in images and ID3 in MP3 are
machine-readable. It would be counter-productive not to have it
machine-readable, and in particular useless to archiving and media
management solutions.

OK, so maybe key-values?

Author: Fan Subber
Voice: <1> Boy
Voice: <2> Girl

01:23:45.678 --> 01:23:46.789
<1> Hello

This looks a bit like HTTP headers. (I'm not sure I'd actually want to allow multiple occurrences of the same key; in practice that seems to result in inconsistencies in how people mark up multiple authors.)
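A non-browser tool could then read such a header with something like the
following sketch (the rule that the header ends at the first timing line is
my assumption here, not spec text):

```javascript
// Collect "Key: value" lines appearing before the first timing line.
// Sketch only; the WebSRT parser is not currently defined to do this.
function parseHeader(text) {
  var header = {};
  var lines = text.split(/\r?\n/);
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    if (line.indexOf('-->') !== -1) break; // first timing line: header ends
    var m = /^([^:]+):\s*(.*)$/.exec(line);
    if (m) {
      // Last occurrence wins, which is exactly where repeated keys
      // like multiple "Voice:" lines would lose information.
      header[m[1].trim()] = m[2];
    }
  }
  return header;
}
```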

I'm not sure of the best solution. I'd quite like the ability to use
arbitrary voices, e.g. to use the names/initials of the speaker rather than
a number, or to use e.g. <shouting> in combination with CSS :before {
content 'Shouting: ' } or similar to adapt the display for different
audiences (accessibility, basically).



I agree. I think we can go back to using <span> and @class and @id and that
would solve it all.

I guess this is in support of Henri's proposal of parsing the cue using the HTML fragment parser (same as innerHTML)? That would be easy to implement, but how do we then mark up speakers? Using <span class="narrator"></span> around each cue is very verbose. HTML isn't very good for marking up dialog, which is quite a limitation when dealing with subtitles...
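To make the comparison concrete: with a dedicated voice syntax, pulling the
speaker out of a cue line is a one-regexp job, whereas span-based markup
requires a full fragment parse. A hypothetical sketch:

```javascript
// Split a leading voice tag such as "<1>" or "<narrator>" from a cue line.
// Hypothetical sketch; not the spec's actual cue text parser.
function splitVoice(cueLine) {
  var m = /^<([^>]+)>\s*(.*)$/.exec(cueLine);
  if (m) return { voice: m[1], text: m[2] };
  return { voice: null, text: cueLine };
}
```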

* there is no language specification for a WebSRT resource; while this
will
not be a problem when used in conjunction with a <track> element, it still is a problem when the resource is used just by itself, in particular as a
hint for font selection and speech synthesis.


The language inside the WebSRT file wouldn't end up being used for anything by a browser, as the browser needs to know the language before fetching the file in order to decide whether to download it at all. Still, I'd like a header section in
WebSRT. I think the parser is already defined so that it would ignore
garbage before the first cue, so this is more a matter of making it legal
syntax.


Not quite. Some metadata in the header can make sense to also expose to the
Web page.

I agree that we need a structured header section in WebSRT.

Fair enough, we should revisit this when deciding on how to expose metadata in media resources in general.

* there is no means to identify which parser is required in the cues (is
it
"plain text", "minimal markup", or "anything"?) and therefore it is not
possible for an application to know how it should parse the cues.


All the types that are actually for visual rendering are parsed in the same
way, aren't they? Of course there's no way for non-browsers to know that
metadata tracks aren't interesting to look at as subtitles, but I think
showing the user the garbage is a quicker way to communicate that the file
isn't for direct viewing than hiding the text or similar.



The spec says that files of kind "descriptions" and "metadata" are not
displayed. It seems though that the parsing section will try two interfaces: HTML and plain. I think there is a disconnect there. If we already know that
it's not parsable in HTML, why even try?

I was confused. The parsing algorithm does the same thing regardless of what kind of text track it is dealing with. I guess what you're saying is that non-browser applications also need to know that something is e.g. chapter markers, so that they can display it appropriately?

I don't have a strong opinion, but repeating the same information both in the containing document and in the subtitle file means that one of them will be ignored by browsers. People will copy-paste the ignored one and it will end up being wrong a lot of the time.

* there is no version number on the format, thus it will be difficult to
introduce future changes.


I think we shouldn't have a version number, for the same reason that CSS
and HTML don't really have versions. If we evolve the WebSRT spec, it should
be in a backwards-compatible way.


CSS and HTML are structured formats where you ignore things that you cannot interpret. But the parsing is fixed and extensions play within this parsing
framework. I have my doubts that this is possible with WebSRT. Already one
extension that we are discussing here will break parsing: the introduction
of structured headers. Because there is no structured way of extending
WebSRT, I believe the best way to communicate whether it is backwards
compatible is through a version number. We can change the minor version if
compatibility is not broken - it still communicates what features are being
used - and we can change the major version if compatibility is broken.

Similarly, I think that the WebSRT parser should be designed to ignore things that it doesn't recognize, in particular unknown voices (if we keep those). Requiring parsers to fail when the version number is increased makes it harder to introduce changes to the format, because you'll have to either break all existing implementations or provide one subtitle file for each version. (Having a version number but letting parsers ignore it is just weird, quite like in HTML.)

I filed a bug suggesting that voice is allowed to be an arbitrary string: <http://www.w3.org/Bugs/Public/show_bug.cgi?id=10320> (From the point of view of the parser, it still wouldn't be valid syntax.)

 2. Break the SRT link.


* the mime type of WebSRT resources should be a different mime type to SRT
files, since they are so fundamentally different; e.g. text/websrt

* the file extension of WebSRT resources should be different from SRT
files,
e.g. wsrt


I'm not sure if either of these would make a difference.


Really? How do you propose that a media player identifies that it cannot
parse a WebSRT file that has random metadata in it when it is called .srt
and provided under the same mime type as SRT files? Or consider a
transcoding pipeline that relies on .srt files just being plain old simple SRT. It breaks
expectations with users, with developers and with software.

I think it's unlikely that people will offer download links to SRT files that aren't useful outside of the page, so random metadata isn't likely to reach end users or applications by accident. Also, most media frameworks rely mainly on sniffing, so even a file that uses lots of WebSRT-only features is quite likely going to be detected as SRT anyway. At least in GStreamer, the file extension is given quite little weight in guessing the type and MIME isn't used at all (because the sniffing code doesn't know anything about HTTP). Finally, seeing random metadata displayed on screen is about as good an indication that the file is "broken" as the application failing to recognize the file completely.
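To illustrate the kind of sniffing I mean, here is a rough sketch (my own,
not GStreamer's actual typefind code) that accepts both classic SRT comma
decimals and WebSRT period decimals:

```javascript
// Guess whether a text blob looks like SRT/WebSRT by scanning the first
// few lines for a timing line. Sketch only, not real typefind logic.
function looksLikeSrt(text) {
  var lines = text.split(/\r?\n/).slice(0, 10);
  // "," for classic SRT, "." for WebSRT.
  var timing = /\d{2}:\d{2}:\d{2}[.,]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[.,]\d{3}/;
  for (var i = 0; i < lines.length; i++) {
    if (timing.test(lines[i])) return true;
  }
  return false;
}
```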

On the other hand, keeping the same extension and (unregistered) MIME type as SRT has plenty of benefits, such as immediately being able to use existing SRT files in browsers without changing their file extension or MIME type.

 4. Make full use of CSS

In the current form, WebSRT only makes limited use of existing CSS. I see
particularly the following limitations:

* no use of the positioning functionality is made and instead a new means
of
positioning is introduced; it would be nicer to just have this reuse CSS
functionality. It would also avoid having to repeat the positioning
information on every single cue.


I agree, the positioning syntax isn't something I'm happy about with
WebSRT. I think treating everything that follows the timestamp to be CSS
that applies to the whole cue would be better.


Or taking the positioning stuff out of WebSRT and moving it to an external
CSS file as is done with formatting would make it much simpler.

Ah, that would be great. It's quite likely that there will only be 1 or 2 different positions in the whole file, which you don't want to repeat on each and every cue.

 * there is no definition of the "canvas" dimensions that the cues are
prepared for (width/height) and expected to work with, other than saying it
is the video dimensions - but these can change, and the proportions should
change with them


I'm not sure what you're saying here. Should the subtitle file be
hard-coded to a particular size? In the quite peculiar case where the same subtitles really don't work at two different resolutions, couldn't we just
have two files? In what cases would this be needed?


Most subtitles will be created with a specific width and height in mind. For example, the width in characters relies on the video canvas having at least that size and the number of lines used usually refers to a lower third of a
video - where that is too small, it might cover the whole video. So, my
proposal is not to hard-code the subtitles to a particular size, but to put
the minimum width and height that are being used for the creation of the
subtitles into the file. Then, the file can be scaled below or above this
size to adjust to the actual available space.

In practice, does this mean scaling font-size by width_actual/width_intended or similar? Personally, I prefer subtitles to be something like 20 screen pixels regardless of video size, as that is readable. Making them bigger hides more of the video, while making them smaller makes them hard to read. But I guess we could let the CSS media query min-width and similar be evaluated against the size of the containing video element, to make it possible anyway.
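Concretely, the scaling being proposed would amount to something like this
(a hypothetical calculation, not anything in the spec):

```javascript
// Scale an authored font size by how much larger or smaller the actual
// rendering area is than the minimum width the subtitles were authored
// for. Hypothetical sketch of the proposal discussed above.
function scaledFontSize(authoredPx, intendedWidth, actualWidth) {
  return authoredPx * (actualWidth / intendedWidth);
}
```

At double the intended width, a 20px subtitle would become 40px, which is
exactly the behaviour I would want to avoid.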

--
Philip Jägenstedt
Core Developer
Opera Software
