Re: Intent to implement and ship: UTF-8 autodetection for HTML and plain text loaded from file: URLs

2018-12-12 Thread Martin Thomson
On Thu, Dec 13, 2018 at 1:14 AM Henri Sivonen  wrote:
> I changed the limit to 4 MB.

SGTM.


Re: Intent to implement and ship: UTF-8 autodetection for HTML and plain text loaded from file: URLs

2018-12-12 Thread Henri Sivonen
On Tue, Dec 11, 2018 at 10:08 AM Henri Sivonen  wrote:
> How about I change it to 5 MB on the assumption that that's still very
> large relative to pre-UTF-8-era HTML and text file sizes?

I changed the limit to 4 MB.

-- 
Henri Sivonen
hsivo...@mozilla.com


Re: Intent to implement and ship: UTF-8 autodetection for HTML and plain text loaded from file: URLs

2018-12-11 Thread Henri Sivonen
On Tue, Dec 11, 2018 at 2:24 AM Martin Thomson  wrote:
> This seems reasonable, but 50M is a pretty large number.  Given the
> odds of UTF-8 detection failing, I would have thought that this could
> be much lower.

Consider the case of a document of ASCII text with a copyright sign in
the footer. I'd rather not make anyone puzzle over why the behavior of
the footer depends on how much text comes before the footer.

50 MB is intentionally extremely large relative to "normal" HTML and
text files so that the limit is reached approximately "never" unless
you open *huge* log files.

The HTML spec is about 11 MB these days, so that's an existence proof
that a non-log-file HTML document can exceed 10 MB. Of course, the
limit doesn't need to be larger than present-day UTF-8 files, only
larger than "normal"-sized *legacy* non-UTF-8 files.

It is quite possible that 50 MB is *too* large considering 32-bit
systems and what *other* allocations are proportional to the buffer
size, and I'm open to changing the limit to something smaller than 50
MB as long as it's still larger than "normal" non-UTF-8 HTML and text
files.

How about I change it to 5 MB on the assumption that that's still very
large relative to pre-UTF-8-era HTML and text file sizes?

> What is the number in Chrome?

It depends. It's unclear to me what exactly it depends on. Based on
https://github.com/whatwg/encoding/issues/68#issuecomment-272993181 ,
I expect it to depend on some combination of file system, OS kernel
and Chromium IO library internals.

On Ubuntu 18.04 with ext4 on an SSD, the number is 64 KB. On Windows
10 1803 with NTFS on an SSD, it's something smaller.

I think making the limit depend on the internals of file IO buffering
instead of on a constant in the HTML parser is a really bad idea.
Also, 64 KB or anything smaller seems way too small for the purpose of
making it so that the user approximately never needs to puzzle over
why behavior differs based on the length of the ASCII prefix of a
file that has non-ASCII later in the file.

> I assume that other local sources like chrome: are expected to be
> annotated properly.

From source inspection, it seems that chrome: URLs already get
hard-coded to UTF-8 on the channel level:
https://searchfox.org/mozilla-central/source/chrome/nsChromeProtocolHandler.cpp#187

As part of developing the patch, I saw only resource: URLs showing up
as file: URLs to the HTML parser, so only resource: URLs got a special
check that fast-tracks them to UTF-8 instead of buffering for
detection like normal file: URLs.

> On Mon, Dec 10, 2018 at 11:28 PM Henri Sivonen  wrote:
> >
> > (Note: This isn't really a Web-exposed feature, but this is a Web
> > developer-exposed feature.)
> >
> > # Summary
> >
> > Autodetect UTF-8 when loading HTML or plain text from file: URLs (only!).
> >
> > Some Web developers like to develop locally from file: URLs (as
> > opposed to local HTTP server) and then deploy using a Web server that
> > declares charset=UTF-8. To get the same convenience as when developing
> > with Chrome, they want the files loaded from file: URLs to be treated as
> > UTF-8 even though the HTTP header isn't there.
> >
> > Non-developer users save files from the Web verbatim without the HTTP
> > headers and open the files from file: URLs. These days, those files
> > are most often in UTF-8 and lack the BOM, and sometimes they lack
> > <meta charset=utf-8>, and plain text files can't even use <meta
> > charset=utf-8>. These users, too, would like a Chrome-like convenience
> > when opening these files from file: URLs in Firefox.
> >
> > # Details
> >
> > If an HTML or plain text file loaded from a file: URL does not contain
> > a UTF-8 error in the first 50 MB, assume it is UTF-8. (It is extremely
> > improbable for text intended to be in a non-UTF-8 encoding to look
> > like valid UTF-8 on the byte level.) Otherwise, behave like at
> > present: assume the fallback legacy encoding, whose default depends on
> > the Firefox UI locale.
> >
> > The 50 MB limit exists to avoid buffering everything when loading a
> > log file whose size is on the order of a gigabyte. 50 MB is an
> > arbitrary size that is significantly larger than "normal" HTML or text
> > files, so that "normal"-sized files are examined with 100% confidence
> > (i.e. the whole file is examined) but can be assumed to fit in RAM
> > even on computers that only have a couple of gigabytes of RAM.
> >
> > The limit, despite being arbitrary, is checked exactly to avoid
> > visible behavior changes depending on how Necko chooses buffer
> > boundaries.
> >
> > The limit is a number of bytes instead of a timeout in order to avoid
> > reintroducing timing dependencies (carefully removed in Firefox 4) to
> > HTML parsing--even for file: URLs.
> >
> > Unless a <meta> declaring the encoding (or a BOM) is found within the
> > first 1024 bytes, up to 50 MB of input is buffered before starting
> > tokenizing. That is, the feature assumes that local files don't need
> > incremental HTML parsing, that local file streams don't stall as part
> > of their intended operation, and that the content of local files is
> > available in its entirety (approximately) immediately.

Re: Intent to implement and ship: UTF-8 autodetection for HTML and plain text loaded from file: URLs

2018-12-10 Thread Martin Thomson
This seems reasonable, but 50M is a pretty large number.  Given the
odds of UTF-8 detection failing, I would have thought that this could
be much lower.  What is the number in Chrome?

I assume that other local sources like chrome: are expected to be
annotated properly.
On Mon, Dec 10, 2018 at 11:28 PM Henri Sivonen  wrote:
>
> (Note: This isn't really a Web-exposed feature, but this is a Web
> developer-exposed feature.)
>
> # Summary
>
> Autodetect UTF-8 when loading HTML or plain text from file: URLs (only!).
>
> Some Web developers like to develop locally from file: URLs (as
> opposed to local HTTP server) and then deploy using a Web server that
> declares charset=UTF-8. To get the same convenience as when developing
> with Chrome, they want the files loaded from file: URLs to be treated as
> UTF-8 even though the HTTP header isn't there.
>
> Non-developer users save files from the Web verbatim without the HTTP
> headers and open the files from file: URLs. These days, those files
> are most often in UTF-8 and lack the BOM, and sometimes they lack
> <meta charset=utf-8>, and plain text files can't even use <meta
> charset=utf-8>. These users, too, would like a Chrome-like convenience
> when opening these files from file: URLs in Firefox.
>
> # Details
>
> If an HTML or plain text file loaded from a file: URL does not contain
> a UTF-8 error in the first 50 MB, assume it is UTF-8. (It is extremely
> improbable for text intended to be in a non-UTF-8 encoding to look
> like valid UTF-8 on the byte level.) Otherwise, behave like at
> present: assume the fallback legacy encoding, whose default depends on
> the Firefox UI locale.
>
> The 50 MB limit exists to avoid buffering everything when loading a
> log file whose size is on the order of a gigabyte. 50 MB is an
> arbitrary size that is significantly larger than "normal" HTML or text
> files, so that "normal"-sized files are examined with 100% confidence
> (i.e. the whole file is examined) but can be assumed to fit in RAM
> even on computers that only have a couple of gigabytes of RAM.
>
> The limit, despite being arbitrary, is checked exactly to avoid
> visible behavior changes depending on how Necko chooses buffer
> boundaries.
>
> The limit is a number of bytes instead of a timeout in order to avoid
> reintroducing timing dependencies (carefully removed in Firefox 4) to
> HTML parsing--even for file: URLs.
>
> Unless a <meta> declaring the encoding (or a BOM) is found within the
> first 1024 bytes, up to 50 MB of input is buffered before starting
> tokenizing. That is, the feature assumes that local files don't need
> incremental HTML parsing, that local file streams don't stall as part
> of their intended operation, and that the content of local files is
> available in its entirety (approximately) immediately.
>
> There are counter examples like Unix FIFOs (can be infinite and can
> stall for an arbitrary amount of time) or file server shares mounted
> as if they were local disks (data available somewhat less
> immediately). It is assumed that it's OK to require people who have
> built workflows around Unix FIFOs to use <meta charset=utf-8> and that it's
> OK to potentially start rendering a little later when file: URLs
> actually cause network access.
>
> UTF-8 autodetection is given lower precedence than all other signals
> that are presently considered for file: URLs. In particular, if a
> file:-URL HTML document frames another file: URL HTML document (i.e.
> they count as same-origin), the child inherits the encoding from the
> parent instead of UTF-8 autodetection getting applied in the child
> frame.
>
> # Why file: URLs only
>
> The reason why the feature does not apply to http: or https: resources
> is that in those cases, it really isn't OK to assume that all bytes
> arrive so quickly as to not benefit from incremental rendering and it
> isn't OK to assume that the stream doesn't intentionally stall.
>
> Applying detection to http: or https: resources would mean at least one
> of the following compromises:
>
> * Making the detection unreliable by making it depend on non-ASCII
> appearing in the first 1024 bytes (the number of bytes currently
> buffered for scanning <meta>). If the <title> was always near the
> start of the file and the natural language used a non-Latin script to
> make non-ASCII in the <title> a certainty, this solution would be
> reliable. However, this solution would be particularly bad for
> Latin-script languages with infrequent non-ASCII, such as Finnish or
> German, which can legitimately have all-ASCII titles despite the
> language as a whole including non-ASCII. That is, if a developer
> tested a site with a title that has some non-ASCII, things would
> appear to work, but then the site would break when an all-ASCII title
> occurs.
>
> * Making results depend on timing. (Having a detection timeout would
> make the results depend on network performance relative to wall-clock
> time.)
>
> * Making the detection unreliable by examining only the first buffer
> passed by the networking subsystem to the HTML parser.

Intent to implement and ship: UTF-8 autodetection for HTML and plain text loaded from file: URLs

2018-12-10 Thread Henri Sivonen
(Note: This isn't really a Web-exposed feature, but this is a Web
developer-exposed feature.)

# Summary

Autodetect UTF-8 when loading HTML or plain text from file: URLs (only!).

Some Web developers like to develop locally from file: URLs (as
opposed to local HTTP server) and then deploy using a Web server that
declares charset=UTF-8. To get the same convenience as when developing
with Chrome, they want the files loaded from file: URLs to be treated as
UTF-8 even though the HTTP header isn't there.

Non-developer users save files from the Web verbatim without the HTTP
headers and open the files from file: URLs. These days, those files
are most often in UTF-8 and lack the BOM, and sometimes they lack
<meta charset=utf-8>, and plain text files can't even use <meta
charset=utf-8>. These users, too, would like a Chrome-like convenience
when opening these files from file: URLs in Firefox.

# Details

If an HTML or plain text file loaded from a file: URL does not contain
a UTF-8 error in the first 50 MB, assume it is UTF-8. (It is extremely
improbable for text intended to be in a non-UTF-8 encoding to look
like valid UTF-8 on the byte level.) Otherwise, behave like at
present: assume the fallback legacy encoding, whose default depends on
the Firefox UI locale.
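
To make the rule concrete, here is a minimal sketch in Rust using only
the standard library. The function names are invented for illustration
(this is not the actual Gecko or encoding_rs code), and the treatment
of a multi-byte sequence split exactly at the detection limit is an
assumption:

// Returns true if the buffered prefix (at most the detection limit's worth
// of bytes) contains no UTF-8 error. A sequence that is merely cut in half
// by the limit itself is not counted as an error.
fn looks_like_utf8(buffered: &[u8], truncated_at_limit: bool) -> bool {
    match std::str::from_utf8(buffered) {
        Ok(_) => true,
        // error_len() == None means the buffer ends inside an incomplete
        // multi-byte sequence; that only counts as "no error" if the cut was
        // caused by the detection limit rather than by the end of the file.
        Err(e) => truncated_at_limit && e.error_len().is_none(),
    }
}

// Decide the encoding: UTF-8 if detection succeeds, otherwise the
// locale-dependent legacy fallback (e.g. windows-1252 for Western locales).
fn choose_encoding(buffered: &[u8], truncated_at_limit: bool, fallback: &'static str) -> &'static str {
    if looks_like_utf8(buffered, truncated_at_limit) {
        "UTF-8"
    } else {
        fallback
    }
}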

The 50 MB limit exists to avoid buffering everything when loading a
log file whose size is on the order of a gigabyte. 50 MB is an
arbitrary size that is significantly larger than "normal" HTML or text
files, so that "normal"-sized files are examined with 100% confidence
(i.e. the whole file is examined) but can be assumed to fit in RAM
even on computers that only have a couple of gigabytes of RAM.

The limit, despite being arbitrary, is checked exactly to avoid
visible behavior changes depending on how Necko chooses buffer
boundaries.
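
As an illustration of the "checked exactly" point, the sketch below
(continuing the hypothetical Rust code above; the struct name is made
up) accumulates arbitrary-sized chunks and caps the examined prefix at
exactly the limit, so the outcome cannot depend on how Necko happens to
split the stream:

// Accumulates stream chunks for detection. The number of bytes examined
// depends only on the constant limit, never on chunk boundaries.
struct DetectionBuffer {
    bytes: Vec<u8>,
    truncated_at_limit: bool,
}

impl DetectionBuffer {
    fn new() -> Self {
        DetectionBuffer { bytes: Vec::new(), truncated_at_limit: false }
    }

    // Feed one chunk of arbitrary size, keeping at most `limit` bytes in total.
    fn push_chunk(&mut self, chunk: &[u8], limit: usize) {
        let room = limit.saturating_sub(self.bytes.len());
        if chunk.len() > room {
            self.truncated_at_limit = true;
        }
        self.bytes.extend_from_slice(&chunk[..chunk.len().min(room)]);
    }
}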

The limit is a number of bytes instead of a timeout in order to avoid
reintroducing timing dependencies (carefully removed in Firefox 4) to
HTML parsing--even for file: URLs.

Unless a <meta> declaring the encoding (or a BOM) is found within the
first 1024 bytes, up to 50 MB of input is buffered before starting
tokenizing. That is, the feature assumes that local files don't need
incremental HTML parsing, that local file streams don't stall as part
of their intended operation, and that the content of local files is
available in its entirety (approximately) immediately.
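
A rough sketch of that early-exit condition follows, with only the BOM
part spelled out; the <meta> prescan result is passed in as an
already-computed value because its details are out of scope here, and
none of these names are actual parser identifiers:

// Sniff a byte order mark at the very start of the file.
fn bom_encoding(prefix: &[u8]) -> Option<&'static str> {
    if prefix.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some("UTF-8")
    } else if prefix.starts_with(&[0xFF, 0xFE]) {
        Some("UTF-16LE")
    } else if prefix.starts_with(&[0xFE, 0xFF]) {
        Some("UTF-16BE")
    } else {
        None
    }
}

// True if neither a BOM nor a <meta>-declared charset was found in the first
// 1024 bytes, i.e. tokenizing has to wait until up to 50 MB has been buffered.
fn must_buffer_for_detection(first_1024: &[u8], meta_declared_charset: Option<&str>) -> bool {
    bom_encoding(first_1024).is_none() && meta_declared_charset.is_none()
}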

There are counter examples like Unix FIFOs (can be infinite and can
stall for an arbitrary amount of time) or file server shares mounted
as if they were local disks (data available somewhat less
immediately). It is assumed that it's OK to require people who have
built workflows around Unix FIFOs to use <meta charset=utf-8> and that it's
OK to potentially start rendering a little later when file: URLs
actually cause network access.

UTF-8 autodetection is given lower precedence than all other signals
that are presently considered for file: URLs. In particular, if a
file:-URL HTML document frames another file: URL HTML document (i.e.
they count as same-origin), the child inherits the encoding from the
parent instead of UTF-8 autodetection getting applied in the child
frame.
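
Schematically, the precedence looks like the ranking below. Only the
bottom two positions are specified by this message (detection sits just
above the locale fallback); the relative order of the higher signals
follows the general HTML encoding-sniffing order and is an assumption,
as are the names:

// Signals ordered from lowest to highest precedence; derived Ord compares by
// declaration order, so later variants outrank earlier ones.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
enum EncodingSignal {
    LocaleFallback,      // legacy default tied to the Firefox UI locale
    Utf8Detection,       // the new feature: outranks only the fallback
    InheritedFromParent, // same-origin file: parent frame's encoding
    MetaCharset,         // <meta> found during the prescan
    Bom,                 // byte order mark in the file itself
}

fn main() {
    // The framing example above: inheritance wins over UTF-8 detection.
    assert!(EncodingSignal::InheritedFromParent > EncodingSignal::Utf8Detection);
}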

# Why file: URLs only

The reason why the feature does not apply to http: or https: resources
is that in those cases, it really isn't OK to assume that all bytes
arrive so quickly as to not benefit from incremental rendering and it
isn't OK to assume that the stream doesn't intentionally stall.

Applying detection to http: or https: resources would mean at least one
of the following compromises:

* Making the detection unreliable by making it depend on non-ASCII
appearing in the first 1024 bytes (the number of bytes currently
buffered for scanning <meta>). If the <title> was always near the
start of the file and the natural language used a non-Latin script to
make non-ASCII in the <title> a certainty, this solution would be
reliable. However, this solution would be particularly bad for
Latin-script languages with infrequent non-ASCII, such as Finnish or
German, which can legitimately have all-ASCII titles despite the
language as a whole including non-ASCII. That is, if a developer
tested a site with a title that has some non-ASCII, things would
appear to work, but then the site would break when an all-ASCII title
occurs.

* Making results depend on timing. (Having a detection timeout would
make the results depend on network performance relative to wall-clock
time.)

* Making the detection unreliable by examining only the first buffer
passed by the networking subsystem to the HTML parser. This makes the
result dependent on network buffer boundaries (*and* potentially
timing to the extent timing affects the boundaries), which is
unreliable. Prior to Firefox 4, HTML parsing in Firefox depended on
network buffer boundaries, which was bad and was remedied in Firefox
4. According to
https://github.com/whatwg/encoding/issues/68#issuecomment-272993181 ,
Chrome chooses this mode of badness.

* Breaking incremental rendering. (Not acceptable for remote content
for user-perceived