Re: Intent to implement and ship: UTF-8 autodetection for HTML and plain text loaded from file: URLs
On Thu, Dec 13, 2018 at 1:14 AM Henri Sivonen wrote:
> I changed the limit to 4 MB.

SGTM.
Re: Intent to implement and ship: UTF-8 autodetection for HTML and plain text loaded from file: URLs
On Tue, Dec 11, 2018 at 10:08 AM Henri Sivonen wrote:
> How about I change it to 5 MB on the assumption that that's still very
> large relative to pre-UTF-8-era HTML and text file sizes?

I changed the limit to 4 MB.

-- 
Henri Sivonen
hsivo...@mozilla.com
Re: Intent to implement and ship: UTF-8 autodetection for HTML and plain text loaded from file: URLs
On Tue, Dec 11, 2018 at 2:24 AM Martin Thomson wrote:
> This seems reasonable, but 50M is a pretty large number. Given the
> odds of UTF-8 detection failing, I would have thought that this could
> be much lower.

Consider the case of a document of ASCII text with a copyright sign in
the footer. I'd rather not make anyone puzzle over why the behavior of
the footer depends on how much text comes before the footer.

50 MB is intentionally extremely large relative to "normal" HTML and
text files so that the limit is reached approximately "never" unless
you open *huge* log files. The HTML spec is about 11 MB these days, so
that's existence proof that a non-log-file HTML document can exceed
10 MB. Of course, the limit doesn't need to be larger than present-day
UTF-8 files, only larger than "normal"-sized *legacy* non-UTF-8 files.

It is quite possible that 50 MB is *too* large considering 32-bit
systems and what *other* allocations are proportional to the buffer
size, and I'm open to changing the limit to something smaller than
50 MB as long as it's still larger than "normal" non-UTF-8 HTML and
text files. How about I change it to 5 MB on the assumption that that's
still very large relative to pre-UTF-8-era HTML and text file sizes?

> What is the number in Chrome?

It depends. It's unclear to me what exactly it depends on. Based on
https://github.com/whatwg/encoding/issues/68#issuecomment-272993181 ,
I expect it to depend on some combination of file system, OS kernel and
Chromium IO library internals. On Ubuntu 18.04 with ext4 on an SSD, the
number is 64 KB. On Windows 10 1803 with NTFS on an SSD, it's something
smaller.

I think making the limit depend on the internals of file IO buffering
instead of a constant in the HTML parser is a really bad idea. Also,
64 KB or something less than 64 KB seems way too small for the purpose
of making it so that the user approximately never needs to puzzle over
why things are different based on the length of the ASCII prefix of a
file with non-ASCII later in the file.

> I assume that other local sources like chrome: are expected to be
> annotated properly.

From source inspection, it seems that chrome: URLs already get
hard-coded to UTF-8 on the channel level:
https://searchfox.org/mozilla-central/source/chrome/nsChromeProtocolHandler.cpp#187

As part of developing the patch, I saw only resource: URLs showing up
as file: URLs to the HTML parser, so only resource: URLs got a special
check that fast-tracks them to UTF-8 instead of buffering for detection
like normal file: URLs.
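To make the footer scenario concrete, here is a small illustrative sketch (hypothetical helper names and document sizes; not code from Firefox or Chrome). A detector that only examines a fixed-size prefix, such as a first 64 KB IO buffer, sees nothing but ASCII in a document whose only non-ASCII content is a copyright sign in the footer, so its verdict ends up depending on how much text precedes the footer:

```rust
// Illustrative sketch only (hypothetical names, not browser code): a document
// that is pure ASCII except for a copyright sign in the footer looks like
// plain ASCII to a prefix-limited detector, so the detection outcome would
// depend on how much text comes before the footer.

fn prefix_is_ascii(doc: &[u8], prefix_len: usize) -> bool {
    doc[..doc.len().min(prefix_len)].is_ascii()
}

fn main() {
    let body = "lorem ipsum ".repeat(10_000); // ~120 KB of ASCII content
    let doc = format!("{}\n<p>© 2018 Example</p>\n", body);

    // A 64 KB prefix contains no non-ASCII, so a prefix-limited check learns
    // nothing about whether the file is UTF-8...
    assert!(prefix_is_ascii(doc.as_bytes(), 64 * 1024));

    // ...while validating the whole file shows it is well-formed UTF-8.
    assert!(std::str::from_utf8(doc.as_bytes()).is_ok());
}
```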
Re: Intent to implement and ship: UTF-8 autodetection for HTML and plain text loaded from file: URLs
This seems reasonable, but 50M is a pretty large number. Given the odds
of UTF-8 detection failing, I would have thought that this could be
much lower. What is the number in Chrome?

I assume that other local sources like chrome: are expected to be
annotated properly.
Intent to implement and ship: UTF-8 autodetection for HTML and plain text loaded from file: URLs
(Note: This isn't really a Web-exposed feature, but this is a Web
developer-exposed feature.)

# Summary

Autodetect UTF-8 when loading HTML or plain text from file: URLs (only!).

Some Web developers like to develop locally from file: URLs (as opposed
to a local HTTP server) and then deploy using a Web server that declares
charset=UTF-8. To get the same convenience as when developing with
Chrome, they want the files loaded from file: URLs to be treated as
UTF-8 even though the HTTP header isn't there.

Non-developer users save files from the Web verbatim without the HTTP
headers and open the files from file: URLs. These days, those files are
most often in UTF-8 and lack the BOM, and sometimes they lack
<meta charset=utf-8>, and plain text files can't even use
<meta charset=utf-8>. These users, too, would like a Chrome-like
convenience when opening these files from file: URLs in Firefox.

# Details

If an HTML or plain text file loaded from a file: URL does not contain
a UTF-8 error in the first 50 MB, assume it is UTF-8. (It is extremely
improbable for text intended to be in a non-UTF-8 encoding to look like
valid UTF-8 on the byte level.) Otherwise, behave like at present:
assume the fallback legacy encoding, whose default depends on the
Firefox UI locale.

The 50 MB limit exists to avoid buffering everything when loading a log
file whose size is on the order of a gigabyte. 50 MB is an arbitrary
size that is significantly larger than "normal" HTML or text files, so
that "normal"-sized files are examined with 100% confidence (i.e. the
whole file is examined) but can be assumed to fit in RAM even on
computers that only have a couple of gigabytes of RAM.

The limit, despite being arbitrary, is checked exactly to avoid visible
behavior changes depending on how Necko chooses buffer boundaries.

The limit is a number of bytes instead of a timeout in order to avoid
reintroducing timing dependencies (carefully removed in Firefox 4) to
HTML parsing--even for file: URLs.

Unless a <meta> declaring the encoding (or a BOM) is found within the
first 1024 bytes, up to 50 MB of input is buffered before starting to
tokenize. That is, the feature assumes that local files don't need
incremental HTML parsing, that local file streams don't stall as part
of their intended operation, and that the content of local files is
available in its entirety (approximately) immediately.

There are counterexamples like Unix FIFOs (which can be infinite and
can stall for an arbitrary amount of time) or file server shares
mounted as if they were local disks (data available somewhat less
immediately). It is assumed that it's OK to require people who have
built workflows around Unix FIFOs to use <meta charset=utf-8> and that
it's OK to potentially start rendering a little later when file: URLs
actually cause network access.

UTF-8 autodetection is given lower precedence than all other signals
that are presently considered for file: URLs. In particular, if a
file:-URL HTML document frames another file:-URL HTML document (i.e.
they count as same-origin), the child inherits the encoding from the
parent instead of UTF-8 autodetection getting applied in the child
frame.
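As a rough sketch of the rule described under # Details (the constant, the function names, and the handling of a multi-byte sequence cut at the limit are illustrative assumptions, not the actual Gecko code): the buffered prefix is treated as UTF-8 if it contains no UTF-8 error, and anything else falls back to the locale-dependent legacy encoding.

```rust
// Rough illustration of the detection rule; names and structure are
// hypothetical, not the actual Gecko implementation.

const LIMIT: usize = 50 * 1024 * 1024; // the arbitrary buffering limit

/// Returns true if the buffered bytes (at most LIMIT bytes from the start of
/// the file) contain no UTF-8 error.
fn buffered_prefix_is_utf8(buffered: &[u8]) -> bool {
    match std::str::from_utf8(buffered) {
        Ok(_) => true,
        Err(e) => {
            // error_len() == None means the input merely ended in the middle
            // of a multi-byte sequence. In this sketch that is forgiven only
            // when the file was cut at the buffering limit; at a real end of
            // file it counts as an encoding error.
            e.error_len().is_none() && buffered.len() == LIMIT
        }
    }
}

/// Picks the encoding label: UTF-8 if detection succeeds, otherwise the
/// locale-dependent legacy fallback (e.g. windows-1252 for Western locales).
fn choose_encoding(buffered: &[u8], legacy_fallback: &'static str) -> &'static str {
    if buffered_prefix_is_utf8(buffered) {
        "UTF-8"
    } else {
        legacy_fallback
    }
}

fn main() {
    // "café" encoded as windows-1252 (0xE9) is not valid UTF-8, so the
    // legacy fallback wins; the same text encoded as UTF-8 is detected.
    assert_eq!(choose_encoding(b"caf\xe9 au lait", "windows-1252"), "windows-1252");
    assert_eq!(choose_encoding("café au lait".as_bytes(), "windows-1252"), "UTF-8");
}
```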
# Why file: URLs only

The reason why the feature does not apply to http: or https: resources
is that in those cases, it really isn't OK to assume that all bytes
arrive so quickly as to not benefit from incremental rendering, and it
isn't OK to assume that the stream doesn't intentionally stall.

Applying detection to http: or https: resources would mean at least one
of the following compromises:

* Making the detection unreliable by making it depend on non-ASCII
appearing in the first 1024 bytes (the number of bytes currently
buffered for <meta> scanning). If the <title> was always near the start
of the file and the natural language used a non-Latin script to make
non-ASCII in the <title> a certainty, this solution would be reliable.
However, this solution would be particularly bad for Latin-script
languages with infrequent non-ASCII, such as Finnish or German, which
can legitimately have all-ASCII titles despite the language as a whole
including non-ASCII. That is, if a developer tested a site with a title
that has some non-ASCII, things would appear to work, but then the site
would break when an all-ASCII title occurs.

* Making results depend on timing. (Having a detection timeout would
make the results depend on network performance relative to wall-clock
time.)

* Making the detection unreliable by examining only the first buffer
passed by the networking subsystem to the HTML parser. This makes the
result dependent on network buffer boundaries (*and* potentially timing
to the extent timing affects the boundaries), which is unreliable.
Prior to Firefox 4, HTML parsing in Firefox depended on network buffer
boundaries, which was bad and was remedied in Firefox 4. According to
https://github.com/whatwg/encoding/issues/68#issuecomment-272993181 ,
Chrome chooses this mode of badness.

* Breaking incremental rendering. (Not acceptable for remote content
for user-perceived