Re: Cost of ICU data
Hi, Bill from user research here. We just finished some research in Thailand and Indonesia where we conducted ~40 interviews with desktop browser users (half of whom were Firefox users). We'll be presenting findings from the research next month, but I'd like to share a few observations from the field that give a clear picture of the Internet infrastructure in emerging markets in SE Asia.

Indonesia is worth focusing on for the discussion because it has a large population and Firefox has a large market share there. The infrastructure is similar to India's, which has an even larger population.

Some context:

First, the connection speeds are really, really slow and stability is poor. Only 3% of the population in Indonesia has wired home connections. Everyone else connects at wifi hotspots, at internet cafes, or over the 3G network. Even these connections are slow. Average connection speed is 3 Mbps compared to 20 in the US. An example to give some context: most people are not able to stream video from YouTube. They install add-ons (such as IDM) to download the videos and watch them later.

Second, most users buy their computers from local vendors, not chain stores. The local vendors preinstall software on the computers including Firefox (and Chrome). Many of these versions of Firefox are older. We saw versions 12, 15, 18. Some of these have add-ons preinstalled (such as Yahoo, etc.). Others are configured to prevent updates.

There is a high correlation between download speed and being up-to-date with Firefox. We know from metrics data that ~50% of users in Indonesia are using versions other than the current version of Firefox. Only the wealthiest of our participants had the most current version of Firefox. Our lower-income participants who were connecting to the Internet had older versions and add-ons that were hijacking search and the user experience in general.

The key point is that download size is very important in these markets. Also, it is important for us to think about two related topics:

1) How to get people in these markets to current versions of Firefox?
2) If downloading is not currently the most effective distribution model in emerging markets, how can we think of alternatives or make downloading work?

One final point: we have observed that in rural parts of N. America connection speeds and stability are similar. So, it's not only an emerging vs. emerged markets challenge.

Please let me know if you have more specific or follow-up questions. I'd love to share what we learned.

Thanks!
Bill
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
Re: Cost of ICU data
On 10/15/2013 12:06 PM, Benjamin Smedberg wrote:

With the landing of bug 853301, we are now shipping ICU in desktop Firefox builds. This costs us about 10% in both download and on-disk footprint

I'm going to try and summarize the discussion and indicate next steps.

==

First, I want to be clear that I am approaching this question for desktop Firefox only. B2G has different adoption requirements and a different localization strategy than desktop.

==

Several people thought that it would be a bad thing to include only one or some subset of language data in Firefox as shipped by default, on the grounds that there should be one web platform. Anne brought up the use case of a computer in a hotel/hostel. Axel points out that ICU locales are more analogous to language dictionaries. Users can choose to install dictionaries independent of the UI locale of Firefox.

Jeff is worried that any sort of dynamic system would cause the Intl.* methods to return different results over time, which is surprising to platform developers. Jeff also said that Chromium may not be shipping the full language list, but perhaps only a subset of languages. I tested Chrome's behavior, and it appears to be shipping a fairly full set of language data, including languages such as Amharic which I'm pretty sure it doesn't ship.

I'll also mention that this came up in the previous discussion last December, and at the time we discussed whether it would be better for websites to provide their own implementation of these intl functions and download whatever data they needed; the obvious disadvantage of this is that each site would be downloading the data separately without sharing, which is not a good experience for developers.

==

jwatt mentioned that he has a dependency on DecimalFormat for parsing numbers from input type=number. What locale data does this actually require?

==

mbrubeck wonders why this particular feature is being questioned based on its size, when in general the Firefox package size has gotten larger with other features but without a lot of fuss.

I am questioning this feature now because it is a sizeable jump even by historical standards, and because I was made aware of data that shows that download size affects both initial adoption and update rates. Perhaps we have been adding features to the platform too liberally and affecting adoption. Perhaps we need to set an absolute cap on download size, and figure out how to work within that cap. I don't really know the answers, but we should all be worried about our adoption and market share numbers; death by a fairly small set of 10% increases is still a big deal.

==

There was a technical discussion about how we could implement dynamic download of more languages, and whether the spec made that easy or hard. It is clear that the current spec is synchronous and doesn't have a way to request additional languages and wait for them. We could do the download and start showing results later, but we can't really block on that data. My only other thought here is whether we should propose for the Intl draft an additional async API to request new languages and get a promise back for when they are ready.

==

cpeterson asked whether we have funnelcake data to actually measure the effects of additional download weight. I had been pushing this with our stats/UR group, and this is now filed as bug 928017. I don't have commitments to make this happen yet, but I'm working on it.

==

Bill provided more details about the user research data about connection speeds and update rates. The summary seems to be that update rates are much lower for users with slower connection speeds.

==

I don't think that there is enough data yet to make a decision. Hopefully funnelcake results will help make a more informed choice. If it turns out that Firefox wants this decision reconsidered, what groups and goals would be affected by asking for ICU or at least the number-format and date-format APIs to be disabled for download weight reasons in 27?

To be clear, the final decision is definitely not mine to make: I just want to make sure that we know what we're trading off and that it's clearly what we ought to be doing.

--BDS
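
The async idea in the summary above (request new languages and get a promise back) could look roughly like the sketch below. Everything in it is hypothetical: `ensureLocaleData`, `fetchLocalePack`, `missingLocales` and the `installed` list are illustrative names, not part of the Intl draft or of Gecko, and the "download" is a stand-in that resolves immediately. It only shows the shape of such an API: the page asks for locales up front, and the existing synchronous Intl calls keep using whatever data is already local.

```typescript
// Hypothetical sketch: none of these names exist in the Intl spec or in Gecko.

// Locales whose data is already on disk (assumption: only the UI locale ships).
const installed: string[] = ["en-US"];

// Return the requested locales that still need a download.
function missingLocales(requested: string[], have: string[]): string[] {
  return requested.filter((tag) => have.indexOf(tag) === -1);
}

// Stand-in for a network fetch of one locale's data pack.
function fetchLocalePack(tag: string): Promise<string> {
  return Promise.resolve(tag);
}

// Resolve once every requested locale is available locally.
function ensureLocaleData(requested: string[]): Promise<string[]> {
  const pending = missingLocales(requested, installed).map((tag) =>
    fetchLocalePack(tag).then((done) => {
      installed.push(done);
      return done;
    })
  );
  return Promise.all(pending).then(() => requested);
}

// Usage: ask first, then call the synchronous Intl methods as usual.
ensureLocaleData(["en-US", "th"]).then((ready) => {
  console.log("locale data ready for: " + ready.join(", "));
});
```

A page that never calls the hypothetical `ensureLocaleData` would still get synchronous answers, just from the smaller built-in set, which is the trade-off the summary describes.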
Re: Cost of ICU data
On 10/22/13 11:34 AM, wsel...@mozilla.com wrote:

The key point is that download size is very important in these markets. Also, it is important for us to think about two related topics: 1) How to get people in these markets to current versions of Firefox? 2) If downloading is not currently the most effective distribution model in emerging markets, how can we think of alternatives or make downloading work?

AOL had good success with CDs. :)

I'm only half-joking! CDs are cheap and we have enthusiastic Mozilla Reps in many countries. We could make official ISO images and CD art designs available on our website. We could provide or subsidize blank CDs, CD burners, and CD stickers to official Mozilla Reps. They could help update people's browsers. The CD art could include slogans like "after you install Firefox, pass this CD along to your friends."

cpeterson
Re: Cost of ICU data
On 10/22/13, 2:09 PM, wsel...@gmail.com wrote:

One suggestion that our team came up with is to provide Firefox-branded USB keys and distribute them through reps, chains like KFC and 7-11, and local computer vendors where people connect online. These would have installers for the latest version of Firefox and FF for Android.

I like the USB thumb drive idea because they are reusable, but (I assume) they are more expensive than CDs. But maybe USB thumb drives are cheaper because you don't need to buy a CD burner or blank CDs.

chris
Re: Cost of ICU data
On 2013-10-22 4:06 PM, Benjamin Smedberg wrote:

I don't think that there is enough data yet to make a decision. Hopefully funnelcake results will help make a more informed choice. If it turns out that Firefox wants this decision reconsidered, what groups and goals would be affected by asking for ICU or at least the number-format and date-format APIs to be disabled for download weight reasons in 27?

More and more people in libxul want access to ICU. See the dependency list for bug 915735 for a (partial) list.

Cheers,
Ehsan
Re: Cost of ICU data
On 10/22/2013 6:19 PM, Ehsan Akhgari wrote:

On 2013-10-22 4:06 PM, Benjamin Smedberg wrote:

I don't think that there is enough data yet to make a decision. Hopefully funnelcake results will help make a more informed choice. If it turns out that Firefox wants this decision reconsidered, what groups and goals would be affected by asking for ICU or at least the number-format and date-format APIs to be disabled for download weight reasons in 27?

More and more people in libxul want access to ICU. See the dependency list for bug 915735 for a (partial) list.

I'm aware of that, but I'm not clear on whether any of those features require the language data we're talking about here, and whether having the single UI locale or all locales would be necessary. I know that the indexeddb use-case only requires the collation tables and not any of the locale data.

--BDS
Re: Cost of ICU data
On 2013-10-22 6:34 PM, Benjamin Smedberg wrote:

On 10/22/2013 6:19 PM, Ehsan Akhgari wrote:

On 2013-10-22 4:06 PM, Benjamin Smedberg wrote:

I don't think that there is enough data yet to make a decision. Hopefully funnelcake results will help make a more informed choice. If it turns out that Firefox wants this decision reconsidered, what groups and goals would be affected by asking for ICU or at least the number-format and date-format APIs to be disabled for download weight reasons in 27?

More and more people in libxul want access to ICU. See the dependency list for bug 915735 for a (partial) list.

I'm aware of that, but I'm not clear on whether any of those features require the language data we're talking about here, and whether having the single UI locale or all locales would be necessary. I know that the indexeddb use-case only requires the collation tables and not any of the locale data.

Yes, that's correct. Simon and Jonathan can probably clarify which parts of ICU they're hoping to use.

Cheers,
Ehsan
Re: Cost of ICU data
On 10/17/13 11:43 AM, Matt Brubeck wrote:

For this reason, I'm a bit confused at the level of scrutiny of ICU's size when we've added many times that amount to our download size over the past couple of years without any pushback or even discussion.

Do we have Funnelcake data comparing download size vs successful installations for 2013? If we don't know how big is too big, blocking ICU seems premature (but still worth investigation). Download size is a concern for users in developing countries, but the same users will benefit from ICU.

chris
Re: Cost of ICU data
On 10/18/13 4:06 PM, Chris Peterson wrote:

On 10/17/13 11:43 AM, Matt Brubeck wrote:

For this reason, I'm a bit confused at the level of scrutiny of ICU's size when we've added many times that amount to our download size over the past couple of years without any pushback or even discussion.

Do we have Funnelcake data comparing download size vs successful installations for 2013? If we don't know how big is too big, blocking ICU seems premature (but still worth investigation). Download size is a concern for users in developing countries, but the same users will benefit from ICU.

Also, if the ICU data does push us over the download size limit, then we may be able to decrease download size and/or improve download success through other means.

chris
Re: Cost of ICU data
On 10/17/13 12:02 PM, Gervase Markham wrote:

On 16/10/13 16:02, Axel Hecht wrote:

We'll need to go down a path that works for Firefox OS.

With Firefox OS, we don't have the download-size issue, do we? So we can ship all the data.

Gerv

We have issues with disk space, currently. We're already in the situation where all our keyboard data doesn't fit on quite a few of the devices out there.

Also, FOTA size matters a bit, though that's probably less of a problem.

Axel
Re: Cost of ICU data
On 10/16/13 5:39 PM, Jeff Walden wrote:

On 10/16/2013 02:10 PM, Axel Hecht wrote:

I wonder how far we can get by doing something along the lines we use for webfonts, starting to do the best we can with the data we already have, and improve once the perfect data is local.

Having the Intl.Foo algorithms returning different data over time is, IMO, even worse than deciding that certain locales are less important than others. Aside from Math.random, of course, I can't think of anything in JS that has different behavior on the same inputs over time.

Jeff

For one, I don't think that's true for the web. You might think so in terms of stuff in the js specs, but the distinction between that and html5 and all kinds of server errors and timing differences is just theory.

More importantly, the impact of supporting a finite set of languages can easily be the nail in the coffin for the others. I don't think that's what Mozilla stands for.

Axel
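
To make the disagreement above concrete, here is a minimal sketch of the "do the best we can with the data we already have" approach. It is an assumption-laden illustration, not the Intl spec's algorithm: `resolveLocale` is a hypothetical helper that drops trailing subtags until it finds a locale with local data, and otherwise uses a default. It shows exactly the effect Jeff objects to: the same request resolves differently once more data has arrived.

```typescript
// Hypothetical sketch of truncation-based locale fallback; the names and the
// rule are assumptions for illustration only.

// Pick the best locally available locale for a requested tag by dropping
// trailing subtags ("id-ID" -> "id"); use the fallback if nothing matches.
function resolveLocale(requested: string, available: string[], fallback: string): string {
  let candidate = requested;
  while (candidate.length > 0) {
    if (available.indexOf(candidate) !== -1) {
      return candidate;
    }
    const cut = candidate.lastIndexOf("-");
    candidate = cut === -1 ? "" : candidate.slice(0, cut);
  }
  return fallback;
}

const available = ["en-US"];
const before = resolveLocale("id-ID", available, "en-US");
available.push("id"); // Indonesian data arrives later
const after = resolveLocale("id-ID", available, "en-US");
console.log(before, after); // prints "en-US id"
```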
Re: Cost of ICU data
On 16.10.2013 17:02, Axel Hecht wrote:

We'll need to go down a path that works for Firefox OS. [...] But, yes, I think we'll need a hosted service to provide that data on demand in the end.

This sounds like a non-starter for mobile devices, doesn't it?
Re: Cost of ICU data
On 10/17/13 2:41 PM, Dao wrote:

On 16.10.2013 17:02, Axel Hecht wrote:

We'll need to go down a path that works for Firefox OS. [...] But, yes, I think we'll need a hosted service to provide that data on demand in the end.

This sounds like a non-starter for mobile devices, doesn't it?

Well, it makes the implementation trickier. Of course, Telefonica just updated the phones from 1.0.1 to 1.1 in Spain, over the air without charges, so the infrastructure is there. It's an organizational effort to tie into that infrastructure. We'll need a reference implementation like we have with software update, and then get our partner contacts in shape to explain how to do that on their side. Plus customizable hooks, of course.

And then, yes, we'd need to still disable the downloads, or make them really optional, if you're on roaming data or something. But software update can do that already, too, I suspect.

Axel
Re: Cost of ICU data
On Thu, Oct 17, 2013 at 3:46 AM, Axel Hecht l...@mozilla.com wrote:

We have issues with disk space, currently. We're already in the situation where all our keyboard data doesn't fit on quite a few of the devices out there.

Where can one read more about this? This ICU data is not *that* huge. If we can't afford a couple of megabytes now on B2G then it seems like we're in for severe problems soon. Isn't Gecko alone growing by megabytes per year?

Cheers,
Brian
--
Mozilla Networking/Crypto/Security (Necko/NSS/PSM)
Re: Cost of ICU data
On 10/17/13 3:41 PM, Brian Smith wrote:

On Thu, Oct 17, 2013 at 3:46 AM, Axel Hecht l...@mozilla.com wrote:

We have issues with disk space, currently. We're already in the situation where all our keyboard data doesn't fit on quite a few of the devices out there.

Where can one read more about this? This ICU data is not *that* huge. If we can't afford a couple of megabytes now on B2G then it seems like we're in for severe problems soon. Isn't Gecko alone growing by megabytes per year?

I wish there were docs and clear cuts. We've been in dire problems already, when our QA smoketest phones wouldn't get updates for days due to system.img being too large. And thus we didn't get QA to run tests.

These are the questions I asked last time, and don't have answers to:

- What exactly are the limiting sizes?
-- image size (per bootloader?)
-- disk partition size
--- at which point in time? user dependent?
--- can we have telemetry for this, if so?

I suspect we're talking about the joint size for gaia and gecko, but I'm not sure that's the case, or at least always the case. I.e., do we get a cookie if we move data from gaia into gecko?

There's probably more that I don't know, just because I don't know much about phones, and the various processes to get software on to them.

Axel
Re: Cost of ICU data
On 10/17/2013 10:24 AM, Ehsan Akhgari wrote:

We used to have codesighs measurements (and perhaps still do) but historically many people just ignored them.

We stopped collecting codesighs measurements in November 2012 (bug 803736). As Ehsan says, it was widely ignored. It regressed constantly, and it never seemed reasonable to demand that people implement desired features and fixes without adding any code.

For this reason, I'm a bit confused at the level of scrutiny of ICU's size when we've added many times that amount to our download size over the past couple of years without any pushback or even discussion.

(On a related note, what happened to http://www.arewesmallyet.com/?)
Re: Cost of ICU data
Jumping in late, so top posting.

I think being able to load language data dynamically is a good idea. I don't see a reason why this should be tied in to a language pack, though. The other way around is a different question, i.e.

language data doesn't include UI localization
UI localization should include language data

We have several multi-language products by now, those should work, in particular Firefox OS. We're doing quite a few things there that already duplicate language data. Much of that is in /shared, which isn't shared, but copied to many apps. Having that data inside gecko would actually get it to be shared.

I think much of the ICU data (which is technically CLDR data packed in ICU mostly) flows along similar lines of our hyphenation dictionaries. The web should just work, independent of which UI locale you're using.

I wonder how far we can get by doing something along the lines we use for webfonts, starting to do the best we can with the data we already have, and improve once the perfect data is local. I'm personally OK if this is a notification bar to reload, even.

Axel

PS: ICU is driven by js globalization api. That API was driven by MS and Google to get the data into their html app platforms. For mozilla, IMHO, the driver for g18n api should be Firefox OS, we're struggling to work around the lack of data for sorting, timezones, language data all around.

On 10/15/13 6:06 PM, Benjamin Smedberg wrote:

With the landing of bug 853301, we are now shipping ICU in desktop Firefox builds. This costs us about 10% in both download and on-disk footprint: see https://bugzilla.mozilla.org/show_bug.cgi?id=853301#c2. After a discussion with Waldo, I'm going to post some details here about how much this costs in terms of disk footprint, to discuss whether there are things we can remove from this footprint, and whether the footprint is actually worth the cost.

This is particularly important because our user research team has identified Firefox download weight as an important factor affecting Firefox adoption and update rates in some markets.

On-disk, ICU data breaks into the following categories:

* collation tables - 3.3 MB
  These are rules for sorting strings in multiple languages and situations. See http://userguide.icu-project.org/collation for basic background. These tables are necessary for implementing Intl.Collator. The Intl.Collator API has methods to expose a subset of languages. It is not clear from my reading of the specification whether it is expected that browsers will normally ship with the full set of languages or only the subset of the browser locale.
* currency tables - 1.9 MB
  These are primarily the localized name of each currency in each language. This is used by the Intl.NumberFormat API to format international currencies.
* timezone tables - 1.7 MB
  Primarily the name of every time zone in each language. This data is necessary for implementing Intl.DateTimeFormat.
* language data - 2.1 MB
  This is a bunch of other data associated with displaying information for a particular language: number formatting in various long and short formats, calendar formats and names for the various world calendar systems.

==

Do we need this data for any language other than the language Firefox ships in? Can we just include the relevant language data in each localized build of Firefox, and allow users to get other language data via downloadable language packs, similarly to how dictionaries are handled?

Is it possible that some of this data (the collation tables?) should be in all Firefox locales, but other data (currency and timezone names) is not as important and we can ship it only in one language?

As far as I can tell, the spec allows user agents to ship whatever languages they need; the real question is what users and site authors actually need and expect out of the API. (I'm reading the spec out of http://wiki.ecmascript.org/doku.php?id=globalization:specification_drafts)

I am still working to get better numbers to quantify the costs in terms of lost adoption for additional download weight.

Also, we are currently duplicating the data tables on mac universal builds, because they are compiled-in symbols. We should clearly use a separate file for these tables to avoid unnecessary download/install weight. This is now filed as bug 926980.

--BDS
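
As a quick check on the on-disk breakdown quoted above, the sketch below adds up the four categories and prints each one's share. The sizes are copied from the message; the variable names are only illustrative, and these are on-disk figures, not compressed download sizes.

```typescript
// On-disk sizes in MB, copied from the category list in the message above.
const categories: { name: string; mb: number }[] = [
  { name: "collation tables", mb: 3.3 },
  { name: "currency tables", mb: 1.9 },
  { name: "timezone tables", mb: 1.7 },
  { name: "language data", mb: 2.1 },
];

// Sum of the listed categories (other ICU code and data is not included).
const totalMb = categories.reduce((sum, c) => sum + c.mb, 0);

for (const c of categories) {
  const share = (100 * c.mb) / totalMb;
  console.log(c.name + ": " + c.mb.toFixed(1) + " MB (" + share.toFixed(0) + "% of listed data)");
}
console.log("total: " + totalMb.toFixed(1) + " MB");
```

The listed categories come to about 9 MB on disk. The currency and timezone names, which the message suggests might ship in only one language, account for 3.6 MB of that.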
Re: Cost of ICU data
On 15/10/13 17:06, Benjamin Smedberg wrote:

With the landing of bug 853301, we are now shipping ICU in desktop Firefox builds. This costs us about 10% in both download and on-disk footprint: see https://bugzilla.mozilla.org/show_bug.cgi?id=853301#c2. After a discussion with Waldo, I'm going to post some details here about how much this costs in terms of disk footprint, to discuss whether there are things we can remove from this footprint, and whether the footprint is actually worth the cost. This is particularly important because our user research team has identified Firefox download weight as an important factor affecting Firefox adoption and update rates in some markets.

You have given on-disk footprint values, but surely download size values are the important ones for the issue you are raising? After all, some of this data may be very compressible, and some may not.

* currency tables - 1.9 MB These are primarily the localized name of each currency in each language. This is used by the Intl.NumberFormat API to format international currencies.
* timezone tables - 1.7 MB Primarily the name of every time zone in each language. This data is necessary for implementing Intl.DateTimeFormat.

I wonder if we could do this as a webservice? That is, when the browser is asked to render a timezone string or a currency string in a particular language, it goes and grabs all the data for that language. We could therefore have full support, but a one-off delay for each new language the user wanted to see UI rendered in (which, for most people, will be a very small set).

We could ship a set of common ones plus the UI language one to reduce still further the number of times the service would get hit.

Gerv
Re: Cost of ICU data
On Wed, Oct 16, 2013 at 2:39 PM, Gervase Markham g...@mozilla.org wrote:

I wonder if we could do this as a webservice? That is, when the browser is asked to render a timezone string or a currency string in a particular language, it goes and grabs all the data for that language. We could therefore have full support, but a one-off delay for each new language the user wanted to see UI rendered in (which, for most people, will be a very small set). We could ship a set of common ones plus the UI language one to reduce still further the number of times the service would get hit.

The API is synchronous so that seems like a bad idea.

--
http://annevankesteren.nl/
Re: Cost of ICU data
On 16/10/13 14:47, Anne van Kesteren wrote:

The API is synchronous so that seems like a bad idea.

As in, it'll cause the tab to freeze (one time only, when a new language is called for) while the file is downloading? OK, that's bad, but so is having Firefox be a lot bigger...

Perhaps, as Brian suggested, we should be looking at using the Windows APIs and/or system ICU for some of this data, even if there are some things for which we want to ship our own implementation.

Gerv
Re: Cost of ICU data
On 10/16/13 3:50 PM, Gervase Markham wrote:

On 16/10/13 14:47, Anne van Kesteren wrote:

The API is synchronous so that seems like a bad idea.

As in, it'll cause the tab to freeze (one time only, when a new language is called for) while the file is downloading? OK, that's bad, but so is having Firefox be a lot bigger...

Perhaps, as Brian suggested, we should be looking at using the Windows APIs and/or system ICU for some of this data, even if there are some things for which we want to ship our own implementation.

Gerv

We'll need to go down a path that works for Firefox OS.

I think that being less-than-great at the first time you hit something off the main track is OK. We should see what actually happens with what's in the g18n apis now. We'll likely also need a way to free excessive use of disk space, or DOS attacks by sneaking up little fragments of language content for 200 languages or somesuch.

But, yes, I think we'll need a hosted service to provide that data on demand in the end.

Axel
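
One way to picture the "free excessive use of disk space" concern above is a size-capped store that evicts the least recently used locale data when a new download would exceed the cap. This is only a sketch under stated assumptions: `LocalePackCache`, its methods, and the byte figures are all hypothetical, and a real implementation would also have to decide what counts as "use" and how to rate-limit requests from content.

```typescript
// Hypothetical sketch: a size-capped, least-recently-used store for
// downloaded locale data. Names and sizes are illustrative assumptions.

interface Pack {
  tag: string;
  bytes: number;
}

class LocalePackCache {
  private packs: Pack[] = []; // least recently used first
  private maxBytes: number;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  private used(): number {
    return this.packs.reduce((sum, p) => sum + p.bytes, 0);
  }

  private find(tag: string): number {
    for (let i = 0; i < this.packs.length; i++) {
      if (this.packs[i].tag === tag) return i;
    }
    return -1;
  }

  // Mark a locale as just used by moving it to the most-recent end.
  touch(tag: string): boolean {
    const i = this.find(tag);
    if (i === -1) return false;
    this.packs.push(this.packs.splice(i, 1)[0]);
    return true;
  }

  // Store a downloaded pack, evicting least recently used packs if needed.
  add(tag: string, bytes: number): void {
    const i = this.find(tag);
    if (i !== -1) this.packs.splice(i, 1);
    this.packs.push({ tag: tag, bytes: bytes });
    while (this.used() > this.maxBytes && this.packs.length > 1) {
      this.packs.shift();
    }
  }

  tags(): string[] {
    return this.packs.map((p) => p.tag);
  }
}

const cache = new LocalePackCache(300000);
cache.add("th", 120000);
cache.add("id", 100000);
cache.touch("th");
cache.add("am", 150000); // over the cap, so "id" (least recently used) is evicted
console.log(cache.tags().join(", ")); // prints "th, am"
```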
Re: Cost of ICU data
On 10/16/2013 12:45 AM, Karl Tomlinson wrote:

When sync I/O is performed to read in-binary-object data, how is that better? Just readahead?

Readahead, it being part of the binary/libxul/whatever so already one coherent file to load, etc. I'm not aware that you can reasonably rely on adjacency predictions from the OS if you use separate files. But I could be mistaken about that.

Jeff
Re: Cost of ICU data
On 10/16/13 6:39 AM, Gervase Markham wrote:

You have given on-disk footprint values, but surely download size values are the important ones for the issue you are raising? After all, some of this data may be very compressible, and some may not.

Can we repackage the ICU data so we can compress it using a smarter content-aware algorithm? We could decompress the ICU data on first use.

chris
Re: Cost of ICU data
On 10/16/2013 9:39 AM, Gervase Markham wrote:

On 15/10/13 17:06, Benjamin Smedberg wrote:

You have given on-disk footprint values, but surely download size values are the important ones for the issue you are raising? After all, some of this data may be very compressible, and some may not.

Correct. The download weight costs are listed in the bug, https://bugzilla.mozilla.org/show_bug.cgi?id=853301#c2

                                  with ICU   without ICU   difference   increase
MacOS X, 32+64 bit (dmg):         60.7 MB    54.7 MB       5.9 MB       10.8 %
Windows, 32 bit (installer.exe):  22.4 MB    20.5 MB       1.9 MB        9.2 %

I don't know whether there is a way to more optimally compress these in the installer.

--BDS
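
To put the installer figures above next to the connection speeds Bill reported earlier in the thread (3 Mbps average in Indonesia, 20 in the US), the sketch below estimates download times for the Windows installer. It reads the larger figure (22.4 MB) as the size with ICU and the smaller (20.5 MB) as the size without. The assumptions are loud ones: 1 MB is treated as 8 Mbit, and the average speed is assumed to hold for the whole download, which the research suggests is optimistic on unstable links.

```typescript
// Rough estimate only. Installer sizes in MB come from the figures above;
// link speeds in Mbit/s come from the user research message in this thread.

// Seconds to transfer sizeMb megabytes at a sustained mbitPerSecond.
function downloadSeconds(sizeMb: number, mbitPerSecond: number): number {
  return (sizeMb * 8) / mbitPerSecond;
}

const windowsInstaller = { withIcu: 22.4, withoutIcu: 20.5 };
const speeds = [
  { label: "Indonesia average", mbps: 3 },
  { label: "US average", mbps: 20 },
];

for (const s of speeds) {
  const without = downloadSeconds(windowsInstaller.withoutIcu, s.mbps);
  const withIcu = downloadSeconds(windowsInstaller.withIcu, s.mbps);
  console.log(
    s.label + ": " + without.toFixed(1) + " s -> " + withIcu.toFixed(1) +
      " s (+" + (withIcu - without).toFixed(1) + " s)"
  );
}
```

On these assumptions the extra 1.9 MB costs roughly five seconds at 3 Mbps. The estimate says nothing about failed or restarted downloads, which is the part the funnelcake data would have to measure.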
Re: Cost of ICU data
Possible crazy idea: do we actively track and send tree management notices when package or binary size changes? This seems like something we'd want to cover under the perf regressions get backed out or need approval policy. It may also help identify build system regressions and compiler oddities where sections in the binaries change in size surprisingly.

On 10/15/13 9:06 AM, Benjamin Smedberg wrote:

With the landing of bug 853301, we are now shipping ICU in desktop Firefox builds. This costs us about 10% in both download and on-disk footprint: see https://bugzilla.mozilla.org/show_bug.cgi?id=853301#c2. [...]
Re: Cost of ICU data
On 16 October 2013 23:10:39, Gregory Szorc wrote:
> Possible crazy idea: do we actively track and send tree management notices when package or binary size changes?

Not at present as far as I know, though Tim Taubert created something temporary last year (no longer accessible, but perhaps worth following up with him):
https://groups.google.com/d/msg/mozilla.dev.apps.firefox/k7fzkhdt9io/n6jnbeFsIBMJ

Best wishes,
Ed
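To make the size-tracking idea concrete, here is a rough sketch of the comparison such a notice could be built on. Everything in it is hypothetical: the artifact names, the byte counts, and the 5% threshold are invented for illustration, and it says nothing about how the build infrastructure would collect the sizes or send the notice.

```typescript
// One measured artifact, e.g. an installer or a packaged archive.
interface SizeReport {
  name: string;
  bytes: number;
}

interface SizeChange {
  name: string;
  before: number;
  after: number;
  deltaPercent: number;
}

// Return every artifact that grew by at least thresholdPercent between two builds.
function findSizeRegressions(
  before: SizeReport[],
  after: SizeReport[],
  thresholdPercent: number
): SizeChange[] {
  const previous: { [name: string]: number } = {};
  for (const report of before) {
    previous[report.name] = report.bytes;
  }
  const changes: SizeChange[] = [];
  for (const report of after) {
    const old = previous[report.name];
    if (old === undefined || old === 0) {
      continue; // new artifact or empty baseline: nothing to compare against
    }
    const deltaPercent = ((report.bytes - old) * 100) / old;
    if (deltaPercent >= thresholdPercent) {
      changes.push({ name: report.name, before: old, after: report.bytes, deltaPercent: deltaPercent });
    }
  }
  return changes;
}

// Invented numbers: a 10% installer growth, similar in scale to the ICU landing.
const baselineSizes: SizeReport[] = [
  { name: "installer.exe", bytes: 20000000 },
  { name: "omni.ja", bytes: 5000000 },
];
const currentSizes: SizeReport[] = [
  { name: "installer.exe", bytes: 22000000 },
  { name: "omni.ja", bytes: 5010000 },
];

for (const change of findSizeRegressions(baselineSizes, currentSizes, 5)) {
  console.log(change.name + ": +" + change.deltaPercent.toFixed(1) + "%"); // installer.exe: +10.0%
}
```

A percentage threshold keeps small, expected fluctuations quiet while still catching a jump of the size discussed in this thread; an absolute byte threshold would be an equally reasonable choice.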
Re: Cost of ICU data
On 15/10/2013 17:06, Benjamin Smedberg wrote:
> I'm going to post some details here about how much this costs in terms of disk footprint, to discuss whether there are things we can remove from this footprint, and whether the footprint is actually worth the cost.

As a heads up, I'm currently intending on using DecimalFormat (a subclass of NumberFormat) to parse numbers from strings as part of implementing input type=number.
Re: Cost of ICU data
On Tue, Oct 15, 2013 at 9:06 AM, Benjamin Smedberg benja...@smedbergs.us wrote:
> Do we need this data for any language other than the language Firefox ships in? Can we just include the relevant language data in each localized build of Firefox, and allow users to get other language data via downloadable language packs, similarly to how dictionaries are handled?

My understanding is that web content should not be able to tell which locale the browser is configured to use, for privacy (fingerprinting) reasons. If we went the route suggested above, it would be easy to figure out, for many users, which locale he/she is using.

> I am still working to get better numbers to quantify the costs in terms of lost adoption for additional download weight.

My (naive) understanding is that Windows has its own API that does what ICU does. I believe that Internet Explorer 11 is an existence proof of that. If we used the Windows API on Windows, maybe we could avoid building ICU altogether on Windows. Since that accounts for 90+% of our users, that would almost make it problem solved all on its own even if we did nothing else.

Cheers,
Brian
--
Mozilla Networking/Crypto/Security (Necko/NSS/PSM)
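To illustrate the fingerprinting concern, here is a small sketch of how a page could ask which locales the engine has data for, using the standard supportedLocalesOf methods. The candidate list and the function name localesWithFullData are made up for the example. The assumption being illustrated is the one raised above: if each localized build shipped only its own ICU data, the answer would differ from install to install.

```typescript
// Hypothetical candidate list; a page could try any well-formed BCP 47 tags.
const candidateTags = ["en-US", "de", "id", "th", "ja"];

// Keep only the tags that all three Intl services report data for.
function localesWithFullData(tags: string[]): string[] {
  const collation = Intl.Collator.supportedLocalesOf(tags);
  const numbers = Intl.NumberFormat.supportedLocalesOf(tags);
  const dates = Intl.DateTimeFormat.supportedLocalesOf(tags);
  return tags.filter(function (tag) {
    return (
      collation.indexOf(tag) !== -1 &&
      numbers.indexOf(tag) !== -1 &&
      dates.indexOf(tag) !== -1
    );
  });
}

// With all locale data shipped, every install gives the same answer.
// With per-locale data, the result would hint at the install's language.
console.log(localesWithFullData(candidateTags));
```

Whether this adds much beyond what Accept-Language already reveals is exactly what the replies in this thread go on to debate.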
Re: Cost of ICU data
On 10/15/2013 06:06 PM, Benjamin Smedberg wrote:
> Do we need this data for any language other than the language Firefox ships in? Can we just include the relevant language data in each localized build of Firefox, and allow users to get other language data via downloadable language packs, similarly to how dictionaries are handled? Is it possible that some of this data (the collation tables?) should be in all Firefox locales, but other data (currency and timezone names) is not as important and we can ship it only in one language?

It seems a fairly bad thing to me for us to get into the habit of prioritizing certain languages above others.

Technically, if the data is compiled into the code, this would mean language repacks would...not be repacks any more. If you had sidealong data files everywhere, then you could perhaps repack still. This might require some repacking adjustments, possibly.

ICU provides a udata_setCommonData function that lets you load data from anywhere, so there's some flexibility here. It's worth noting we currently have no central hook to insert this call before ICU's ever used. We init ICU at startup, but that init-call is fast. Presumably this new call can't be so fast, because you have to page in all the ICU data. And if you can't delay that til ICU is used, there's really no difference between the current setup and a setup that calls udata_setCommonData at startup.

Of course, this is all just software. :-)

> As far as I can tell, the spec allows user agents to ship whatever languages they need; the real question is what users and site authors actually need and expect out of the API. (I'm reading the spec out of http://wiki.ecmascript.org/doku.php?id=globalization:specification_drafts)

Grunging through v8's code, I...think...they cull locale lists for stuff to some degree. Maybe to the language set they ship. I'm looking at https://code.google.com/p/chromium/codesearch#chromium/src/third_party/icu/README.chromium and honestly don't understand enough about ICU to fully grok the substantial set of changes they've made.

> Also, we are currently duplicating the data tables on mac universal builds, because they are compiled-in symbols.

That means sync I/O on the main thread, and not well-optimized because it won't be part of the binary. Just to note.

Jeff
Re: Cost of ICU data
On 10/15/2013 1:18 PM, Brian Smith wrote:
> On Tue, Oct 15, 2013 at 9:06 AM, Benjamin Smedberg benja...@smedbergs.us wrote:
>> Do we need this data for any language other than the language Firefox ships in? Can we just include the relevant language data in each localized build of Firefox, and allow users to get other language data via downloadable language packs, similarly to how dictionaries are handled?
>
> My understanding is that web content should not be able to tell which locale the browser is configured to use, for privacy (fingerprinting) reasons.

I haven't heard this rule before. By default your browser language affects the HTTP accept-lang setting, as well as things like default font choices. You can certainly customize those back to a non-fingerprintable setting, but I'm not convinced that we should worry about this as a fingerprinting vector.

--BDS
Re: Cost of ICU data
On Tue, Oct 15, 2013 at 6:45 PM, Benjamin Smedberg benja...@smedbergs.us wrote:
> On 10/15/2013 1:18 PM, Brian Smith wrote:
>> My understanding is that web content should not be able to tell which locale the browser is configured to use, for privacy (fingerprinting) reasons.
>
> I haven't heard this rule before. By default your browser language affects the HTTP accept-lang setting, as well as things like default font choices. You can certainly customize those back to a non-fingerprintable setting, but I'm not convinced that we should worry about this as a fingerprinting vector.

I think preventing fingerprinting at a technical level is something we've lost, though we should try to avoid introducing new vectors.

As far as JavaScript API features go, I don't think we should vary our offering by locale. E.g. for Firefox OS we want changing locale to just work and not require a new version of Firefox OS. The same goes for a computer in a hotel or hostel or some such. Firefox should work for each locale users might have set in Gmail.

--
http://annevankesteren.nl/
Re: Cost of ICU data
On 10/15/2013 1:50 PM, Anne van Kesteren wrote:
> As far as JavaScript API features go, I don't think we should vary our offering by locale. E.g. for Firefox OS we want changing locale to just work and not require a new version of Firefox OS. The same goes for a computer in a hotel or hostel or some such. Firefox should work for each locale users might have set in Gmail.

And yet, we don't ship by default a version of Firefox that has all the languages in it, even though that would be good for those use cases also.

If it didn't cost us anything to include all languages, I wouldn't be harping on this. But we know that increased package sizes cost us Firefox desktop adoption. So what would the practical effect be of only including the English data files in the English Firefox, and so forth, and allowing users to get additional ICU data via langpacks, the same way we get a Firefox translation? Is there a primary use case for supporting these Intl APIs for languages that a user normally doesn't see?

--BDS
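One way to think about the "practical effect" question: the Intl constructors do not fail when a well-formed locale has no data; they resolve to a fallback locale, and a page can only notice by checking resolvedOptions(). The sketch below shows that check. The function name formatDateFor and the sample tags are invented for illustration; what it prints depends on the data the engine ships.

```typescript
interface FormattedDate {
  text: string;
  resolvedLocale: string;
  usedRequestedLanguage: boolean;
}

// Format a date for the requested locale and report whether the engine
// actually honoured that language or fell back to something else.
function formatDateFor(requested: string, when: Date): FormattedDate {
  const formatter = new Intl.DateTimeFormat(requested, {
    year: "numeric",
    month: "long",
    day: "numeric",
    timeZone: "UTC",
  });
  const resolvedLocale = formatter.resolvedOptions().locale;
  const resolvedLanguage = resolvedLocale.split("-")[0].toLowerCase();
  const requestedLanguage = requested.split("-")[0].toLowerCase();
  return {
    text: formatter.format(when),
    resolvedLocale: resolvedLocale,
    usedRequestedLanguage: resolvedLanguage === requestedLanguage,
  };
}

// Sample tags only; results vary with the locale data available.
const threadStart = new Date(Date.UTC(2013, 9, 15, 16, 6));
for (const tag of ["en-US", "id-ID", "th-TH"]) {
  const result = formatDateFor(tag, threadStart);
  console.log(tag, "->", result.resolvedLocale, result.usedRequestedLanguage, result.text);
}
```

So under a per-locale data scheme, a site asking for Thai dates in an English build would get English dates back without an error, unless the site author wrote a check like this and handled the mismatch.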
Re: Cost of ICU data
On Tue, Oct 15, 2013 at 10:50 AM, Anne van Kesteren ann...@annevk.nl wrote:
> On Tue, Oct 15, 2013 at 6:45 PM, Benjamin Smedberg benja...@smedbergs.us wrote:
>> On 10/15/2013 1:18 PM, Brian Smith wrote:
>>> My understanding is that web content should not be able to tell which locale the browser is configured to use, for privacy (fingerprinting) reasons.
>>
>> I haven't heard this rule before. By default your browser language affects the HTTP accept-lang setting, as well as things like default font choices. You can certainly customize those back to a non-fingerprintable setting, but I'm not convinced that we should worry about this as a fingerprinting vector.
>
> I think preventing fingerprinting at a technical level is something we've lost, though we should try to avoid introducing new vectors.

I think, at least, we should consider ways to avoid adding new vectors when we are making decisions. It doesn't have to be *the* deciding factor.

> As far as JavaScript API features go, I don't think we should vary our offering by locale. E.g. for Firefox OS we want changing locale to just work and not require a new version of Firefox OS. The same goes for a computer in a hotel or hostel or some such. Firefox should work for each locale users might have set in Gmail.

I strongly agree with this. No doubt there is a strong correlation between the UI locale and the locale used for web content, but it is far from a perfect correlation. Socially, we should be erring on the side of encouraging a multilingual society instead of discouraging a multilingual society. Technically, we should minimize the web-facing differences between different installations of Firefox, because having a consistent platform for web developers is a good thing. That is why we create web standards, and that is why making parts of standards optional is generally a bad thing.

I have no idea how to install a langpack. Presumably it is something that is done through AMO. I am skeptical that this is easy enough to make it acceptable to push this task off to the user. We should at least automate it for them. If this data is too large and contributing towards aborted installs, why not just split the installation phase into two parts, and install the locale data in parallel to starting up the browser?

Cheers,
Brian
--
Mozilla Networking/Crypto/Security (Necko/NSS/PSM)
Re: Cost of ICU data
On 10/15/13 12:28 PM, Brian Smith wrote:
> I have no idea how to install a langpack. Presumably it is something that is done through AMO. I am skeptical that this is easy enough to make it acceptable to push this task off to the user. We should at least automate it for them. If this data is too large and contributing towards aborted installs, why not just split the installation phase into two parts, and install the locale data in parallel to starting up the browser?

How large is a langpack? Could Firefox install (all) langpacks in the background or on demand? I've heard rumblings about a Firefox updater project to unify updates for Firefox data files that are not coupled to a particular Firefox release (such as CRLs and GPU driver blocklists).

chris
Re: Cost of ICU data
Jeff Walden writes:
> On 10/15/2013 06:06 PM, Benjamin Smedberg wrote:
>
> That means sync I/O on the main thread, and not well-optimized because it won't be part of the binary. Just to note.

When sync I/O is performed to read in-binary-object data, how is that better? Just readahead? Wouldn't something similar be possible with separate files?
Re: Cost of ICU data
On 10/15/13 2:41 PM, Chris Peterson wrote:
> On 10/15/13 12:28 PM, Brian Smith wrote:
>> I have no idea how to install a langpack. Presumably it is something that is done through AMO. I am skeptical that this is easy enough to make it acceptable to push this task off to the user. We should at least automate it for them. If this data is too large and contributing towards aborted installs, why not just split the installation phase into two parts, and install the locale data in parallel to starting up the browser?
>
> How large is a langpack? Could Firefox install (all) langpacks in the background or on demand? I've heard rumblings about a Firefox updater project to unify updates for Firefox data files that are not coupled to a particular Firefox release (such as CRLs and GPU driver blocklists).
>
> chris

A quick look at this page (https://addons.mozilla.org/firefox/language-tools/) shows that they're generally in the 350-400 KB range, each. I don't know how those would compare with ICU lang packs.

Jorge