Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Fri, Sep 20, 2013 at 8:00 AM, H. Peter Anvin wrote: > On 09/20/2013 04:27 AM, Matt Fleming wrote: >> On Wed, 18 Sep, at 09:48:44PM, Roy Franz wrote: >>> Would it be acceptable to fix the naming/comments, and convert values >>> above 126 to '?' >>> in the current patchset, and address a more thorough fix in another patch >>> set? >>> The ARM and ARM64 EFI stub patchsets that are mostly complete depend >>> on this one, >>> so getting this merged soon would be helpful. >> >> Just fixing the function name and comments is enough for this patch >> series. Anything else should be separate. >> > > I just whipped up a patch to do proper UTF-16 to UTF-8 conversion. > Completely untested, of course. > > -hpa > Thanks for putting this together. I fixed up a few minor issues, and it works. Updated version below. I'll submit this as a separate patch as part of the EFI stub common code series. Roy commit 827285bac3daa79cd562bf79b5e9e88a61d357be Author: H. Peter Anvin Date: Fri Sep 20 12:46:16 2013 -0700 Do proper conversion from UTF-16 to UTF-8 Improve the conversion of the UTF-16 EFI command line to UTF-8 for passing to the kernel. Signed-off-by: Roy Franz diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c index 5e708c0..4723dc89 100644 --- a/arch/x86/boot/compressed/eboot.c +++ b/arch/x86/boot/compressed/eboot.c @@ -486,8 +486,7 @@ struct boot_params *make_boot_params(void *handle, efi_system_table_t *_table) hdr->type_of_loader = 0x21; /* Convert unicode cmdline to ascii */ - cmdline_ptr = efi_convert_cmdline_to_ascii(sys_table, image, - _size); + cmdline_ptr = efi_convert_cmdline(sys_table, image, _size); if (!cmdline_ptr) goto fail; hdr->cmd_line_ptr = (unsigned long)cmdline_ptr; diff --git a/drivers/firmware/efi/efi-stub-helper.c b/drivers/firmware/efi/efi-stub-helper.c index 335d17d..8a3ab4b 100644 --- a/drivers/firmware/efi/efi-stub-helper.c +++ b/drivers/firmware/efi/efi-stub-helper.c @@ -548,61 +548,112 @@ static efi_status_t efi_relocate_kernel(efi_system_table_t *sys_table_arg, return status; } -/* Convert the unicode UEFI command line to ASCII to pass to kernel. + +/* + * Get the number of UTF-8 bytes corresponding to an UTF-16 character. + * This overestimates for surrogates, but that is okay. + */ +static int efi_utf8_bytes(u16 c) +{ + return 1 + (c >= 0x80) + (c >= 0x800); +} + +/* + * Convert an UTF-16 string, not necessarily null terminated, to UTF-8. + */ +static u8 *efi_utf16_to_utf8(u8 *dst, const u16 *src, int n) +{ + unsigned int c; + + while (n--) { + c = *src++; + if (n && c >= 0xd800 && c <= 0xdbff && +*src >= 0xdc00 && *src <= 0xdfff) { + c = 0x1 + ((c & 0x3ff) << 10) + (*src & 0x3ff); + src++; + n--; + } + if (c >= 0xd800 && c <= 0xdfff) + c = 0xfffd; /* Unmatched surrogate */ + if (c < 0x80) { + *dst++ = c; + continue; + } + if (c < 0x800) { + *dst++ = 0xc0 + (c >> 6); + goto t1; + } + if (c < 0x1) { + *dst++ = 0xe0 + (c >> 12); + goto t2; + } + *dst++ = 0xf0 + (c >> 18); + *dst++ = 0x80 + ((c >> 12) & 0x3f); + t2: + *dst++ = 0x80 + ((c >> 6) & 0x3f); + t1: + *dst++ = 0x80 + (c & 0x3f); + } + + return dst; +} + +/* + * Convert the unicode UEFI command line to ASCII to pass to kernel. * Size of memory allocated return in *cmd_line_len. * Returns NULL on error. */ -static char *efi_convert_cmdline_to_ascii(efi_system_table_t *sys_table_arg, - efi_loaded_image_t *image, - int *cmd_line_len) +static char *efi_convert_cmdline(efi_system_table_t *sys_table_arg, + efi_loaded_image_t *image, + int *cmd_line_len) { - u16 *s2; + const u16 *s2; u8 *s1 = NULL; unsigned long cmdline_addr = 0; int load_options_size = image->load_options_size / 2; /* ASCII */ - void *options = image->load_options; - int options_size = 0; + const u16 *options = image->load_options; + int options_bytes = 0; /* UTF-8 bytes */ + int options_chars = 0; /* UTF-16 chars */ efi_status_t status; int i; u16 zero = 0; if (options) { s2 = options; - while (*s2 && *s2 != '\n' && options_size < load_options_size) { + while (*s2 && *s2 != '\n' && options_bytes < load_options_size) { + options_bytes += efi_utf8_bytes(*s2); s2++; - options_size++; } + options_chars = s2 - options; } - if (options_size == 0) { - /* No command line options, so return empty string*/ - options_size = 1; + if (!options_chars) { + /* No command line options, so return empty string */ options = } - options_size++; /* NUL termination */ + options_bytes++; /* NUL termination */ + #ifdef CONFIG_ARM /* For ARM, allocate at a high address to avoid reserved * regions at low addresses that we don't know the specfics of * at the time we are processing the command line. */ - status = efi_high_alloc(sys_table_arg, options_size, 0, + status = efi_high_alloc(sys_table_arg, options_bytes, 0, _addr, 0xf000); #else - status = efi_low_alloc(sys_table_arg, options_size, 0, + status = efi_low_alloc(sys_table_arg,
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Fri, Sep 20, 2013 at 8:00 AM, H. Peter Anvin h...@zytor.com wrote: On 09/20/2013 04:27 AM, Matt Fleming wrote: On Wed, 18 Sep, at 09:48:44PM, Roy Franz wrote: Would it be acceptable to fix the naming/comments, and convert values above 126 to '?' in the current patchset, and address a more thorough fix in another patch set? The ARM and ARM64 EFI stub patchsets that are mostly complete depend on this one, so getting this merged soon would be helpful. Just fixing the function name and comments is enough for this patch series. Anything else should be separate. I just whipped up a patch to do proper UTF-16 to UTF-8 conversion. Completely untested, of course. -hpa Thanks for putting this together. I fixed up a few minor issues, and it works. Updated version below. I'll submit this as a separate patch as part of the EFI stub common code series. Roy commit 827285bac3daa79cd562bf79b5e9e88a61d357be Author: H. Peter Anvin h...@zytor.com Date: Fri Sep 20 12:46:16 2013 -0700 Do proper conversion from UTF-16 to UTF-8 Improve the conversion of the UTF-16 EFI command line to UTF-8 for passing to the kernel. Signed-off-by: Roy Franz roy.fr...@linaro.org diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c index 5e708c0..4723dc89 100644 --- a/arch/x86/boot/compressed/eboot.c +++ b/arch/x86/boot/compressed/eboot.c @@ -486,8 +486,7 @@ struct boot_params *make_boot_params(void *handle, efi_system_table_t *_table) hdr-type_of_loader = 0x21; /* Convert unicode cmdline to ascii */ - cmdline_ptr = efi_convert_cmdline_to_ascii(sys_table, image, - options_size); + cmdline_ptr = efi_convert_cmdline(sys_table, image, options_size); if (!cmdline_ptr) goto fail; hdr-cmd_line_ptr = (unsigned long)cmdline_ptr; diff --git a/drivers/firmware/efi/efi-stub-helper.c b/drivers/firmware/efi/efi-stub-helper.c index 335d17d..8a3ab4b 100644 --- a/drivers/firmware/efi/efi-stub-helper.c +++ b/drivers/firmware/efi/efi-stub-helper.c @@ -548,61 +548,112 @@ static efi_status_t efi_relocate_kernel(efi_system_table_t *sys_table_arg, return status; } -/* Convert the unicode UEFI command line to ASCII to pass to kernel. + +/* + * Get the number of UTF-8 bytes corresponding to an UTF-16 character. + * This overestimates for surrogates, but that is okay. + */ +static int efi_utf8_bytes(u16 c) +{ + return 1 + (c = 0x80) + (c = 0x800); +} + +/* + * Convert an UTF-16 string, not necessarily null terminated, to UTF-8. + */ +static u8 *efi_utf16_to_utf8(u8 *dst, const u16 *src, int n) +{ + unsigned int c; + + while (n--) { + c = *src++; + if (n c = 0xd800 c = 0xdbff +*src = 0xdc00 *src = 0xdfff) { + c = 0x1 + ((c 0x3ff) 10) + (*src 0x3ff); + src++; + n--; + } + if (c = 0xd800 c = 0xdfff) + c = 0xfffd; /* Unmatched surrogate */ + if (c 0x80) { + *dst++ = c; + continue; + } + if (c 0x800) { + *dst++ = 0xc0 + (c 6); + goto t1; + } + if (c 0x1) { + *dst++ = 0xe0 + (c 12); + goto t2; + } + *dst++ = 0xf0 + (c 18); + *dst++ = 0x80 + ((c 12) 0x3f); + t2: + *dst++ = 0x80 + ((c 6) 0x3f); + t1: + *dst++ = 0x80 + (c 0x3f); + } + + return dst; +} + +/* + * Convert the unicode UEFI command line to ASCII to pass to kernel. * Size of memory allocated return in *cmd_line_len. * Returns NULL on error. */ -static char *efi_convert_cmdline_to_ascii(efi_system_table_t *sys_table_arg, - efi_loaded_image_t *image, - int *cmd_line_len) +static char *efi_convert_cmdline(efi_system_table_t *sys_table_arg, + efi_loaded_image_t *image, + int *cmd_line_len) { - u16 *s2; + const u16 *s2; u8 *s1 = NULL; unsigned long cmdline_addr = 0; int load_options_size = image-load_options_size / 2; /* ASCII */ - void *options = image-load_options; - int options_size = 0; + const u16 *options = image-load_options; + int options_bytes = 0; /* UTF-8 bytes */ + int options_chars = 0; /* UTF-16 chars */ efi_status_t status; int i; u16 zero = 0; if (options) { s2 = options; - while (*s2 *s2 != '\n' options_size load_options_size) { + while (*s2 *s2 != '\n' options_bytes load_options_size) { + options_bytes += efi_utf8_bytes(*s2); s2++; - options_size++; } + options_chars = s2 - options; } - if (options_size == 0) { - /* No command line options, so return empty string*/ - options_size = 1; + if (!options_chars) { + /* No command line options, so return empty string */ options = zero; } - options_size++; /* NUL termination */ + options_bytes++; /* NUL termination */ + #ifdef CONFIG_ARM /* For ARM, allocate at a high address to avoid reserved * regions at low addresses that we don't know the specfics of * at the time we are processing the command line. */ - status = efi_high_alloc(sys_table_arg, options_size, 0, + status = efi_high_alloc(sys_table_arg, options_bytes, 0, cmdline_addr, 0xf000); #else - status = efi_low_alloc(sys_table_arg, options_size, 0, + status = efi_low_alloc(sys_table_arg, options_bytes, 0,
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On 09/20/2013 04:27 AM, Matt Fleming wrote: > On Wed, 18 Sep, at 09:48:44PM, Roy Franz wrote: >> Would it be acceptable to fix the naming/comments, and convert values >> above 126 to '?' >> in the current patchset, and address a more thorough fix in another patch >> set? >> The ARM and ARM64 EFI stub patchsets that are mostly complete depend >> on this one, >> so getting this merged soon would be helpful. > > Just fixing the function name and comments is enough for this patch > series. Anything else should be separate. > I just whipped up a patch to do proper UTF-16 to UTF-8 conversion. Completely untested, of course. -hpa >From 7d6cf630c1adbb9787a24c2994230373c2b20a8f Mon Sep 17 00:00:00 2001 From: "H. Peter Anvin" Date: Fri, 20 Sep 2013 09:55:39 -0500 Subject: [PATCH] efi: Handle arbitrary Unicode characters Instead of truncating UTF-16 assuming all characters is ASCII, properly convert it to UTF-8. Signed-off-by: H. Peter Anvin --- arch/x86/boot/compressed/eboot.c | 3 +- drivers/firmware/efi/efi-stub-helper.c | 89 ++ 2 files changed, 71 insertions(+), 21 deletions(-) diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c index bf56206..aacbe50 100644 --- a/arch/x86/boot/compressed/eboot.c +++ b/arch/x86/boot/compressed/eboot.c @@ -488,8 +488,7 @@ struct boot_params *make_boot_params(void *handle, efi_system_table_t *_table) hdr->type_of_loader = 0x21; /* Convert unicode cmdline to ascii */ - cmdline_ptr = efi_convert_cmdline_to_ascii(sys_table, image, - _size); + cmdline_ptr = efi_convert_cmdline(sys_table, image, _size); if (!cmdline_ptr) goto fail; hdr->cmd_line_ptr = (unsigned long)cmdline_ptr; diff --git a/drivers/firmware/efi/efi-stub-helper.c b/drivers/firmware/efi/efi-stub-helper.c index c91e757..7d4c1a4 100644 --- a/drivers/firmware/efi/efi-stub-helper.c +++ b/drivers/firmware/efi/efi-stub-helper.c @@ -567,20 +567,72 @@ static efi_status_t efi_relocate_kernel(efi_system_table_t *sys_table_arg, return status; } -/* Convert the unicode UEFI command line to ASCII to pass to kernel. + +/* + * Get the number of UTF-8 bytes corresponding to an UTF-16 character. + * This overestimates for surrogates, but that is okay. + */ +static int efi_utf8_bytes(u16 c) +{ + return 1 + (c >= 0x80) + (c >= 0x800); +} + +/* + * Convert an UTF-16 string, not necessarily null terminated, to UTF-8. + */ +static u8 *efi_utf8_to_utf16(u8 *dst, const u16 *src, int n) +{ + unsigned int c; + + while (n--) { + c = *src++; + if (n && c >= 0xd800 && c <= 0xdbff && + *src >= 0xdc00 && *src <= 0xdfff) { + c = 0x1 + ((c & 0x3ff) << 10) + (*src & 0x3ff); + src++; + n--; + } + if (c >= 0xd800 && c <= 0xdfff) + c = 0xfffd; /* Unmatched surrogate */ + if (c < 0x80) { + *dst++ = c; + continue; + } + if (c < 0x800) { + *dst++ = 0xc0 + (c >> 6); + goto t1; + } + if (c < 0x1) { + *dst++ = 0xe0 + (c >> 12); + goto t2; + } + *dst++ = 0xf0 + (c >> 18); + *dst++ = 0x80 + ((c >> 12) & 0x3f); + t2: + *dst++ = 0x80 + ((c >> 6) & 0x3f); + t1: + *dst++ = 0x80 + (c & 0x3f); + } + + return dst; +} + +/* + * Convert the unicode UEFI command line to ASCII to pass to kernel. * Size of memory allocated return in *cmd_line_len. * Returns NULL on error. */ -static char *efi_convert_cmdline_to_ascii(efi_system_table_t *sys_table_arg, - efi_loaded_image_t *image, - int *cmd_line_len) +static char *efi_convert_cmdline(efi_system_table_t *sys_table_arg, + efi_loaded_image_t *image, + int *cmd_line_len) { - u16 *s2; + const u16 *s2; u8 *s1 = NULL; unsigned long cmdline_addr = 0; int load_options_size = image->load_options_size / 2; /* ASCII */ - void *options = image->load_options; - int options_size = 0; + const void *options = image->load_options; + int options_bytes = 0; /* UTF-8 bytes */ + int options_chars = 0; /* UTF-16 chars */ efi_status_t status; int i; u16 zero = 0; @@ -588,40 +640,39 @@ static char *efi_convert_cmdline_to_ascii(efi_system_table_t *sys_table_arg, if (options) { s2 = options; while (*s2 && *s2 != '\n' && options_size < load_options_size) { + options_bytes += efi_utf8_bytes(*s2); s2++; - options_size++; } + options_chars = s2 - options; } - if (options_size == 0) { - /* No command line options, so return empty string*/ - options_size = 1; + if (!options_chars) { + /* No command line options, so return empty string */ options = } - options_size++; /* NUL termination */ + options_bytes++; /* NUL termination */ + #ifdef CONFIG_ARM /* For ARM, allocate at a high address to avoid reserved * regions at low addresses that we don't know the specfics of * at the time we are processing the command line. */ - status = efi_high_alloc(sys_table_arg, options_size, 0, + status = efi_high_alloc(sys_table_arg, options_bytes, 0, _addr, 0xf000); #else - status = efi_low_alloc(sys_table_arg,
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Fri, 20 Sep, at 01:02:41AM, Adam Borowski wrote: > Are we on the same page so far? If so, I can make a patch atop yours. Patches to fix this would be most welcome, though I'd wait until Roy sends out another version of this patch with the name/comment changes. This series will eventually end up in the 'next' branch at, git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git -- Matt Fleming, Intel Open Source Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Wed, 18 Sep, at 09:48:44PM, Roy Franz wrote: > Would it be acceptable to fix the naming/comments, and convert values > above 126 to '?' > in the current patchset, and address a more thorough fix in another patch set? > The ARM and ARM64 EFI stub patchsets that are mostly complete depend > on this one, > so getting this merged soon would be helpful. Just fixing the function name and comments is enough for this patch series. Anything else should be separate. -- Matt Fleming, Intel Open Source Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Wed, 18 Sep, at 09:48:44PM, Roy Franz wrote: Would it be acceptable to fix the naming/comments, and convert values above 126 to '?' in the current patchset, and address a more thorough fix in another patch set? The ARM and ARM64 EFI stub patchsets that are mostly complete depend on this one, so getting this merged soon would be helpful. Just fixing the function name and comments is enough for this patch series. Anything else should be separate. -- Matt Fleming, Intel Open Source Technology Center -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Fri, 20 Sep, at 01:02:41AM, Adam Borowski wrote: Are we on the same page so far? If so, I can make a patch atop yours. Patches to fix this would be most welcome, though I'd wait until Roy sends out another version of this patch with the name/comment changes. This series will eventually end up in the 'next' branch at, git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git -- Matt Fleming, Intel Open Source Technology Center -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On 09/20/2013 04:27 AM, Matt Fleming wrote: On Wed, 18 Sep, at 09:48:44PM, Roy Franz wrote: Would it be acceptable to fix the naming/comments, and convert values above 126 to '?' in the current patchset, and address a more thorough fix in another patch set? The ARM and ARM64 EFI stub patchsets that are mostly complete depend on this one, so getting this merged soon would be helpful. Just fixing the function name and comments is enough for this patch series. Anything else should be separate. I just whipped up a patch to do proper UTF-16 to UTF-8 conversion. Completely untested, of course. -hpa From 7d6cf630c1adbb9787a24c2994230373c2b20a8f Mon Sep 17 00:00:00 2001 From: H. Peter Anvin h...@linux.intel.com Date: Fri, 20 Sep 2013 09:55:39 -0500 Subject: [PATCH] efi: Handle arbitrary Unicode characters Instead of truncating UTF-16 assuming all characters is ASCII, properly convert it to UTF-8. Signed-off-by: H. Peter Anvin h...@linux.intel.com --- arch/x86/boot/compressed/eboot.c | 3 +- drivers/firmware/efi/efi-stub-helper.c | 89 ++ 2 files changed, 71 insertions(+), 21 deletions(-) diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c index bf56206..aacbe50 100644 --- a/arch/x86/boot/compressed/eboot.c +++ b/arch/x86/boot/compressed/eboot.c @@ -488,8 +488,7 @@ struct boot_params *make_boot_params(void *handle, efi_system_table_t *_table) hdr-type_of_loader = 0x21; /* Convert unicode cmdline to ascii */ - cmdline_ptr = efi_convert_cmdline_to_ascii(sys_table, image, - options_size); + cmdline_ptr = efi_convert_cmdline(sys_table, image, options_size); if (!cmdline_ptr) goto fail; hdr-cmd_line_ptr = (unsigned long)cmdline_ptr; diff --git a/drivers/firmware/efi/efi-stub-helper.c b/drivers/firmware/efi/efi-stub-helper.c index c91e757..7d4c1a4 100644 --- a/drivers/firmware/efi/efi-stub-helper.c +++ b/drivers/firmware/efi/efi-stub-helper.c @@ -567,20 +567,72 @@ static efi_status_t efi_relocate_kernel(efi_system_table_t *sys_table_arg, return status; } -/* Convert the unicode UEFI command line to ASCII to pass to kernel. + +/* + * Get the number of UTF-8 bytes corresponding to an UTF-16 character. + * This overestimates for surrogates, but that is okay. + */ +static int efi_utf8_bytes(u16 c) +{ + return 1 + (c = 0x80) + (c = 0x800); +} + +/* + * Convert an UTF-16 string, not necessarily null terminated, to UTF-8. + */ +static u8 *efi_utf8_to_utf16(u8 *dst, const u16 *src, int n) +{ + unsigned int c; + + while (n--) { + c = *src++; + if (n c = 0xd800 c = 0xdbff + *src = 0xdc00 *src = 0xdfff) { + c = 0x1 + ((c 0x3ff) 10) + (*src 0x3ff); + src++; + n--; + } + if (c = 0xd800 c = 0xdfff) + c = 0xfffd; /* Unmatched surrogate */ + if (c 0x80) { + *dst++ = c; + continue; + } + if (c 0x800) { + *dst++ = 0xc0 + (c 6); + goto t1; + } + if (c 0x1) { + *dst++ = 0xe0 + (c 12); + goto t2; + } + *dst++ = 0xf0 + (c 18); + *dst++ = 0x80 + ((c 12) 0x3f); + t2: + *dst++ = 0x80 + ((c 6) 0x3f); + t1: + *dst++ = 0x80 + (c 0x3f); + } + + return dst; +} + +/* + * Convert the unicode UEFI command line to ASCII to pass to kernel. * Size of memory allocated return in *cmd_line_len. * Returns NULL on error. */ -static char *efi_convert_cmdline_to_ascii(efi_system_table_t *sys_table_arg, - efi_loaded_image_t *image, - int *cmd_line_len) +static char *efi_convert_cmdline(efi_system_table_t *sys_table_arg, + efi_loaded_image_t *image, + int *cmd_line_len) { - u16 *s2; + const u16 *s2; u8 *s1 = NULL; unsigned long cmdline_addr = 0; int load_options_size = image-load_options_size / 2; /* ASCII */ - void *options = image-load_options; - int options_size = 0; + const void *options = image-load_options; + int options_bytes = 0; /* UTF-8 bytes */ + int options_chars = 0; /* UTF-16 chars */ efi_status_t status; int i; u16 zero = 0; @@ -588,40 +640,39 @@ static char *efi_convert_cmdline_to_ascii(efi_system_table_t *sys_table_arg, if (options) { s2 = options; while (*s2 *s2 != '\n' options_size load_options_size) { + options_bytes += efi_utf8_bytes(*s2); s2++; - options_size++; } + options_chars = s2 - options; } - if (options_size == 0) { - /* No command line options, so return empty string*/ - options_size = 1; + if (!options_chars) { + /* No command line options, so return empty string */ options = zero; } - options_size++; /* NUL termination */ + options_bytes++; /* NUL termination */ + #ifdef CONFIG_ARM /* For ARM, allocate at a high address to avoid reserved * regions at low addresses that we don't know the specfics of * at the time we are processing the command line. */ - status = efi_high_alloc(sys_table_arg, options_size, 0, + status = efi_high_alloc(sys_table_arg, options_bytes, 0, cmdline_addr, 0xf000); #else - status = efi_low_alloc(sys_table_arg,
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Wed, Sep 18, 2013 at 09:48:44PM -0700, Roy Franz wrote: > On Wed, Sep 18, 2013 at 8:44 PM, Adam Borowski wrote: > > [UCS2 truncation] > > I stuck to re-arranging the code that was there, as I don't know enough > about character encodings to propose changes. I on the other hand don't know the kernel (lurking because of my first patch), but I'm on a crusade against mangled Unicode (so far in the userland). Can't let such a blatant error slip through on my watch :) > Also, this code is running as part of the kernel decompressor, rather than > the kernel itself, so it doesn't have access to any kernel facilities, and > it also needs to be position independent. Ok, so it can't reuse common libraries. No problem, a simplified, sanitized and optimized copy of utf16s_to_utf8s() can be done in quite less code than the original. > It's running in a quite limited environment - the decompressor has > its own copy of strstr(), and other string functions. I'd need nothing but a way to alloc the new string. And I see this is already done (efi_{low,high_alloc()). > I checked the UEFI specification, and it states that all 16 bit strings > are UCS-2, unless otherwise noted. ... which means it will either get upgraded to UTF-16 in a subsequent version, or some Unicode strings get mangled. I'd ignore this bit and implement full UTF-16 from the start: every legal UCS-2 string can be decoded as UTF-16 so it's a strict superset. > The load options that the command line is provided through a void pointer > specified as: [snip] Either a null pointer or a 16-bit string, that sounds clear enough. I see not a word about endianness (does anything do EFI on big endian?), but "same as host" seems to be a reasonable assumption. > Would it be acceptable to fix the naming/comments, and convert values > above 126 to '?' in the current patchset, and address a more thorough fix > in another patch set? The ARM and ARM64 EFI stub patchsets that are > mostly complete depend on this one, so getting this merged soon would be > helpful. I don't want to hinder your work, so what about putting in your version as-is and fixing it later? > > There's just one problem: which encoding to use, but > > these days, most distributions have either dropped non-UTF8 or hardly pay > > lip service, so we could get away with hard-coding UTF-8: those few who > > use ancient charsets can stick to ASCII. Not being able to use regular kernel facilities makes supporting ancient charsets a lost cause. I'm so weeping about them... not. > I would certainly appreciate your help improving this Are we on the same page so far? If so, I can make a patch atop yours. -- ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Wed, Sep 18, 2013 at 09:48:44PM -0700, Roy Franz wrote: On Wed, Sep 18, 2013 at 8:44 PM, Adam Borowski kilob...@angband.pl wrote: [UCS2 truncation] I stuck to re-arranging the code that was there, as I don't know enough about character encodings to propose changes. I on the other hand don't know the kernel (lurking because of my first patch), but I'm on a crusade against mangled Unicode (so far in the userland). Can't let such a blatant error slip through on my watch :) Also, this code is running as part of the kernel decompressor, rather than the kernel itself, so it doesn't have access to any kernel facilities, and it also needs to be position independent. Ok, so it can't reuse common libraries. No problem, a simplified, sanitized and optimized copy of utf16s_to_utf8s() can be done in quite less code than the original. It's running in a quite limited environment - the decompressor has its own copy of strstr(), and other string functions. I'd need nothing but a way to alloc the new string. And I see this is already done (efi_{low,high_alloc()). I checked the UEFI specification, and it states that all 16 bit strings are UCS-2, unless otherwise noted. ... which means it will either get upgraded to UTF-16 in a subsequent version, or some Unicode strings get mangled. I'd ignore this bit and implement full UTF-16 from the start: every legal UCS-2 string can be decoded as UTF-16 so it's a strict superset. The load options that the command line is provided through a void pointer specified as: [snip] Either a null pointer or a 16-bit string, that sounds clear enough. I see not a word about endianness (does anything do EFI on big endian?), but same as host seems to be a reasonable assumption. Would it be acceptable to fix the naming/comments, and convert values above 126 to '?' in the current patchset, and address a more thorough fix in another patch set? The ARM and ARM64 EFI stub patchsets that are mostly complete depend on this one, so getting this merged soon would be helpful. I don't want to hinder your work, so what about putting in your version as-is and fixing it later? There's just one problem: which encoding to use, but these days, most distributions have either dropped non-UTF8 or hardly pay lip service, so we could get away with hard-coding UTF-8: those few who use ancient charsets can stick to ASCII. Not being able to use regular kernel facilities makes supporting ancient charsets a lost cause. I'm so weeping about them... not. I would certainly appreciate your help improving this Are we on the same page so far? If so, I can make a patch atop yours. -- ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Wed, Sep 18, 2013 at 8:44 PM, Adam Borowski wrote: > On Mon, Sep 16, 2013 at 09:11:25PM -0700, Roy Franz wrote: >> +/* Convert the unicode UEFI command line to ASCII to pass to kernel. >> + * Size of memory allocated return in *cmd_line_len. >> + * Returns NULL on error. >> + */ >> +static char *efi_convert_cmdline_to_ascii(efi_system_table_t *sys_table_arg, > >> + int load_options_size = image->load_options_size / 2; /* ASCII */ > >> + for (i = 0; i < options_size - 1; i++) >> + *s1++ = *s2++; > > I'm afraid both this commit and comments inside are misnamed. What you're > changing here is the encoding rather than character set. > > In fact, these days it's 8-bit encodings that are more likely to be Unicode > than 16-bit ones: UTF-8 is ubiquitous, while you usually get UCS2 at most. > In either case, though, we have here is a 7-bit charset encoded as either > 8-bit or 16-bit units. What this function does is blindly truncating upper > byte. The supported payload is in both cases ASCII. > > I'd thus rename the function to what it already does: truncating u16 to u8, > and adjust comments accordingly. > > Replacing values above 126 with a token character like '?' would be good > too: that'd avoid producing corrupted characters and/or random ASCII chars. I stuck to re-arranging the code that was there, as I don't know enough about character encodings to propose changes. Also, this code is running as part of the kernel decompressor, rather than the kernel itself, so it doesn't have access to any kernel facilities, and it also needs to be position independent. It's running in a quite limited environment - the decompressor has its own copy of strstr(), and other string functions. I checked the UEFI specification, and it states that all 16 bit strings are UCS-2, unless otherwise noted. The load options that the command line is provided through a void pointer specified as: "LoadOptions A pointer to the image's binary load options." I couldn't quickly find any other mentions of LoadOptions in the spec., so I don't know what other types of values could be, or are typically provided. The spec isn't helpful regarding the type of this data, and I don't know what common practice is here among the various EFI/UEFI firmwares out there. Would it be acceptable to fix the naming/comments, and convert values above 126 to '?' in the current patchset, and address a more thorough fix in another patch set? The ARM and ARM64 EFI stub patchsets that are mostly complete depend on this one, so getting this merged soon would be helpful. Something like: >> +/* Convert the UEFI command line from 16 bit to 8 bit encoding to pass to >> kernel. >> + * by truncating to 7 bits. Values above 126 are converted to 63 (ASCII >> '?') >> + * Size of memory allocated return in *cmd_line_len. >> + * Returns NULL on error. >> + */ +static char *efi_truncate_cmdline_to_8bit_encoding(efi_system_table_t *sys_table_arg, (update code to truncate and convert to '?') > > Your commit only moves things around, so it might be out of scope for now, > but I wonder: what if the kernel actually supported Unicode here? Few > cmdline arguments take values where non-ASCII makes sense, but at least some > do: for example, a Russian guy is not unlikely to name subvolumes using > cyrillic. Supporting that would be easy (estimating the length then > tf16us_to_utf8s()). There's just one problem: which encoding to use, but > these days, most distributions have either dropped non-UTF8 or hardly pay > lip service, so we could get away with hard-coding UTF-8: those few who > use ancient charsets can stick to ASCII. Would this be ok? If so, shout, > I can code this if you don't care enough. I would certainly appreciate your help improving this, although I think we'll need to have a better understanding of what the UEFI firmware is providing before we can implement something better. > > -- > ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On 09/18/2013 10:44 PM, Adam Borowski wrote: > > In fact, these days it's 8-bit encodings that are more likely to be Unicode > than 16-bit ones: UTF-8 is ubiquitous, while you usually get UCS2 at most. > In either case, though, we have here is a 7-bit charset encoded as either > 8-bit or 16-bit units. What this function does is blindly truncating upper > byte. The supported payload is in both cases ASCII. > > I'd thus rename the function to what it already does: truncating u16 to u8, > and adjust comments accordingly. > > Replacing values above 126 with a token character like '?' would be good > too: that'd avoid producing corrupted characters and/or random ASCII chars. > > Your commit only moves things around, so it might be out of scope for now, > but I wonder: what if the kernel actually supported Unicode here? Few > cmdline arguments take values where non-ASCII makes sense, but at least some > do: for example, a Russian guy is not unlikely to name subvolumes using > cyrillic. Supporting that would be easy (estimating the length then > utf16s_to_utf8s()). There's just one problem: which encoding to use, but > these days, most distributions have either dropped non-UTF8 or hardly pay > lip service, so we could get away with hard-coding UTF-8: those few who > use ancient charsets can stick to ASCII. Would this be ok? If so, shout, > I can code this if you don't care enough. > We should, indeed, do proper conversion to UTF-8 here. I also suspect we should assume the input is UTF-16 rather than UCS-2, although that is a bit more exotic. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Mon, Sep 16, 2013 at 09:11:25PM -0700, Roy Franz wrote: > +/* Convert the unicode UEFI command line to ASCII to pass to kernel. > + * Size of memory allocated return in *cmd_line_len. > + * Returns NULL on error. > + */ > +static char *efi_convert_cmdline_to_ascii(efi_system_table_t *sys_table_arg, > + int load_options_size = image->load_options_size / 2; /* ASCII */ > + for (i = 0; i < options_size - 1; i++) > + *s1++ = *s2++; I'm afraid both this commit and comments inside are misnamed. What you're changing here is the encoding rather than character set. In fact, these days it's 8-bit encodings that are more likely to be Unicode than 16-bit ones: UTF-8 is ubiquitous, while you usually get UCS2 at most. In either case, though, we have here is a 7-bit charset encoded as either 8-bit or 16-bit units. What this function does is blindly truncating upper byte. The supported payload is in both cases ASCII. I'd thus rename the function to what it already does: truncating u16 to u8, and adjust comments accordingly. Replacing values above 126 with a token character like '?' would be good too: that'd avoid producing corrupted characters and/or random ASCII chars. Your commit only moves things around, so it might be out of scope for now, but I wonder: what if the kernel actually supported Unicode here? Few cmdline arguments take values where non-ASCII makes sense, but at least some do: for example, a Russian guy is not unlikely to name subvolumes using cyrillic. Supporting that would be easy (estimating the length then utf16s_to_utf8s()). There's just one problem: which encoding to use, but these days, most distributions have either dropped non-UTF8 or hardly pay lip service, so we could get away with hard-coding UTF-8: those few who use ancient charsets can stick to ASCII. Would this be ok? If so, shout, I can code this if you don't care enough. -- ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Mon, Sep 16, 2013 at 09:11:25PM -0700, Roy Franz wrote: +/* Convert the unicode UEFI command line to ASCII to pass to kernel. + * Size of memory allocated return in *cmd_line_len. + * Returns NULL on error. + */ +static char *efi_convert_cmdline_to_ascii(efi_system_table_t *sys_table_arg, + int load_options_size = image-load_options_size / 2; /* ASCII */ + for (i = 0; i options_size - 1; i++) + *s1++ = *s2++; I'm afraid both this commit and comments inside are misnamed. What you're changing here is the encoding rather than character set. In fact, these days it's 8-bit encodings that are more likely to be Unicode than 16-bit ones: UTF-8 is ubiquitous, while you usually get UCS2 at most. In either case, though, we have here is a 7-bit charset encoded as either 8-bit or 16-bit units. What this function does is blindly truncating upper byte. The supported payload is in both cases ASCII. I'd thus rename the function to what it already does: truncating u16 to u8, and adjust comments accordingly. Replacing values above 126 with a token character like '?' would be good too: that'd avoid producing corrupted characters and/or random ASCII chars. Your commit only moves things around, so it might be out of scope for now, but I wonder: what if the kernel actually supported Unicode here? Few cmdline arguments take values where non-ASCII makes sense, but at least some do: for example, a Russian guy is not unlikely to name subvolumes using cyrillic. Supporting that would be easy (estimating the length then utf16s_to_utf8s()). There's just one problem: which encoding to use, but these days, most distributions have either dropped non-UTF8 or hardly pay lip service, so we could get away with hard-coding UTF-8: those few who use ancient charsets can stick to ASCII. Would this be ok? If so, shout, I can code this if you don't care enough. -- ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On 09/18/2013 10:44 PM, Adam Borowski wrote: In fact, these days it's 8-bit encodings that are more likely to be Unicode than 16-bit ones: UTF-8 is ubiquitous, while you usually get UCS2 at most. In either case, though, we have here is a 7-bit charset encoded as either 8-bit or 16-bit units. What this function does is blindly truncating upper byte. The supported payload is in both cases ASCII. I'd thus rename the function to what it already does: truncating u16 to u8, and adjust comments accordingly. Replacing values above 126 with a token character like '?' would be good too: that'd avoid producing corrupted characters and/or random ASCII chars. Your commit only moves things around, so it might be out of scope for now, but I wonder: what if the kernel actually supported Unicode here? Few cmdline arguments take values where non-ASCII makes sense, but at least some do: for example, a Russian guy is not unlikely to name subvolumes using cyrillic. Supporting that would be easy (estimating the length then utf16s_to_utf8s()). There's just one problem: which encoding to use, but these days, most distributions have either dropped non-UTF8 or hardly pay lip service, so we could get away with hard-coding UTF-8: those few who use ancient charsets can stick to ASCII. Would this be ok? If so, shout, I can code this if you don't care enough. We should, indeed, do proper conversion to UTF-8 here. I also suspect we should assume the input is UTF-16 rather than UCS-2, although that is a bit more exotic. -hpa -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/17] Move unicode to ASCII conversion to shared function.
On Wed, Sep 18, 2013 at 8:44 PM, Adam Borowski kilob...@angband.pl wrote: On Mon, Sep 16, 2013 at 09:11:25PM -0700, Roy Franz wrote: +/* Convert the unicode UEFI command line to ASCII to pass to kernel. + * Size of memory allocated return in *cmd_line_len. + * Returns NULL on error. + */ +static char *efi_convert_cmdline_to_ascii(efi_system_table_t *sys_table_arg, + int load_options_size = image-load_options_size / 2; /* ASCII */ + for (i = 0; i options_size - 1; i++) + *s1++ = *s2++; I'm afraid both this commit and comments inside are misnamed. What you're changing here is the encoding rather than character set. In fact, these days it's 8-bit encodings that are more likely to be Unicode than 16-bit ones: UTF-8 is ubiquitous, while you usually get UCS2 at most. In either case, though, we have here is a 7-bit charset encoded as either 8-bit or 16-bit units. What this function does is blindly truncating upper byte. The supported payload is in both cases ASCII. I'd thus rename the function to what it already does: truncating u16 to u8, and adjust comments accordingly. Replacing values above 126 with a token character like '?' would be good too: that'd avoid producing corrupted characters and/or random ASCII chars. I stuck to re-arranging the code that was there, as I don't know enough about character encodings to propose changes. Also, this code is running as part of the kernel decompressor, rather than the kernel itself, so it doesn't have access to any kernel facilities, and it also needs to be position independent. It's running in a quite limited environment - the decompressor has its own copy of strstr(), and other string functions. I checked the UEFI specification, and it states that all 16 bit strings are UCS-2, unless otherwise noted. The load options that the command line is provided through a void pointer specified as: LoadOptions A pointer to the image's binary load options. I couldn't quickly find any other mentions of LoadOptions in the spec., so I don't know what other types of values could be, or are typically provided. The spec isn't helpful regarding the type of this data, and I don't know what common practice is here among the various EFI/UEFI firmwares out there. Would it be acceptable to fix the naming/comments, and convert values above 126 to '?' in the current patchset, and address a more thorough fix in another patch set? The ARM and ARM64 EFI stub patchsets that are mostly complete depend on this one, so getting this merged soon would be helpful. Something like: +/* Convert the UEFI command line from 16 bit to 8 bit encoding to pass to kernel. + * by truncating to 7 bits. Values above 126 are converted to 63 (ASCII '?') + * Size of memory allocated return in *cmd_line_len. + * Returns NULL on error. + */ +static char *efi_truncate_cmdline_to_8bit_encoding(efi_system_table_t *sys_table_arg, (update code to truncate and convert to '?') Your commit only moves things around, so it might be out of scope for now, but I wonder: what if the kernel actually supported Unicode here? Few cmdline arguments take values where non-ASCII makes sense, but at least some do: for example, a Russian guy is not unlikely to name subvolumes using cyrillic. Supporting that would be easy (estimating the length then tf16us_to_utf8s()). There's just one problem: which encoding to use, but these days, most distributions have either dropped non-UTF8 or hardly pay lip service, so we could get away with hard-coding UTF-8: those few who use ancient charsets can stick to ASCII. Would this be ok? If so, shout, I can code this if you don't care enough. I would certainly appreciate your help improving this, although I think we'll need to have a better understanding of what the UEFI firmware is providing before we can implement something better. -- ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/