Re: unicode and dbf files
On Oct 28, 2:51 am, Ethan Furman et...@stoneleaf.us wrote:
> John Machin wrote:
>> On Oct 27, 7:15 am, Ethan Furman et...@stoneleaf.us wrote:
>>> Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps to cp437, and the file came from a German OEM machine... could that file have upper-ascii codes that will not map to anything reasonable on my \x01 cp437 machine? If so, is there anything I can do about it?
>> ASCII is defined over the first 128 codepoints; "upper-ascii codes" is meaningless. As for the rest of your question, if the file's encoded in cpXXX, it's encoded in cpXXX. If either the creator or the reader or both are lying, then all bets are off.
> My confusion is this -- is there a difference between any of the various cp437s?

What various cp437s???

> Going down the list at ESRI: 0x01, 0x09, 0x0b, 0x0d, 0x0f, 0x11, 0x15, 0x18, 0x19, and 0x1b all map to cp437,

Yes, this is called a many-to-*one* relationship.

> and they have names

"they" being the Language Drivers, not the codepages.

> such as US, Dutch, Finnish, French, German, Italian, Swedish, Spanish, English (Britain & US)... are these all the same?

When you read the Wikipedia page on cp437, did you see any reference to different versions for French, German, Finnish, etc? I saw only one mapping table; how many did you see? If there are multiple language versions of a codepage, how do you expect to handle this given Python has only one codec per codepage?

Trying again: *ONE* attribute of a Language Driver ID (LDID) is the character set (codepage) that it uses. Other attributes may be things like the collating (sorting) sequence, whether they use a dot or a comma as the decimal point, etc. Many different languages in Western Europe can use the same codepage. Initially the common one was cp437, then 850, then 1252.

There may possibly be different interpretations of a codepage out there somewhere, but they are all *intended* to be the same, and I advise you to cross the different-cp437s bridge *if* it exists and you ever come to it. Have you got access to files with LDID not in (0, 1) that you can try out?

Cheers,
John
--
http://mail.python.org/mailman/listinfo/python-list
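The many-to-one relationship John describes can be sketched in a few lines of Python. Note the pairing of language names to specific LDID values below is illustrative, assembled from the names and hex values quoted above; treat the exact assignments as assumptions, not as the authoritative ESRI table.

```python
# Sketch of the many-to-one LDID -> codepage relationship discussed
# above: each of these Language Driver IDs implies a different locale
# (collation, decimal separator, etc.), but all use codepage 437.
# The name-to-ID pairing here is a hypothetical illustration.
CP437_LDIDS = {
    0x01: "US",
    0x09: "Dutch",
    0x0B: "Finnish",
    0x0D: "French",
    0x0F: "German",
    0x11: "Italian",
    0x15: "Swedish",
    0x18: "Spanish",
    0x19: "English (Britain)",
    0x1B: "English (US)",
}

def encoding_for_ldid(ldid):
    """Map an LDID byte to a Python codec name (cp437 group only)."""
    if ldid in CP437_LDIDS:
        return "cp437"
    raise LookupError("unknown LDID: 0x%02x" % ldid)

print(encoding_for_ldid(0x0F))  # the "German" LDID still means cp437
```

The point of the sketch: decoding only needs the codepage, so ten different LDIDs collapse to one codec name.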
Re: unicode and dbf files
John Machin wrote:
> There may possibly be different interpretations of a codepage out there somewhere, but they are all *intended* to be the same, and I advise you to cross the different-cp437s bridge *if* it exists and you ever come to it. Have you got access to files with LDID not in (0, 1) that you can try out?

Alas, I do not. And I probably never will, making the whole thing academic. Speaking of tables I do not have access to, and documentation for that matter, I would love to get information on db4, 5, 7, etc.

Many thanks for your time and knowledge, and my apologies for seeming so dense. :)

Cheers!
~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list
Re: unicode and dbf files
John Machin wrote:
> On Oct 24, 4:14 am, Ethan Furman et...@stoneleaf.us wrote:
>> John Machin wrote:
>>> On Oct 23, 3:03 pm, Ethan Furman et...@stoneleaf.us wrote:
>>>> John Machin wrote:
>>>>> On Oct 23, 7:28 am, Ethan Furman et...@stoneleaf.us wrote:
>>>>>> Greetings, all! I would like to add unicode support to my dbf project. The dbf header has a one-byte field to hold the encoding of the file. For example, \x03 is code-page 437 MS-DOS. My google-fu is apparently not up to the task of locating a complete resource that has a list of the 256 possible values and their corresponding code pages.
>>>>> What makes you imagine that all 256 possible values are mapped to code pages?
>>>> I'm just wanting to make sure I have whatever is available, and preferably standard. :D
>>>>>> So far I have found this, plus variations: http://support.microsoft.com/kb/129631 Does anyone know of anything more complete?
>>>>> That is for VFP3. Try the VFP9 equivalent. dBase 5,5,6,7 use others which are not defined in publicly available dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary source: ESRI support site.
>>>> Well, a couple hours later and still not more than I started with. Thanks for trying, though!
>>> Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search keywords and you couldn't come up with anything??
>> Perhaps "nothing new" would have been a better description. I'd already seen the clicketyclick site (good info there)
> Do you think so? My take is that it leaves out most of the codepage numbers, and these two lines are wrong:
> 65h  Nordic MS-DOS    code page 865
> 66h  Russian MS-DOS   code page 866

That was the site I used to get my whole project going, so ignoring the unicode aspect, it has been very helpful to me.

>> and all I found at ESRI were folks trying to figure it out, plus one link to a list that was no different from the vfp3 list (or was it that the list did not give the hex values? Either way, of no use to me.)
> Try this: http://webhelp.esri.com/arcpad/8.0/referenceguide/

Wow. Question, though: all those codepages mapping to 437 and 850 -- are they really all the same?

>> I looked at dbase.com, but came up empty-handed there (not surprising, since they are a commercial company).
> MS and ESRI have docs ... does that mean that they are non-commercial companies?

I don't know enough about ESRI to make an informed comment, so I'll just say I'm grateful they have them! MS is a complete mystery... perhaps they are finally seeing the light? Hard to believe, though, from a company that has consistently changed their file formats with every release.

>> I searched some more on Microsoft's site in the VFP9 section, and was able to find the code page section this time. Sadly, it only added about seven codes. At any rate, here is what I have come up with so far. Any corrections and/or additions greatly appreciated.
>> code_pages = {
>>     '\x01' : ('ascii', 'U.S. MS-DOS'),
> All of the sources say codepage 437, so why ascii instead of cp437?

Hard to say, really. Adjusted.

>>     '\x02' : ('cp850', 'International MS-DOS'),
>>     '\x03' : ('cp1252', 'Windows ANSI'),
>>     '\x04' : ('mac_roman', 'Standard Macintosh'),
>>     '\x64' : ('cp852', 'Eastern European MS-DOS'),
>>     '\x65' : ('cp866', 'Russian MS-DOS'),
>>     '\x66' : ('cp865', 'Nordic MS-DOS'),
>>     '\x67' : ('cp861', 'Icelandic MS-DOS'),
>>     '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),  # iffy
> Indeed iffy. Python doesn't have a cp895 encoding, and it's probably not alone. I suggest that you omit Kamenicky until someone actually wants it.

Yeah, I noticed that. Tentative plan was to implement it myself (more for practice than anything else), and also to be able to raise a more specific error ("Kamenicky not currently supported" or some such).

>>     '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'),  # iffy
> Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia predates and is not the same as cp852. In any case, I suggest that you omit Mazovia until someone wants it. Interesting reading: http://www.jastra.com.pl/klub/ogonki.htm

Very interesting reading.

>>     '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
>>     '\x6b' : ('cp857', 'Turkish MS-DOS'),
>>     '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan) Windows'),  # wag
> big5 is *not* the same as cp950. The products that create DBF files were designed for Windows. So when your source says that LDID 0xXX maps to Windows codepage YYY, I would suggest that all you should do is translate that without thinking to python encoding cpYYY.

Ack. Not sure how I missed 'Windows' at the end of that description.

> What does "wag" mean?

wag == 'wild ass guess'

>>     '\x79' : ('iso2022_kr', 'Korean Windows'),  # wag
> Try cp949.

Done.

>>     '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore) Windows'),  # wag
> Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic
Re: unicode and dbf files
On Oct 27, 3:22 am, Ethan Furman et...@stoneleaf.us wrote:
> John Machin wrote:
>> Try this: http://webhelp.esri.com/arcpad/8.0/referenceguide/
> Wow. Question, though: all those codepages mapping to 437 and 850 -- are they really all the same?

437 and 850 *are* codepages. You mean all those language driver IDs mapping to codepages 437 and 850. A codepage merely gives an encoding. An LDID is like a locale; it includes other things besides the encoding. That's why many Western European languages map to the same codepage, first 437, then later 850, then 1252 when Windows came along.

>>> '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),  # iffy
>> Indeed iffy. Python doesn't have a cp895 encoding, and it's probably not alone. I suggest that you omit Kamenicky until someone actually wants it.
> Yeah, I noticed that. Tentative plan was to implement it myself (more for practice than anything else), and also to be able to raise a more specific error ("Kamenicky not currently supported" or some such).

The error idea is fine, but I don't get the "implement it yourself for practice" bit ... practice what? You plan a long and fruitful career implementing codecs for YAGNI codepages?

>>> '\x7b' : ('iso2022_jp', 'Japanese Windows'),  # wag
>> Try cp936.
> You mean 932?

Yes.

> Very helpful indeed. Many thanks for reviewing and correcting.

You're welcome.

> Learning to deal with unicode is proving more difficult for me than learning Python was to begin with! ;D

?? As far as I can tell, the topic has been about mapping from something like a locale to the name of an encoding, i.e. all about the pre-Unicode mishmash and nothing to do with dealing with unicode ...

BTW, what are you planning to do with an LDID of 0x00?

Cheers,
John
--
http://mail.python.org/mailman/listinfo/python-list
Re: unicode and dbf files
John Machin wrote:
> On Oct 27, 3:22 am, Ethan Furman et...@stoneleaf.us wrote:
>> John Machin wrote:
>>> Try this: http://webhelp.esri.com/arcpad/8.0/referenceguide/
>> Wow. Question, though: all those codepages mapping to 437 and 850 -- are they really all the same?
> 437 and 850 *are* codepages. You mean all those language driver IDs mapping to codepages 437 and 850. A codepage merely gives an encoding. An LDID is like a locale; it includes other things besides the encoding. That's why many Western European languages map to the same codepage, first 437, then later 850, then 1252 when Windows came along.

Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps to cp437, and the file came from a German OEM machine... could that file have upper-ascii codes that will not map to anything reasonable on my \x01 cp437 machine? If so, is there anything I can do about it?

>>>> '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),  # iffy
>>> Indeed iffy. Python doesn't have a cp895 encoding, and it's probably not alone. I suggest that you omit Kamenicky until someone actually wants it.
>> Yeah, I noticed that. Tentative plan was to implement it myself (more for practice than anything else), and also to be able to raise a more specific error ("Kamenicky not currently supported" or some such).
> The error idea is fine, but I don't get the "implement it yourself for practice" bit ... practice what? You plan a long and fruitful career implementing codecs for YAGNI codepages?

ROFL. Playing with code; the unicode/code page interactions. Possibly looking at constructs I might not otherwise. Since this would almost certainly (I don't like saying "absolutely" and "never" -- been troubleshooting for too many years for that!-) be a YAGNI, implementing it is very low priority.

>>>> '\x7b' : ('iso2022_jp', 'Japanese Windows'),  # wag
>>> Try cp936.
>> You mean 932?
> Yes.
>> Very helpful indeed. Many thanks for reviewing and correcting.
> You're welcome.
>> Learning to deal with unicode is proving more difficult for me than learning Python was to begin with! ;D
> ?? As far as I can tell, the topic has been about mapping from something like a locale to the name of an encoding, i.e. all about the pre-Unicode mishmash and nothing to do with dealing with unicode ...

You are, of course, correct. Once it's all unicode life will be easier (he says, all innocent-like). And dbf files even bigger, lol.

> BTW, what are you planning to do with an LDID of 0x00?

Hmmm. Well, logical choices seem to be either treating it as plain ascii, and barfing when high-ascii shows up; defaulting to \x01; or forcing the user to choose one on initial access. I am definitely open to ideas!

> Cheers,
> John
--
http://mail.python.org/mailman/listinfo/python-list
Re: unicode and dbf files
On Oct 27, 7:15 am, Ethan Furman et...@stoneleaf.us wrote:
> John Machin wrote:
>> On Oct 27, 3:22 am, Ethan Furman et...@stoneleaf.us wrote:
>>> John Machin wrote:
>>>> Try this: http://webhelp.esri.com/arcpad/8.0/referenceguide/
>>> Wow. Question, though: all those codepages mapping to 437 and 850 -- are they really all the same?
>> 437 and 850 *are* codepages. You mean all those language driver IDs mapping to codepages 437 and 850. A codepage merely gives an encoding. An LDID is like a locale; it includes other things besides the encoding. That's why many Western European languages map to the same codepage, first 437, then later 850, then 1252 when Windows came along.
> Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps to cp437, and the file came from a German OEM machine... could that file have upper-ascii codes that will not map to anything reasonable on my \x01 cp437 machine? If so, is there anything I can do about it?

ASCII is defined over the first 128 codepoints; "upper-ascii codes" is meaningless. As for the rest of your question, if the file's encoded in cpXXX, it's encoded in cpXXX. If either the creator or the reader or both are lying, then all bets are off.

>> BTW, what are you planning to do with an LDID of 0x00?
> Hmmm. Well, logical choices seem to be either treating it as plain ascii, and barfing when high-ascii shows up; defaulting to \x01; or forcing the user to choose one on initial access.

It would be more useful to allow the user to specify an encoding than an LDID. You need to be able to read files created not only by software like VFP or dBase but also scripts using third-party libraries. It would be useful to allow an encoding to override an LDID that is incorrect, e.g. the LDID implies cp1251 but the data is actually encoded in koi8[ru]. Read this: http://en.wikipedia.org/wiki/Code_page_437

With no LDID in the file and no encoding supplied, I'd be inclined to make it barf if any codepoint not in range(32, 128) showed up.

Cheers,
John
--
http://mail.python.org/mailman/listinfo/python-list
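John's suggestion -- let a caller-supplied encoding override the LDID, and be strict when neither is available -- could be sketched as below. The function name, parameters, and the three-entry table excerpt are hypothetical illustrations, not an actual dbf library API.

```python
# Hedged sketch of the policy described above: an explicit encoding
# always wins over the file's LDID; with LDID 0x00 and no override,
# only codepoints in range(32, 128) are accepted.
LDID_TO_CODEC = {0x01: "cp437", 0x02: "cp850", 0x03: "cp1252"}  # excerpt

def decode_field(raw, ldid=0x00, encoding=None):
    """Decode one character field from a dbf record (hypothetical API)."""
    if encoding is not None:         # explicit override, e.g. 'koi8_r'
        return raw.decode(encoding)
    if ldid:                         # trust the file's LDID if present
        return raw.decode(LDID_TO_CODEC[ldid])
    # No LDID, no override: barf on anything outside range(32, 128).
    if any(b < 32 or b > 127 for b in raw):
        raise ValueError("non-ASCII data but no LDID or encoding given")
    return raw.decode("ascii")

decode_field(b"\x87ber", ldid=0x01)          # cp437: b'\x87' -> 'ç'
decode_field(b"\x87ber", encoding="cp1252")  # override wins: b'\x87' -> '‡'
```

The two calls at the end show why the override matters: the same byte decodes differently depending on which codec the caller forces.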
Re: unicode and dbf files
On Oct 24, 4:14 am, Ethan Furman et...@stoneleaf.us wrote:
> John Machin wrote:
>> On Oct 23, 3:03 pm, Ethan Furman et...@stoneleaf.us wrote:
>>> John Machin wrote:
>>>> On Oct 23, 7:28 am, Ethan Furman et...@stoneleaf.us wrote:
>>>>> Greetings, all! I would like to add unicode support to my dbf project. The dbf header has a one-byte field to hold the encoding of the file. For example, \x03 is code-page 437 MS-DOS. My google-fu is apparently not up to the task of locating a complete resource that has a list of the 256 possible values and their corresponding code pages.
>>>> What makes you imagine that all 256 possible values are mapped to code pages?
>>> I'm just wanting to make sure I have whatever is available, and preferably standard. :D
>>>>> So far I have found this, plus variations: http://support.microsoft.com/kb/129631 Does anyone know of anything more complete?
>>>> That is for VFP3. Try the VFP9 equivalent. dBase 5,5,6,7 use others which are not defined in publicly available dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary source: ESRI support site.
>>> Well, a couple hours later and still not more than I started with. Thanks for trying, though!
>> Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search keywords and you couldn't come up with anything??
> Perhaps "nothing new" would have been a better description. I'd already seen the clicketyclick site (good info there)

Do you think so? My take is that it leaves out most of the codepage numbers, and these two lines are wrong:
65h  Nordic MS-DOS    code page 865
66h  Russian MS-DOS   code page 866

> and all I found at ESRI were folks trying to figure it out, plus one link to a list that was no different from the vfp3 list (or was it that the list did not give the hex values? Either way, of no use to me.)

Try this: http://webhelp.esri.com/arcpad/8.0/referenceguide/

> I looked at dbase.com, but came up empty-handed there (not surprising, since they are a commercial company).

MS and ESRI have docs ... does that mean that they are non-commercial companies?

> I searched some more on Microsoft's site in the VFP9 section, and was able to find the code page section this time. Sadly, it only added about seven codes. At any rate, here is what I have come up with so far. Any corrections and/or additions greatly appreciated.
> code_pages = {
>     '\x01' : ('ascii', 'U.S. MS-DOS'),

All of the sources say codepage 437, so why ascii instead of cp437?

>     '\x02' : ('cp850', 'International MS-DOS'),
>     '\x03' : ('cp1252', 'Windows ANSI'),
>     '\x04' : ('mac_roman', 'Standard Macintosh'),
>     '\x64' : ('cp852', 'Eastern European MS-DOS'),
>     '\x65' : ('cp866', 'Russian MS-DOS'),
>     '\x66' : ('cp865', 'Nordic MS-DOS'),
>     '\x67' : ('cp861', 'Icelandic MS-DOS'),
>     '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),  # iffy

Indeed iffy. Python doesn't have a cp895 encoding, and it's probably not alone. I suggest that you omit Kamenicky until someone actually wants it.

>     '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'),  # iffy

Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia predates and is not the same as cp852. In any case, I suggest that you omit Mazovia until someone wants it. Interesting reading: http://www.jastra.com.pl/klub/ogonki.htm

>     '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
>     '\x6b' : ('cp857', 'Turkish MS-DOS'),
>     '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan) Windows'),  # wag

big5 is *not* the same as cp950. The products that create DBF files were designed for Windows. So when your source says that LDID 0xXX maps to Windows codepage YYY, I would suggest that all you should do is translate that without thinking to python encoding cpYYY. What does "wag" mean?

>     '\x79' : ('iso2022_kr', 'Korean Windows'),  # wag

Try cp949.

>     '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore) Windows'),  # wag

Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic (1980) Chinese (GB2312) and a basic Korean kit. However, to quote from CJKV Information Processing by Ken Lunde, "... from a practical point of view, ISO-2022-JP-2 ... [is] equivalent to ISO-2022-JP-1 encoding", i.e. no Chinese support at all. Try cp936.

>     '\x7b' : ('iso2022_jp', 'Japanese Windows'),  # wag

Try cp936.

>     '\x7c' : ('cp874', 'Thai Windows'),  # wag
>     '\x7d' : ('cp1255', 'Hebrew Windows'),
>     '\x7e' : ('cp1256', 'Arabic Windows'),
>     '\xc8' : ('cp1250', 'Eastern European Windows'),
>     '\xc9' : ('cp1251', 'Russian Windows'),
>     '\xca' : ('cp1254', 'Turkish Windows'),
>     '\xcb' : ('cp1253', 'Greek Windows'),
>     '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
>     '\x97' : ('mac_latin2', 'Macintosh EE'),
>     '\x98' : ('mac_greek', 'Greek Macintosh')
>     }

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list
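A quick way to check which of the names in a table like this Python can actually decode is `codecs.lookup()`, which is exactly how the missing cp895 would show up. This is a small sketch; the candidate list is an excerpt of the codec names discussed in the post above.

```python
import codecs

# Excerpt of the codec names proposed or suggested in the thread.
candidates = ["cp437", "cp850", "cp1252", "mac_roman",
              "cp949", "cp932", "cp895"]

for name in candidates:
    try:
        codecs.lookup(name)  # raises LookupError if Python has no codec
        print(name, "-> available")
    except LookupError:
        print(name, "-> no such codec in Python")  # e.g. cp895 (Kamenicky)
```

Running this confirms John's point: every suggested cpYYY name resolves except cp895, so a table entry for Kamenicky would need either a custom codec or a specific "not supported" error.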
Re: unicode and dbf files
John Machin wrote:
> On Oct 23, 3:03 pm, Ethan Furman et...@stoneleaf.us wrote:
>> John Machin wrote:
>>> On Oct 23, 7:28 am, Ethan Furman et...@stoneleaf.us wrote:
>>>> Greetings, all! I would like to add unicode support to my dbf project. The dbf header has a one-byte field to hold the encoding of the file. For example, \x03 is code-page 437 MS-DOS. My google-fu is apparently not up to the task of locating a complete resource that has a list of the 256 possible values and their corresponding code pages.
>>> What makes you imagine that all 256 possible values are mapped to code pages?
>> I'm just wanting to make sure I have whatever is available, and preferably standard. :D
>>>> So far I have found this, plus variations: http://support.microsoft.com/kb/129631 Does anyone know of anything more complete?
>>> That is for VFP3. Try the VFP9 equivalent. dBase 5,5,6,7 use others which are not defined in publicly available dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary source: ESRI support site.
>> Well, a couple hours later and still not more than I started with. Thanks for trying, though!
> Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search keywords and you couldn't come up with anything??

Perhaps "nothing new" would have been a better description. I'd already seen the clicketyclick site (good info there), and all I found at ESRI were folks trying to figure it out, plus one link to a list that was no different from the vfp3 list (or was it that the list did not give the hex values? Either way, of no use to me.) I looked at dbase.com, but came up empty-handed there (not surprising, since they are a commercial company). I searched some more on Microsoft's site in the VFP9 section, and was able to find the code page section this time. Sadly, it only added about seven codes.

At any rate, here is what I have come up with so far. Any corrections and/or additions greatly appreciated.

code_pages = {
    '\x01' : ('ascii', 'U.S. MS-DOS'),
    '\x02' : ('cp850', 'International MS-DOS'),
    '\x03' : ('cp1252', 'Windows ANSI'),
    '\x04' : ('mac_roman', 'Standard Macintosh'),
    '\x64' : ('cp852', 'Eastern European MS-DOS'),
    '\x65' : ('cp866', 'Russian MS-DOS'),
    '\x66' : ('cp865', 'Nordic MS-DOS'),
    '\x67' : ('cp861', 'Icelandic MS-DOS'),
    '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),  # iffy
    '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'),   # iffy
    '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
    '\x6b' : ('cp857', 'Turkish MS-DOS'),
    '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan) Windows'),  # wag
    '\x79' : ('iso2022_kr', 'Korean Windows'),       # wag
    '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore) Windows'),  # wag
    '\x7b' : ('iso2022_jp', 'Japanese Windows'),     # wag
    '\x7c' : ('cp874', 'Thai Windows'),              # wag
    '\x7d' : ('cp1255', 'Hebrew Windows'),
    '\x7e' : ('cp1256', 'Arabic Windows'),
    '\xc8' : ('cp1250', 'Eastern European Windows'),
    '\xc9' : ('cp1251', 'Russian Windows'),
    '\xca' : ('cp1254', 'Turkish Windows'),
    '\xcb' : ('cp1253', 'Greek Windows'),
    '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
    '\x97' : ('mac_latin2', 'Macintosh EE'),
    '\x98' : ('mac_greek', 'Greek Macintosh'),
    }

~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list
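To show how a table like code_pages would be used, here is a hedged sketch that looks up the header byte and decodes a raw field value. The three-entry dict is an excerpt of the table in the post above; the function itself is an illustration, not from any actual dbf library.

```python
# Sketch: pick a codec from the header's codepage byte, then decode
# a raw character field with it. Excerpt of the full table above.
code_pages = {
    b'\x02': ('cp850', 'International MS-DOS'),
    b'\x03': ('cp1252', 'Windows ANSI'),
    b'\x65': ('cp866', 'Russian MS-DOS'),
}

def decode_with_table(codepage_byte, raw):
    codec, description = code_pages[codepage_byte]
    return raw.decode(codec)

# cp866 maps b'\x8f\xe0\xa8\xa2\xa5\xe2' to the Russian word 'Привет'.
print(decode_with_table(b'\x65', b'\x8f\xe0\xa8\xa2\xa5\xe2'))
# cp1252 maps b'\xe9' to 'é'.
print(decode_with_table(b'\x03', b'caf\xe9'))
```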
unicode and dbf files
Greetings, all!

I would like to add unicode support to my dbf project. The dbf header has a one-byte field to hold the encoding of the file. For example, \x03 is code-page 437 MS-DOS. My google-fu is apparently not up to the task of locating a complete resource that has a list of the 256 possible values and their corresponding code pages.

So far I have found this, plus variations: http://support.microsoft.com/kb/129631

Does anyone know of anything more complete?

~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list
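For context, the one-byte field Ethan mentions sits in the fixed 32-byte part of the dbf header. The sketch below assumes the common layout where byte 29 (0-based) holds that codepage/LDID marker; that offset is an assumption from general dbf references, not something stated in the thread.

```python
def read_ldid(path):
    """Return the language driver ID byte from a dbf file header.

    Assumes the common dbf layout in which byte 29 (0-based) of the
    32-byte fixed header holds the codepage/LDID marker.
    """
    with open(path, "rb") as f:
        header = f.read(32)
    if len(header) < 32:
        raise ValueError("file too short to be a dbf")
    return header[29]
```

Given a file whose header byte 29 is \x03, `read_ldid` would return 0x03, which the list being sought here would then map to a codec name.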
Re: unicode and dbf files
On Oct 23, 7:28 am, Ethan Furman et...@stoneleaf.us wrote:
> Greetings, all! I would like to add unicode support to my dbf project. The dbf header has a one-byte field to hold the encoding of the file. For example, \x03 is code-page 437 MS-DOS. My google-fu is apparently not up to the task of locating a complete resource that has a list of the 256 possible values and their corresponding code pages.

What makes you imagine that all 256 possible values are mapped to code pages?

> So far I have found this, plus variations: http://support.microsoft.com/kb/129631 Does anyone know of anything more complete?

That is for VFP3. Try the VFP9 equivalent. dBase 5,5,6,7 use others which are not defined in publicly available dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary source: ESRI support site.
--
http://mail.python.org/mailman/listinfo/python-list
Re: unicode and dbf files
John Machin wrote:
> On Oct 23, 7:28 am, Ethan Furman et...@stoneleaf.us wrote:
>> Greetings, all! I would like to add unicode support to my dbf project. The dbf header has a one-byte field to hold the encoding of the file. For example, \x03 is code-page 437 MS-DOS. My google-fu is apparently not up to the task of locating a complete resource that has a list of the 256 possible values and their corresponding code pages.
> What makes you imagine that all 256 possible values are mapped to code pages?

I'm just wanting to make sure I have whatever is available, and preferably standard. :D

>> So far I have found this, plus variations: http://support.microsoft.com/kb/129631 Does anyone know of anything more complete?
> That is for VFP3. Try the VFP9 equivalent. dBase 5,5,6,7 use others which are not defined in publicly available dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary source: ESRI support site.

Well, a couple hours later and still not more than I started with. Thanks for trying, though!

~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list