On Tue, Apr 7, 2009 at 11:59 AM, Pirritano, Matthew <mpirrit...@ochca.com> wrote:
> I did get an error…
>
> > Traceback (most recent call last):
> >   File "C:\Projects\unicode_convert.py", line 8, in <module>
> >     outp.write(outLine.strip()+'\n')
> > UnicodeEncodeError: 'ascii' codec can't encode characters in position 640-641: ordinal not in range(128)
>
> Should I be worried about this. And where does this message indicate that
> the error is. And what is the error?

That error is telling you that your input file contains character(s) that don't have a valid representation in ASCII (which is the AMERICAN Standard Code for Information Interchange - no furrin languages need apply!) I came into this conversation late (for which I apologize), so I don't actually know what your input file contains; the fact that it's encoded in UTF-16 indicates to me that its creators anticipated that there might be non-English symbols in it.

Where is the error... well, it's at positions 640-641 of _some_ line. Those numbers are offsets into the string being written, counted in characters from zero, not bytes in the UTF-16 file, so if you open the file in a text editor you're looking for roughly the 641st and 642nd characters of the line. Unfortunately we don't know which line, 'cause we're not keeping track of line numbers.

Something I often do in similar situations: add print statements temporarily (printing to the screen is a significant performance drag in the middle of a loop, so I comment out or delete the prints when it's time for production.) Two approaches: add an index variable and print just the number, or print the data you're about to try to save to the file. In your case, I'd go with the index, 'cause your lines are apparently much longer than screen width.

    index = 1
    for outLine in inp:
        print(index)
        outp.write(outLine.rstrip() + '\n')
        index += 1

You'll have a long string of numbers, followed by the error/traceback.
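
A rough sketch of the same index idea without the wall of numbers: try the encoding on each line and report only the lines that fail. This is written in Python 3 syntax and works on an in-memory list instead of your files; the function name find_unencodable and the sample lines are made up for illustration, so treat it as a starting point, not a drop-in replacement for your script.

```python
def find_unencodable(lines, encoding="ascii"):
    """Return (line_number, start, end, bad_text) for each line that won't encode."""
    problems = []
    for index, line in enumerate(lines, start=1):
        text = line.rstrip() + "\n"
        try:
            text.encode(encoding)
        except UnicodeEncodeError as err:
            # err.start and err.end are character offsets into `text`;
            # err.end is exclusive, so the slice below is the offending run.
            problems.append((index, err.start, err.end, text[err.start:err.end]))
    return problems


# Hypothetical sample data standing in for the lines of the input file.
sample = ["plain ascii line\n", "caf\u00e9 au lait\n", "another plain line\n"]
for line_number, start, end, bad in find_unencodable(sample):
    print(line_number, start, end, repr(bad))
```

With a real file you would pass the opened input file in place of the sample list. Note that a strict encode stops at the first problem in a line, so a line with several bad spots is reported once.
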
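With the print-the-index approach, the last number printed before the error should be the line number that caused it; open your file in a decent text editor, go to that line, and look around column 641 (the positions in the traceback are character offsets counted from zero, and strip() may have shifted things a little if the line started with whitespace) - that will show you which characters caused your problem.

What to do about it, though: the "codecs" module has several ways of dealing with errors in encoding/decoding; it defaults to 'strict', which is what you're seeing. Other choices include 'replace' (which, when encoding, substitutes a question mark for each character that can't be represented) and 'ignore' (which just drops the offending character from the output.)

One catch: the traceback points at outp.write, so it's the output side that is failing to encode to ASCII, and that's where the error handler has to go. I haven't seen the line that opens your output file, so with outfile standing in for whatever path you use there, change it to:

    outp = codecs.open(outfile, 'w', 'ascii', 'replace')

or

    outp = codecs.open(outfile, 'w', 'ascii', 'ignore')

depending on your preference. (Adding the same argument to your input line, as in

    inp = codecs.open('g:\\data\\amm\\text files\\test20090320.txt', 'r', 'utf-16', 'replace')

is harmless, but it only affects how the UTF-16 input is decoded, so on its own it won't make this error go away.) If you'd rather keep the characters than lose them, using 'utf-8' in place of 'ascii' for the output should avoid the error altogether, as long as whatever reads that file can handle UTF-8.

Have fun -

--
www.fsrtechnologies.com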
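
P.S. To see what the error handlers actually do to the text before committing to one, here is a small sketch in Python 3 syntax. It encodes a string in memory and never touches your files; the sample string and the helper name encode_ascii are invented for the illustration.

```python
text = "na\u00efve caf\u00e9\n"  # made-up sample containing two non-ASCII characters


def encode_ascii(s, errors):
    """Encode s to ASCII bytes using the named error handler."""
    return s.encode("ascii", errors)


print(encode_ascii(text, "replace"))  # b'na?ve caf?\n'
print(encode_ascii(text, "ignore"))   # b'nave caf\n'

try:
    encode_ascii(text, "strict")
except UnicodeEncodeError as err:
    # 'strict' is the default handler and raises the same kind of error
    # as the one in the traceback above.
    print("strict fails:", err)
```

Both 'replace' and 'ignore' lose information, so it's worth eyeballing a few affected lines before deciding that's acceptable for your data.
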
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor