Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On Wed, 05 Oct 2011 21:39:17 -0700, Greg wrote:

> Here is the final code for those who are struggling with similar
> problems:
>
> ## open and decode file
> # In this case, the encoding comes from the charset argument in a meta tag
> # e.g. meta charset=iso-8859-2
> fileObj = open(filePath,"r").read()
> fileContent = fileObj.decode("iso-8859-2")
> fileSoup = BeautifulSoup(fileContent)

The fileObj.decode() step should be unnecessary, and is usually
undesirable; Beautiful Soup should be doing the decoding itself.

If you actually know the encoding (e.g. from a Content-Type header), you
can specify it via the fromEncoding parameter to the BeautifulSoup
constructor, e.g.:

    fileSoup = BeautifulSoup(fileObj.read(), fromEncoding="iso-8859-2")

If you don't specify the encoding, it will be deduced from a meta tag if
one is present, or a Unicode BOM, or using the chardet library if
available, or using built-in heuristics, before finally falling back to
Windows-1252 (which seems to be the preferred encoding of people who
don't understand what an encoding is or why it needs to be specified).

--
http://mail.python.org/mailman/listinfo/python-list
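The fallback order described above can be sketched with the standard library alone. This is not Beautiful Soup's actual implementation: guess_encoding, BOMS and META_CHARSET are hypothetical names made up for illustration, the chardet and heuristics steps are left out, and the code is written for Python 3 rather than the Python 2 used in this thread.

```python
import codecs
import re

# Hypothetical helper, NOT Beautiful Soup's real code: it only mimics the
# order of checks described in the message above.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]
META_CHARSET = re.compile(
    rb"<meta[^>]*charset=[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def guess_encoding(raw, declared=None):
    """Return an encoding name for the bytes in `raw`."""
    if declared:                            # caller already knows the encoding
        return declared
    match = META_CHARSET.search(raw[:4096])  # charset named in a meta tag
    if match:
        return match.group(1).decode("ascii")
    for bom, name in BOMS:                  # Unicode byte order mark
        if raw.startswith(bom):
            return name
    return "windows-1252"                   # last-resort fallback


page = (b"<html><head><meta charset=iso-8859-2></head>"
        b"<body>zak\xb3adnik\xf3w</body></html>")
encoding = guess_encoding(page)
text = page.decode(encoding)
```

Passing declared="iso-8859-2" plays the role that fromEncoding plays in the constructor call: an explicit value skips all guessing.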
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On 06.10.2011 05:40, Steven D'Aprano wrote:

> (4) Do all your processing in Unicode, not bytes.
> (5) Encode the text into bytes using UTF-8 encoding.
> (6) Write the bytes to a file.

Just wondering, why do you split the latter two parts? I would have used
codecs.open() to open the file and define the encoding in a single step.
Is there a downside to this approach?

Otherwise, I can only confirm that your overall approach is the easiest
way to get correct results.

Uli
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On Thu, Oct 6, 2011 at 8:29 PM, Ulrich Eckhardt
ulrich.eckha...@dominalaser.com wrote:

> Just wondering, why do you split the latter two parts? I would have used
> codecs.open() to open the file and define the encoding in a single step.
> Is there a downside to this approach?

Those two steps still happen, even if you achieve them in a single
function call. What Steven described is language- and
library-independent.

ChrisA
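A minimal sketch of that point, assuming Python 3 and throwaway file names in a temporary directory (none of these names come from the thread): encoding by hand and then writing bytes produces the same file as letting codecs.open() do both in one call.

```python
import codecs
import os
import tempfile

text = u"Branie zak\u0142adnik\u00f3w"   # Unicode text, already decoded

tmpdir = tempfile.mkdtemp()
two_step_path = os.path.join(tmpdir, "two_step.txt")   # hypothetical names
one_step_path = os.path.join(tmpdir, "one_step.txt")

# Steps (5) and (6) spelled out: encode to bytes, then write the bytes.
data = text.encode("utf-8")
with open(two_step_path, "wb") as f:
    f.write(data)

# The same two steps behind one call: codecs.open() encodes on write.
with codecs.open(one_step_path, "w", encoding="utf-8") as f:
    f.write(text)

# Read both files back as raw bytes and compare them.
with open(two_step_path, "rb") as f:
    two_step_bytes = f.read()
with open(one_step_path, "rb") as f:
    one_step_bytes = f.read()

same = two_step_bytes == one_step_bytes
```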
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On 6 Oct, 06:39, Greg gregor.hochsch...@googlemail.com wrote:

> Brilliant! It worked. Thanks!
>
> Here is the final code for those who are struggling with similar
> problems:
>
> ## open and decode file
> # In this case, the encoding comes from the charset argument in a meta tag
> # e.g. meta charset=iso-8859-2
> fileObj = open(filePath,"r").read()
> fileContent = fileObj.decode("iso-8859-2")
> fileSoup = BeautifulSoup(fileContent)
>
> ## Do some BeautifulSoup magic and preserve unicode, presume result is saved in 'text' ##
>
> ## write extracted text to file
> f = open(outFilePath, 'w')
> f.write(text.encode('utf-8'))
> f.close()

or (Python2/Python3)

>>> import io
>>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
...     r = f.read()
...
>>> repr(r)
u'a\nb\nc\n'
>>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f:
...     t = f.write(r)
...
>>> f.closed
True

jmf
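One detail of the session above worth spelling out: the 'utf-8-sig' codec writes a UTF-8 byte order mark before the text. A small sketch, assuming Python 3 and made-up file contents in a temporary directory, that reads the output file back as raw bytes to show it:

```python
import io
import os
import tempfile

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "abc.txt")   # same names as the session above
dst = os.path.join(tmpdir, "def.txt")

# Create a small ISO-8859-2 input file (contents invented for illustration).
with io.open(src, "wb") as f:
    f.write(b"zak\xb3adnik\xf3w\n")

# Decode while reading: the file object hands back Unicode text.
with io.open(src, "r", encoding="iso-8859-2") as f:
    r = f.read()

# Encode while writing; newline="" turns off newline translation so the
# output bytes are the same on every platform.
with io.open(dst, "w", encoding="utf-8-sig", newline="") as f:
    f.write(r)

# Look at what actually landed on disk.
with io.open(dst, "rb") as f:
    raw = f.read()

has_bom = raw.startswith(b"\xef\xbb\xbf")
```

Plain 'utf-8' would write the same bytes without the three-byte prefix.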
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On Thursday 2011 October 06 10:41, jmfauth wrote:

> or (Python2/Python3)
>
> >>> import io
> >>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
> ...     r = f.read()
> ...
> >>> repr(r)
> u'a\nb\nc\n'
> >>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f:
> ...     t = f.write(r)
> ...
> >>> f.closed
> True
>
> jmf

What is this io of which you speak?

--
I have seen the future and I am not in it.
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
In mailman.1785.1317928997.27778.python-l...@python.org xDog Walker
thud...@gmail.com writes:

> What is this io of which you speak?

It was introduced in Python 2.6.

--
John Gordon                   A is for Amy, who fell down the stairs
gor...@panix.com              B is for Basil, assaulted by bears
                                -- Edward Gorey, The Gashlycrumb Tinies
encoding problem with BeautifulSoup - problem when writing parsed text to file
Hi,

I am having some encoding problems when I first parse stuff from a
non-English website using BeautifulSoup and then write the results to a
txt file.

I have the text both as a normal (text) and as a unicode string (utext):

print repr(text)
'Branie zak\xc2\xb3adnik\xc3\xb3w'

print repr(utext)
u'Branie zak\xb3adnik\xf3w'

print text or print utext (fileSoup.prettify() also shows 'wrong'
symbols):
Branie zak³adników

Now I am trying to save this to a file but I never get the encoding
right. Here is what I tried (+ lots of different things with encode,
decode...):

outFile=open(filePath,"w")
outFile.write(text)
outFile.close()

outFile=codecs.open( filePath, "w", "UTF8" )
outFile.write(utext)
outFile.close()

Thanks!!
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On Wed, 05 Oct 2011 16:35:59 -0700, Greg wrote:

> Hi, I am having some encoding problems when I first parse stuff from a
> non-English website using BeautifulSoup and then write the results to a
> txt file.

If you haven't already read this, you should do so:

http://www.joelonsoftware.com/articles/Unicode.html

> I have the text both as a normal (text) and as a unicode string (utext):
> print repr(text)
> 'Branie zak\xc2\xb3adnik\xc3\xb3w'

This is pretty much meaningless, because we don't know how you got the
text and what it actually is. You're showing us a bunch of bytes, with no
clue as to whether they are the right bytes or not. Considering that your
Unicode text is also incorrect, I would say it is *not* right and your
description of the problem is 100% backwards: the problem is not
*writing* the text, but *reading* the bytes and decoding it.

You should do something like this:

(1) Inspect the web page to find out what encoding is actually used.

(2) If the web page doesn't know what encoding it uses, or if it uses
    bits and pieces of different encodings, then the source is broken
    and you shouldn't expect much better results. You could try
    guessing, but you should expect mojibake in your results.

    http://en.wikipedia.org/wiki/Mojibake

(3) Decode the web page into Unicode text, using the correct encoding.

(4) Do all your processing in Unicode, not bytes.

(5) Encode the text into bytes using UTF-8 encoding.

(6) Write the bytes to a file.

[...]

> Now I am trying to save this to a file but I never get the encoding
> right. Here is what I tried (+ lots of different things with encode,
> decode...):
>
> outFile=codecs.open( filePath, "w", "UTF8" )
> outFile.write(utext)
> outFile.close()

That's the correct approach, but it won't help you if utext contains the
wrong characters in the first place. The critical step is taking the
bytes in the web page and turning them into text.

How are you generating utext?

--
Steven
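The values Greg posted fit this diagnosis. The sketch below is an inference from those values, not something anyone in the thread ran: it assumes the page bytes were ISO-8859-2 (the charset Greg later reports) and that they were decoded as Latin-1 somewhere along the way (Windows-1252 would give the same result for these two bytes). It is written for Python 3, where bytes and text are separate types.

```python
# Assumed raw bytes of the phrase as served by the page (ISO-8859-2).
raw = b"Branie zak\xb3adnik\xf3w"

wrong = raw.decode("latin-1")       # wrong codec: byte 0xB3 becomes U+00B3
right = raw.decode("iso-8859-2")    # right codec: byte 0xB3 becomes U+0142

# Encoding the mis-decoded text as UTF-8 reproduces the byte string shown
# by print repr(text) in the original post.
wrong_utf8 = wrong.encode("utf-8")

# Because Latin-1 maps every byte to exactly one code point, the original
# bytes can be recovered and decoded again with the right codec.
repaired = wrong.encode("latin-1").decode("iso-8859-2")
```

Here wrong matches the utext value from the original post, and right is the Polish phrase with its proper letters, so writing either string out as UTF-8 succeeds; only the decoding step decides which one ends up in the file.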
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
Brilliant! It worked. Thanks!

Here is the final code for those who are struggling with similar
problems:

## open and decode file
# In this case, the encoding comes from the charset argument in a meta tag
# e.g. meta charset=iso-8859-2
fileObj = open(filePath,"r").read()
fileContent = fileObj.decode("iso-8859-2")
fileSoup = BeautifulSoup(fileContent)

## Do some BeautifulSoup magic and preserve unicode, presume result is saved in 'text' ##

## write extracted text to file
f = open(outFilePath, 'w')
f.write(text.encode('utf-8'))
f.close()
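The advice in this thread to do all processing on Unicode text, and to encode only at the very end, can be illustrated with a small sketch. It assumes Python 3, and the sample word is taken from the phrase in the original post; the variable names are invented for illustration.

```python
text = u"zak\u0142adnik\u00f3w"     # the word as Unicode text
data = text.encode("utf-8")         # the same word as UTF-8 bytes

char_count = len(text)              # counts characters
byte_count = len(data)              # counts bytes; two letters take 2 bytes each

first_four_chars = text[:4]         # slicing text keeps whole characters
first_four_bytes = data[:4]         # slicing bytes cuts a two-byte letter in half

# The truncated byte string is no longer valid UTF-8.
try:
    first_four_bytes.decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```

Counting, slicing and searching behave as expected on the decoded text, while the same operations on encoded bytes can split a character.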
Re: encoding problem with BeautifulSoup - problem when writing parsed text to file
On Thu, Oct 6, 2011 at 3:39 PM, Greg gregor.hochsch...@googlemail.com
wrote:

> Brilliant! It worked. Thanks!
>
> Here is the final code for those who are struggling with similar
> problems:
>
> ## open and decode file
> # In this case, the encoding comes from the charset argument in a meta tag
> # e.g. meta charset=iso-8859-2
> fileContent = fileObj.decode("iso-8859-2")
> f.write(text.encode('utf-8'))

In other words, when you decode correctly into Unicode and encode
correctly onto the disk, it works!

This is why encodings are so important :)

ChrisA