[Imdbpy-help] Fwd: imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Hi,

This fix worked for some months, but unfortunately a similar encoding error appears with the latest data files (16 Sep 2011).

I am using the latest DEV version in a virtualenv: IMDbPY==4.8dev-20110822 and Python 2.6. This configuration worked perfectly with the previous data files, so there must be some kind of garbage again in the actors and characters files.

Here is the full log for the error:

    SCANNING actor: Ribeiro, Freddy
    SCANNING actor: Richard, Darryl
    * FLUSHING SQLData...
    SCANNING actor: Richardson, Ian (I)
    SCANNING actor: Richter, Friedrich
    * FLUSHING SQLData...
    SCANNING actor: Riebisi, Romeo
    SCANNING actor: Rignault, Alexandre
    * FLUSHING CharactersCache...
    Traceback (most recent call last):
      File "./bin/imdbpy2sql.py", line 5, in <module>
        pkg_resources.run_script('IMDbPY==4.8dev-20110822', 'imdbpy2sql.py')
      File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg/pkg_resources.py", line 489, in run_script
      File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg/pkg_resources.py", line 1207, in run_script
      File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 2959, in <module>
        run()
      File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 2820, in run
        castLists(_charIDsList=characters_imdbIDs)
      File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 1584, in castLists
        doCast(f, roleid, rolename)
      File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 1543, in doCast
        cid = CACHE_CID.addUnique(role)
      File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 966, in addUnique
        else: return self.add(key, miscData)
      File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 959, in add
        self[key] = c
      File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 869, in __setitem__
        self.flush()
      File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 892, in flush
        self._toDB(quiet)
      File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 1194, in _toDB
        CURS.executemany(self.sqlstr, self.converter(l))
    psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320

Any ideas?

Thanks.

On Mon, May 2, 2011 at 9:32 PM, Davide Alberani davide.alber...@gmail.com wrote:
> On Mon, May 2, 2011 at 08:47, darklow dark...@gmail.com wrote:
>> Thank you for your patience and for guiding me through the tests; I am really glad we finally found the problem and fixed it.
>
> Yep, even if it took a little too long. :-)
>
>> Just curious: why did only I and one other user encounter this problem, while you didn't see the error when you ran the same tests? :)
>
> It may have something to do with the Python library used to connect to Postgres. Maybe some libraries handle this kind of error gracefully; I have to check more carefully the versions installed on my system and in the virtualenv I used to reproduce the bug.
> In fact the right thing to do in such cases is to raise an exception (as in our case); other databases, or the libraries used to connect to them, like MySQL, simply ignore these errors with a warning (not a great idea).
-- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/

Imdbpy-help mailing list Imdbpy-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-help
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Thanks, let me know if you have any ideas on how to fix the problem...

About virtualenv: I was also quite pedantic in ignoring the virtualenv solution. I am a programmer, not a system administrator, and I am not familiar with Python. I understand the code logic, but I haven't coded any application in it so far, just one test parser to diagnose the error. I looked at the virtualenv documentation and didn't understand how to use it. The problem is my limited knowledge of Python and its components; I think you have to be more familiar with Python, its libraries and the way they are installed and configured before installing and configuring virtualenv.

Also, our sysadmin is quite a pain in the a.. It is hard to prove the need for one new tool or another. If a tool has a stable Debian package, it is easier; for all other packages it is almost impossible. I am also not sure I want to intrude on the sysadmin's environment and do installs by myself, even if they don't require root access.

On Mon, Apr 25, 2011 at 1:19 AM, Davide Alberani davide.alber...@gmail.com wrote:
> On Sun, Apr 24, 2011 at 22:44, darklow dark...@gmail.com wrote:
>> Yes, I can confirm: script version 4.6 works perfectly on the same server with the same files. I think this brings us closer to a solution.
>
> Excellent! (well, it still baffles me why I'm absolutely unable to reproduce the problem on my system, but that's another story...)
>
>> Maybe this helps to identify the problem; this is what we did on our server. (Remember, we are doing this copying because only stable Debian versions are allowed on the server, but we need those md5 hashes from version 4.7.)
>
> I'll look at your setup tomorrow.
> I'll surely sound pedantic, but... seriously: why don't you use a virtualenv environment? It's easy to install and doesn't require root privileges.
-- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
There have never been any issues with our PostgreSQL database; we have always used UTF-8 and are using it this time too. I have tried plenty of scripts and workarounds so far, with many decode().encode() attempts, but nothing helps; I just get different errors.

I also tried adding the following lines, to be sure everything is fine with the connection to the database:

    import psycopg2
    import psycopg2.extensions
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
    import codecs
    sys.setdefaultencoding('utf-8')
    CURS.execute("SET NAMES 'utf8'")
    CURS.execute("SET CLIENT_ENCODING TO 'utf8'")

But still nothing helps. I tried reinstalling all the installed dependencies and running from clean sources, but no luck. I tried running the script with SQLAlchemy instead of SQLObject and got the same error, so the problem is not there.

I would like to ask you one thing. Every test takes about 1 hour, because the error occurs in the actors cast list. Can you please tell me the exact sequence of calls that converts a line from the file into a row for SQL? Then I could write a new script that takes a small version of actors.list with only the problematic lines, runs the few unicode() and decode() calls in the correct order, and tries to insert the lines into a test table in the database. That way I could test much faster and not wait 1 hour for every try...

What I have tried already is opening the actors.list file with PHP, reading every line, converting the string to UTF-8 with iconv and inserting it into the PostgreSQL database; everything worked fine. This makes me think the problem might be somewhere in the code that cuts the line into pieces: maybe it does something wrong, cuts some good Unicode character into pieces, and so an invalid byte sequence appears. If I had the correct function list for Python, I could run more tests.

PS. I am just running a test with version 4.6 to see if it still works; if it does, we could diagnose the problem more easily by looking at the file changes. I'll post the results.

Thank you.

On Sat, Apr 23, 2011 at 3:23 PM, Davide Alberani davide.alber...@gmail.com wrote:
> On Wed, Apr 20, 2011 at 14:08, darklow dark...@gmail.com wrote:
>> Still no luck :/ Maybe the problem is in some environment variables or settings that are present in the installed version but missing or incorrect when running from source?
>
> Seems unlikely to me.
>
>> What about this? I printed out some variables:
>>     print sys.stdout.encoding       -> UTF-8
>>     print sys.stdin.encoding        -> UTF-8
>>     print sys.getdefaultencoding()  -> ascii
>> Is it OK that sys.getdefaultencoding() == ascii?
>
> These are fine.
>
> I've reproduced, to the best of my capabilities, your environment:
> - no IMDbPY installed in the system.
> - IMDbPY from source (the latest version in the Mercurial repository), setting the PYTHONPATH environment variable to point to the source directory.
> - the cutils C module was not compiled.
> - the latest actors.list.gz file.
> - postgres 8.4; my database was created with these settings:
>
>     CREATE DATABASE imdb
>       WITH OWNER = postgres
>            ENCODING = 'UTF8'
>            TABLESPACE = pg_default
>            LC_COLLATE = 'it_IT.utf8'
>            LC_CTYPE = 'it_IT.utf8'
>            CONNECTION LIMIT = -1;
>
> I've run it with your portion and other portions of the actors.list.gz file, and everything went fine.
>
> Now... if I were you, I'd:
> - create a virtualenv environment with: virtualenv --no-site-packages
> - install IMDbPY in it, using easy_install or pip (the executable in your virtualenv, I mean) so that you'll have all the correct dependencies available.
> - run imdbpy2sql.py within your virtualenv.
>
> If it still fails:
> - check your postgres settings.
> - try using SQLite (just for a test); see the notes in README.sqldb
>
> HTH,
> --
> Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/
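A faster test loop of the kind requested above could start from something like the following sketch. It does not reproduce the conversion steps of imdbpy2sql.py; `find_invalid_utf8` is a hypothetical helper, and the sample rows merely stand in for whatever byte strings the steps under test produce from a small actors.list extract. It only checks, without touching the database, which rows a UTF-8 server would probably reject (Python 3 syntax).

```python
def find_invalid_utf8(rows):
    """Yield (index, row, offset) for every row that is not valid UTF-8."""
    for index, raw in enumerate(rows):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first offending byte.
            yield index, raw, exc.start


# Sample rows (assumed data, not taken from the real file).
rows = [
    b"Richardson, Ian (I)",
    b"Jir\xc3\xad",      # a complete two-byte character
    b"Jir\xc3 (I)",      # a lead byte followed by a space
]

for index, raw, offset in find_invalid_utf8(rows):
    print("row %d: bad byte at offset %d: %r" % (index, offset, raw))
```

Running this over a few hundred converted lines takes well under a second, so each conversion idea can be checked before committing to a full one-hour import.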
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Yes, I can confirm: script version 4.6 works perfectly on the same server with the same files. I think this brings us closer to a solution.

Maybe this helps to identify the problem; this is what we did on our server. (Remember, we are doing this copying because only stable Debian versions are allowed on the server, but we need those md5 hashes from version 4.7.)

1. We installed imdbpy 4.6 with all the dependencies (python-psycopg2, python-dns, python-formencode, python-pkg-resources, python-sqlobject).
2. I downloaded version 4.7 and overwrote the following directories with files from the 4.7 source:

       cp -r imdbpy4.7/docs/* /usr/share/doc/python-imdb/
       cp -r imdbpy4.7/imdb/* /usr/share/pyshared/imdb/

3. I ran imdbpy2sql.py from the version 4.7 source like before, and it fails with the invalid byte sequence error.
4. I copied the version 4.6 files back to the mentioned directories, and the import with version 4.6 works again.

Looking at the install log, I didn't see any other related files that I should overwrite, so the problem might be in the dependencies. Do you have any idea where the problem could be and what else we should overwrite or update so that v4.7 works?

Thank you.
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Still no luck :/

Maybe the problem is in some environment variables or settings that are present in the installed version but missing or incorrect when running from source?

What about this? I printed out some variables:

    print sys.stdout.encoding       -> UTF-8
    print sys.stdin.encoding        -> UTF-8
    print sys.getdefaultencoding()  -> ascii

Is it OK that sys.getdefaultencoding() == ascii? Maybe there are some more variables I should check?

On Tue, Apr 19, 2011 at 11:11 PM, Davide Alberani davide.alber...@gmail.com wrote:
> On Mon, Apr 18, 2011 at 09:30, Davide Alberani davide.alber...@gmail.com wrote:
>> Thanks for the file, I hope to look at it within a day or two.
>
> Ok: the file is correctly encoded in iso8859-1, as expected, and contains no garbage.
> Using it as the only input for imdbpy2sql.py (putting the attached file in a directory by itself), I can run the script with no errors (besides the expected warnings about missing files).
> I'm using the version from the Mercurial repository, without the cutils.so library.
>
> Please, if you can't install IMDbPY in your system, consider the use of virtualenv.
> Having tried that, I have to recommend that you double check the settings of your PostgreSQL server for any inconsistencies in encodings and collations.
>
> HTH,
> --
> Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Sun, Apr 17, 2011 at 5:13 PM, Davide Alberani davide.alber...@gmail.com wrote:
> On Sun, Apr 17, 2011 at 14:04, darklow dark...@gmail.com wrote:
>> Updated this morning to the latest data files: no change, and unfortunately this fix also doesn't work.
>
> Hmm... debugging a problem like this without being able to reproduce it is extremely difficult. :-/
>
>> This error started when we uninstalled imdbpy (leaving all the dependency libs) and started running it without installation. Maybe there is some kind of problem with hidden unicode dependencies? Maybe you can try to run it without installation, just from source?
>
> Have you some very good reason to do so? :-)

We have Debian Linux on our server and our sysadmin allows only stable packages. However, the latest version of imdbpy has these md5 checksums that are quite important in our situation; that is why I have to run it from source.

> Can't you try to purge every reference to IMDbPY left on the system (search for the scripts in /usr/bin/ and /usr/local/bin/ and be sure that import imdb fails at the python prompt) and see if the problem is solved after IMDbPY 4.7 is reinstalled?

Unfortunately, right now I can't do a reinstall, I can only run it from source. However, if this is the reason and there is no other way to fix it, I'll try to convince the sysadmin to install this version from unofficial Debian packages.

> If you have problems locating the IMDbPY package, just open the Python prompt and:
>
>     import imdb
>     print imdb
>
>> Also, every time I start the script I receive two warnings:
>>
>>     2011-04-17 11:13:37,398 WARNING [imdbpy.parser.sql.aux] /data/web/imdb/imdbpy4.7-159671/imdb/parser/sql/__init__.py:125: Unable to import the cutils.ratcliff function. Searching names and titles using the sql data access system will be slower.
>
> This will force IMDbPY to use some pure-python fall-back functions. It's entirely possible that there are some bugs in these functions, even if a run without cutils.so is running fine for me (so far).
>
>>     IMPORTING psyco... FAILED (not a big deal, everything is alright...)
>
> That's not a problem for sure.
>
> Right now, my first guess is that somewhere, after the *.list files are read and turned into utf-8 encoded strings, the imdbpy2sql.py script does Something Very Wrong(tm) to a string (like cutting it at a certain place, ending up cutting a single utf-8 encoded char in two: this could explain the error).
> I've tried the conversion suggested by Petite Abeille, and it works fine.
>
> Please, could you cut a small piece (a few kilobytes) of the actors.list file, and attach it (no cut-and-paste)? It goes without saying that you should choose a portion where you see (or guess there are) the strange chars :-)

I attached a small part of the actors.list file, right at the place with the broken characters (I used the unix sed command to cut the problematic lines out).

> Thanks!
> --
> Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/

Attachment: actors.list.small (binary data)
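The guess above, that a string is cut in the middle of a UTF-8 encoded character, can be illustrated with a short sketch. This is not code from imdbpy2sql.py: `truncate_bytes` and `truncate_text` are hypothetical helpers and the name is sample input, chosen only because it contains one two-byte character (Python 3 syntax).

```python
def truncate_bytes(text, limit):
    """Encode first, then cut: the cut may land inside a character."""
    return text.encode("utf-8")[:limit]


def truncate_text(text, limit):
    """Cut characters first, then encode: the result is always valid."""
    return text[:limit].encode("utf-8")


name = u"Jir\u00ed"   # sample input: 4 characters, 5 bytes in UTF-8

broken = truncate_bytes(name, 4)   # keeps the lead byte 0xC3, drops 0xAD
safe = truncate_text(name, 4)      # keeps the whole character

for label, raw in (("cut after encoding", broken),
                   ("cut before encoding", safe)):
    try:
        raw.decode("utf-8")
        print(label, "-> valid UTF-8")
    except UnicodeDecodeError as exc:
        print(label, "-> invalid:", exc.reason)
```

If a byte string truncated like `broken` is later joined to other text with a space, the result contains the bytes 0xC3 0x20, which a UTF-8 database cannot accept.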
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Unfortunately, adding this line

    k = k.replace('\xec\x8c\xa0', '')

in the place you mentioned doesn't help. Still the same error in the same place :(

    SCANNING actor: Havel, Jir?
    * FLUSHING CharactersCache...
    Traceback (most recent call last):
    ...
        self.flush()
      File "./imdbpy2sql.py", line 1195, in _toDB
        CURS.executemany(self.sqlstr, self.converter(l))
    psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320

On Wed, Apr 13, 2011 at 11:56 PM, Davide Alberani davide.alber...@gmail.com wrote:
> On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:
>>   File "./imdbpy2sql.py", line 1194, in _toDB
>>     CURS.executemany(self.sqlstr, self.converter(l))
>> psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
>> HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by client_encoding.
>
> Hi all,
> I'm writing regarding the recent 0xc320 problem with IMDbPY.
>
> The above notice is extremely interesting, and should be investigated: how can it be that 0xc320 is not UTF8 encodable? It should work; from the Python prompt:
>
>     >>> unichr(0xc320).encode('utf8')
>     '\xec\x8c\xa0'
>
> Anyway, as a very fast and dirty fix (the main problem is probably some crap in the data files), try this: after line 1181 of imdbpy2sql.py, add:
>
>     k = k.replace('\xec\x8c\xa0', '')
>
> So that the nearby lines will become:
>
>     try:
>         k = k.replace('\xec\x8c\xa0', '')
>         t = analyze_name(k)
>     except IMDbParserError:
>
> Please be aware that this fix was not tested at all, but I'm almost sure that, at the above point, 'k' is a string encoded in utf8.
>
> Anyway, besides the garbage theory, I have another idea about the source of the error, but I have to verify it later...
>
> Bye, and let me know if it works!
-- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/
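One possible reading of the error message, offered here as an assumption and not as a confirmed diagnosis: the server may be printing the two offending bytes run together, so "0xc320" would mean the byte 0xC3 followed by the byte 0x20, not the Unicode code point U+C320. Under that reading the replace() above cannot match, because it looks for a different three-byte sequence. The sketch below shows the difference; `is_valid_utf8` is a hypothetical helper (Python 3 syntax).

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The code point U+C320 encodes to three bytes, and they are valid UTF-8.
code_point_bytes = u"\uc320".encode("utf-8")

# The raw pair 0xC3 0x20 is something else: 0xC3 opens a two-byte
# sequence, but 0x20 (a space) is not a continuation byte.
raw_pair = b"\xc3\x20"

print(code_point_bytes == b"\xec\x8c\xa0")
print(is_valid_utf8(code_point_bytes))
print(is_valid_utf8(raw_pair))

# Removing the three-byte sequence leaves the raw pair untouched.
print(raw_pair.replace(b"\xec\x8c\xa0", b"") == raw_pair)
```

A lead byte followed by a space is what one would expect if an accented character at the end of a name lost its second byte, which fits the "Havel, Jir?" line in the log.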
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Since I am not familiar with Python, maybe you could suggest some fast fix so that the script doesn't stop?

Maybe this helps: in PHP we get exactly the same encoding error when importing wrongly decoded data. When we have no control over the data, we can't simply call utf8_encode every time, since it could encode a string twice. To bypass the error I use this function, which at least prevents the PostgreSQL error:

    function fix_encoding($in_str) {
        $cur_encoding = mb_detect_encoding($in_str);
        if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8")) {
            return $in_str;
        } else {
            return utf8_encode($in_str);
        }
    }

Maybe you can help to adapt this function to Python, if similar functions are available, so we can use it as a quick fix?

Thanks a lot.

On Mon, Apr 11, 2011 at 10:46 PM, Davide Alberani davide.alber...@gmail.com wrote:
> On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:
>>   File "./imdbpy2sql.py", line 1194, in _toDB
>>     CURS.executemany(self.sqlstr, self.converter(l))
>> psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
>> HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by client_encoding.
>>
>> Any suggestions? I found a similar topic, but there were no solutions there either.
>
> Yes, I've had other reports about this bug.
> Seems to be related to some garbage in the actors.list.gz file.
>
> I hope to have time to investigate the problem within a week or two.
>
> Thanks for the bug report!
> --
> Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/
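A rough Python counterpart of the PHP function above might look like the sketch below. It is only an illustration: this `fix_encoding` is not part of IMDbPY, it works on byte strings, and it assumes that anything which is not valid UTF-8 is ISO-8859-1, which is also what PHP's utf8_encode assumes.

```python
def fix_encoding(raw, fallback="iso-8859-1"):
    """Return `raw` as valid UTF-8 bytes.

    Input that already decodes as UTF-8 is returned unchanged; anything
    else is treated as `fallback`-encoded text and re-encoded.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


print(fix_encoding(b"Jir\xc3\xad"))   # already UTF-8: returned unchanged
print(fix_encoding(b"Jir\xed"))       # Latin-1 input: re-encoded
print(fix_encoding(b"Jir\xc3 "))      # truncated sequence: see note below
```

Like the PHP version, this only keeps the database from rejecting the row. A truncated UTF-8 sequence is not repaired: its stray lead byte is re-encoded as if it were a Latin-1 character, so the stored name still contains a wrong character.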
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Maybe someone knows a fast, dirty fix, at least for skipping strings with such invalid byte sequences while there is no official fix, so I can finish the import?

Can we detect invalid byte characters? Maybe we can somehow replace or get rid of the 0xc320 character, which is the one that appears most often, or skip these rows.

I analyzed the error a bit more. Mostly these errors occur with Japanese actors (actors.list), where strange characters appear in the filmography:

    Hayakawa, Yuzo Burai hij*8)* * *

I tried to delete these rows manually, but there are too many of them :/

Thank you.