Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Mon, May 2, 2011 at 08:47, darklow dark...@gmail.com wrote:

> Thank you for your patience and for guiding me through the tests; I'm really glad we finally found the problem and fixed it.

Yep, even if it took a little too long. :-)

> Just curious: why did only one other user and I encounter this problem, while you didn't see the error when you ran the same tests? :)

It may have something to do with the Python library used to connect to Postgres. Maybe some libraries handle this kind of error gracefully; I have to check more carefully the versions installed on my system and in the virtualenv I used to reproduce the bug.

In fact, the right thing to do in such cases is to raise an exception (as in our case); other databases - or libraries used to connect to databases - like MySQL simply ignore these errors with a warning (not a great idea).

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

--
WhatsUp Gold - Download Free Network Management Software
The most intuitive, comprehensive, and cost-effective network management toolset available today. Delivers lowest initial acquisition cost and overall TCO of any competing solution.
http://p.sf.net/sfu/whatsupgold-sd
___
Imdbpy-help mailing list
Imdbpy-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-help
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Thu, Apr 28, 2011 at 22:52, darklow dark...@gmail.com wrote:

> However, the last command, pip install IMDbPY, didn't succeed so well. It looks like I got exactly the same error that another user reported some days ago in the same discussion, and he also has the UTF-8 encoding problem:

Sure: you don't have the python-dev package installed on your system. :-/
A per-user installation is possible, but a little tricky...

> By running python setup.py install I receive the same error. I also tried the latest version (4.8dev20110425) but got the same error.

Using the sources of the latest version, run (after you've activated your virtualenv!):

    python setup.py install --without-cutils

> Maybe this explains why the script doesn't handle UTF-8 in the first place: some strange incompatibilities with cutils.c

I've run some tests without the compiled C module, so I think this is not the cause, but at this point... who knows. :-)

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Thanks, let me know if you have any ideas on how to fix the problem...

About virtualenv: I have also been quite pedantic about ignoring the virtualenv solution. I am a programmer, not a system administrator, and I am not familiar with Python. I understand the code logic, but I haven't written any Python application so far, just one test parser to diagnose the error. I looked at the virtualenv documentation and didn't understand how to use it. The problem is my limited knowledge of Python and its components; I think you have to be more familiar with Python, its libraries, and the way they are installed and configured before installing and configuring virtualenv.

Also, our sysadmin is quite a pain in the a.. It is hard to justify the need to install one new tool or another. If it has a stable Debian package, it is easier; for all other packages, it is almost impossible. I am also not sure I want to intrude on the sysadmin's environment and do installs by myself, even if they don't require root access.

On Mon, Apr 25, 2011 at 1:19 AM, Davide Alberani davide.alber...@gmail.com wrote:

> On Sun, Apr 24, 2011 at 22:44, darklow dark...@gmail.com wrote:
>
>> Yes, I can confirm: version 4.6 of the script works perfectly on the same server with the same files. I think this brings us closer to a solution.
>
> Excellent! (well, it still baffles me why I'm absolutely unable to reproduce the problem on my system, but that's another story...)
>
>> Maybe this helps to identify the problem; this is what we did on our server. (Remember, we are doing this copying because only stable Debian versions are allowed on the server, but we need the md5 hashes from version 4.7.)
>
> I'll look at your setup tomorrow. I'll surely sound pedantic, but... seriously: why don't you use a virtualenv environment? It's easy to install and doesn't require root privileges.
>
> --
> Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Tue, Apr 26, 2011 at 09:36, darklow dark...@gmail.com wrote:

> Thanks, let me know if you have any ideas, how to fix the problem...

Eh... As usual, right now I'm really busy. :-(

> I looked at virtualenv documentation, i didn't understand how to use it,

Ok, let's try:
- download virtualenv from http://pypi.python.org/pypi/virtualenv#downloads
- tar xvfz virtualenv-1.6.tar.gz
- cd virtualenv-1.6
- python virtualenv.py --no-site-packages ~/myvenv
- cd ~/myvenv
- . ./bin/activate # notice the initial dot
- pip install formencode # bug with the dependencies. :(
- pip install IMDbPY # or download from the Mercurial repository and run 'python setup.py install'

The most important step is the activation of the virtualenv: your prompt should change to something like (myvenv)$ to denote that your virtualenv is active.

Now, always from inside the virtualenv, you can run the imdbpy2sql.py script: everything was installed locally to your ~/myvenv/ directory (the local python interpreter is in ~/myvenv/bin/python).

If you need to deactivate the virtualenv, simply run the deactivate command.

HTH,
--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
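As a complement to checking the prompt, the activation can also be verified from inside Python. The following is only a minimal sketch, written for Python 3; the function name in_virtualenv is made up for illustration and is not part of virtualenv or IMDbPY. It assumes the usual markers: sys.real_prefix (set by classic virtualenv releases) or a sys.base_prefix that differs from sys.prefix.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter appears to run inside a virtual environment."""
    # Classic virtualenv releases set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # The stdlib venv module and newer virtualenv releases make
    # sys.base_prefix differ from sys.prefix instead.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtualenv:", in_virtualenv())
```

Run with the interpreter in ~/myvenv/bin/ after activation, this should report True; run with the system interpreter, it should report False.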
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
There have never been any issues with our PostgreSQL database; we have always used UTF-8 and are using it this time too. I have tried plenty of scripts and workarounds so far, including many decode().encode() attempts, but nothing helps; I just get different errors from them. I also tried adding the following lines, to be sure everything is fine with the connection to the database:

    import psycopg2
    import psycopg2.extensions
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
    import codecs
    sys.setdefaultencoding('utf-8')
    CURS.execute("SET NAMES 'utf8'")
    CURS.execute("SET CLIENT_ENCODING TO 'utf8'")

But still nothing helps.

I tried reinstalling all the installed dependencies and running from clean sources, but no luck. I tried to run the script with SQLAlchemy instead of SQLObject, but got the same error, so the problem is not there.

I would like to ask you one thing. Every test takes about 1h, because the error takes place in the actors cast list. Can you please tell me the exact list of commands that convert lines from the file into SQL? Then I could create a new script that takes a small version of actors.list with only the problematic lines, runs a few unicode() and decode() calls in the correct order, and tries to insert these lines into some test table in the database. That way I could test much faster and not wait 1 hour for every try...

What I have tried already is to open the actors.list file with PHP, read every line, convert the string to UTF-8 using iconv, and insert it into the PostgreSQL database; everything worked fine. It makes me think that the problem might be somewhere in cutting the line into pieces: maybe it does something wrong, cuts some good Unicode character into pieces, and so an invalid byte sequence appears. If I had the correct list of Python functions, I could run more tests.

PS. I'm just running a test with version 4.6, to see if it still works; then we could diagnose more easily by looking at the file changes. I'll post the results.

Thank you.
On Sat, Apr 23, 2011 at 3:23 PM, Davide Alberani davide.alber...@gmail.comwrote: On Wed, Apr 20, 2011 at 14:08, darklow dark...@gmail.com wrote: Still no luck :/ maybe the problem is in some environmental variables or settings, which on installed version are present, but running from source are missing or incorrect? Seems unlikely to me. What about this, i printed out some variables: print sys.stdout.encoding - UTF-8 print sys.stdin.encoding - UTF-8 print sys.getdefaultencoding(); - ascii Is it ok that sys.getdefaultencoding(); == ascii ? These are fine. I've reproduced - at the best of my capabilities - your environment: - no IMDbPY installed in the system. - IMDbPY from source (the latest version in the Mercurial repository), setting the PYTHONPATH environment variable to point to the source directory. - the cutils C module was not compiled. - the last actors.list.gz file. - postgres 8.4; my database was created with these settings: CREATE DATABASE imdb WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default LC_COLLATE = 'it_IT.utf8' LC_CTYPE = 'it_IT.utf8' CONNECTION LIMIT = -1; I've run it with your and other portions of the actors.list.gz file, and everything went fine. Now... if I were you, I'd: - create a virtualenv environment with: virtualenv --no-site-packages - install in it IMDbPY, using easy_install or pip (the executable in your virtualenv, I mean) so that you'll have all the correct dependecies available. - run the imdbpy2sql.py within your virtualenv. If it still fails: - check your postgres settings. - try using SQLite (just for a test) - see notes in README.sqldb HTH, -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Fulfilling the Lean Software Promise Lean software platforms are now widely adopted and the benefits have been demonstrated beyond question. Learn why your peers are replacing JEE containers with lightweight application servers - and what you can gain from the move. 
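For the faster test loop requested above, one option that needs no database at all is to check candidate byte strings for UTF-8 validity before they are sent to PostgreSQL. This is only a sketch under assumptions: it is written for Python 3, the helper name find_invalid_utf8 is made up, and it does not reproduce what imdbpy2sql.py does internally.

```python
def find_invalid_utf8(lines):
    """Return (line_number, raw_bytes) pairs for lines that are not valid UTF-8."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, raw))
    return bad


sample = [
    "Dai akut\u00f4 (1968)".encode("utf-8"),       # valid UTF-8
    "Dai akut\u00f4 (1968)".encode("iso-8859-1"),  # Latin-1 bytes, invalid as UTF-8
]
print(find_invalid_utf8(sample))
```

Any byte string that this flags would presumably be rejected by a UTF-8 database with the same "invalid byte sequence" error, so it narrows down the offending lines in seconds rather than an hour.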
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Yes, I can confirm: version 4.6 of the script works perfectly on the same server with the same files. I think this brings us closer to a solution.

Maybe this helps to identify the problem; this is what we did on our server. (Remember, we are doing this copying because only stable Debian versions are allowed on the server, but we need the md5 hashes from version 4.7.)

1. We installed imdbpy 4.6 with all the dependencies (python-psycopg2, python-dns, python-formencode, python-pkg-resources, python-sqlobject).
2. I downloaded version 4.7 and overwrote the following directories with files from the 4.7 source:
   cp -r imdbpy4.7/docs/* /usr/share/doc/python-imdb/
   cp -r imdbpy4.7/imdb/* /usr/share/pyshared/imdb/
3. I then ran imdbpy2sql.py from the 4.7 source as before, and it fails with "invalid byte sequence".
4. I copied the 4.6 files back to the mentioned directories, and the import with version 4.6 works again.

Looking at the install log, I didn't see any other related files that I should overwrite. So the problem might be in the dependencies. Do you have any idea where the problem could be, and what else we should overwrite or update so that v4.7 works?

Thank you.
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Sun, Apr 24, 2011 at 20:03, Thomas Stewart tho...@stewarts.org.uk wrote:

> I've just had a try using sqlite with fresh lists on my Debian system, and I get this:
>
> thomas@ikaite:~$ /tmp/imdbpy2sql.py -d /home/thomas/Desktop/imdb/lists -u sqlite:///home/thomas/Desktop/imdb/imdb.db --sqlite-transactions
> IMPORTING psyco... DONE!
> [...]
> CURS.executemany(self.sqlstr, self.converter(dataList))
> pysqlite2.dbapi2.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

This specific bug (a bad interaction between SQLObject and SQLite) should be fixed in the version in the Mercurial repository, shouldn't it?

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
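The error message above recommends passing Unicode strings rather than 8-bit byte strings. The sketch below illustrates that general idea with the standard-library sqlite3 module on Python 3; it is not the fix that went into the Mercurial repository, and the table name and the insert_name helper are hypothetical.

```python
import sqlite3


def insert_name(conn, raw_name, encoding="iso-8859-1"):
    """Decode raw bytes to text, store the text, and return the stored value."""
    text = raw_name.decode(encoding)  # bytes -> Unicode text before touching the DB
    conn.execute("INSERT INTO names (name) VALUES (?)", (text,))
    row = conn.execute(
        "SELECT name FROM names ORDER BY rowid DESC LIMIT 1"
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT)")
stored = insert_name(conn, b"Hayakawa, Y\xfbz\xf4")  # ISO-8859-1 bytes
print(repr(stored))
```

Decoding at the boundary means the database driver only ever sees text, which is the situation the pysqlite2 message asks for.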
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Sun, Apr 24, 2011 at 21:03, darklow dark...@gmail.com wrote:

> I tried reinstalling all the installed dependencies and running from clean sources, but no luck. I tried to run the script with SQLAlchemy instead of SQLObject, but got the same error, so the problem is not there.

Perfect - these tests are really important to spot the problem.

> Every test takes about 1h, because the error takes place in the actors cast list.

Wait: I'll read the rest of your mails tomorrow, but this can help you to do things faster: you don't need the other files at all. Simply put the actors.list.gz file in a directory by itself, and run imdbpy2sql.py with this directory as the -d argument.

You can even use a shorter version of actors.list.gz; just remember to leave the lines at the beginning and at the end (various separators are used to identify where the data begin), like I did with the actors.list.gz file that I attached some days ago.

In the 'docs/goodies' directory you'll find the 'reduce.sh' script, which takes a whole directory of *.list.gz files and reduces them to 1% of their length.

> It makes me think that the problem might be somewhere in cutting the line into pieces: maybe it does something wrong, cuts some good Unicode character into pieces, and so an invalid byte sequence appears.

My guess, too... it's just that I can't see where it happens... :-/

Thanks for your tests!

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
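The idea of shortening a list file while leaving its first and last lines intact can be sketched as follows. This is an illustration only, not how reduce.sh works: the function name shrink_lines and the head, tail and keep_every values are hypothetical, and the real number of header and footer lines to preserve has to be read off the actual file. Note also that naive sampling like this may cut a multi-line record apart.

```python
def shrink_lines(lines, head=10, tail=10, keep_every=100):
    """Keep the first `head` and last `tail` lines, plus every
    `keep_every`-th line of what lies in between."""
    if len(lines) <= head + tail:
        return list(lines)
    end = len(lines) - tail
    middle = lines[head:end]
    return lines[:head] + middle[::keep_every] + lines[end:]


demo = ["line %d" % i for i in range(1000)]
small = shrink_lines(demo, head=5, tail=5, keep_every=100)
print(len(small), small[0], small[-1])
```

For a real test, the lines would come from gzip.open() on actors.list.gz and the result would be written back to a gzip file placed in a directory by itself.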
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Sun, Apr 24, 2011 at 22:44, darklow dark...@gmail.com wrote:

> Yes, I can confirm: version 4.6 of the script works perfectly on the same server with the same files. I think this brings us closer to a solution.

Excellent! (well, it still baffles me why I'm absolutely unable to reproduce the problem on my system, but that's another story...)

> Maybe this helps to identify the problem; this is what we did on our server. (Remember, we are doing this copying because only stable Debian versions are allowed on the server, but we need the md5 hashes from version 4.7.)

I'll look at your setup tomorrow. I'll surely sound pedantic, but... seriously: why don't you use a virtualenv environment? It's easy to install and doesn't require root privileges.

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Wed, Apr 20, 2011 at 14:08, darklow dark...@gmail.com wrote:

> Still no luck :/ Maybe the problem is in some environment variables or settings which are present in the installed version, but missing or incorrect when running from source?

Seems unlikely to me.

> What about this? I printed out some variables:
> print sys.stdout.encoding - UTF-8
> print sys.stdin.encoding - UTF-8
> print sys.getdefaultencoding(); - ascii
> Is it ok that sys.getdefaultencoding(); == ascii ?

These are fine.

I've reproduced - to the best of my abilities - your environment:
- no IMDbPY installed in the system.
- IMDbPY from source (the latest version in the Mercurial repository), setting the PYTHONPATH environment variable to point to the source directory.
- the cutils C module was not compiled.
- the latest actors.list.gz file.
- postgres 8.4; my database was created with these settings:

    CREATE DATABASE imdb
      WITH OWNER = postgres
           ENCODING = 'UTF8'
           TABLESPACE = pg_default
           LC_COLLATE = 'it_IT.utf8'
           LC_CTYPE = 'it_IT.utf8'
           CONNECTION LIMIT = -1;

I've run it with your portion and with other portions of the actors.list.gz file, and everything went fine.

Now... if I were you, I'd:
- create a virtualenv environment with: virtualenv --no-site-packages
- install IMDbPY in it, using easy_install or pip (the executables in your virtualenv, I mean) so that you'll have all the correct dependencies available.
- run imdbpy2sql.py within your virtualenv.

If it still fails:
- check your postgres settings.
- try using SQLite (just for a test) - see notes in README.sqldb

HTH,
--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Still no luck :/ Maybe the problem is in some environment variables or settings which are present in the installed version, but missing or incorrect when running from source?

What about this? I printed out some variables:

    print sys.stdout.encoding - UTF-8
    print sys.stdin.encoding - UTF-8
    print sys.getdefaultencoding(); - ascii

Is it ok that sys.getdefaultencoding(); == ascii ? Maybe there are some more variables I should check?

On Tue, Apr 19, 2011 at 11:11 PM, Davide Alberani davide.alber...@gmail.com wrote:

> On Mon, Apr 18, 2011 at 09:30, Davide Alberani davide.alber...@gmail.com wrote:
>
>> Thanks for the file, I hope to look at it within a day or two.
>
> Ok: the file is correctly encoded in iso8859-1, as expected, and contains no garbage. Using it as the only input for imdbpy2sql.py (putting the attached file in a directory by itself), I can run the script with no errors (besides the expected warnings about missing files).
>
> I'm using the version from the Mercurial repository, without the cutils.so library. Please, if you can't install IMDbPY in your system, consider the use of virtualenv.
>
> Having tried that, I have to recommend that you double-check the settings of your PostgreSQL server for any inconsistencies in encodings and collations.
>
> HTH,
> --
> Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/
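For anyone wanting to collect a few more encoding-related values in one go, a small sketch follows. It is written for Python 3 (where the default encoding is reported differently from the Python 2 output quoted above), the function name encoding_report is made up, and nothing in the thread establishes that any of these values is related to the bug.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python reports for the current process."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdin": getattr(sys.stdin, "encoding", None),
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }


for name, value in sorted(encoding_report().items()):
    print("%-10s %s" % (name, value))
```

Comparing this output between the system interpreter and a virtualenv interpreter is one cheap way to rule out environment differences.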
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Mon, Apr 18, 2011 at 09:30, Davide Alberani davide.alber...@gmail.com wrote:

> Thanks for the file, I hope to look at it within a day or two.

Ok: the file is correctly encoded in iso8859-1, as expected, and contains no garbage. Using it as the only input for imdbpy2sql.py (putting the attached file in a directory by itself), I can run the script with no errors (besides the expected warnings about missing files).

I'm using the version from the Mercurial repository, without the cutils.so library. Please, if you can't install IMDbPY in your system, consider the use of virtualenv.

Having tried that, I have to recommend that you double-check the settings of your PostgreSQL server for any inconsistencies in encodings and collations.

HTH,
--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

actors.list.gz
Description: GNU Zip compressed data
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Sun, Apr 17, 2011 at 5:13 PM, Davide Alberani davide.alber...@gmail.com wrote:

> On Sun, Apr 17, 2011 at 14:04, darklow dark...@gmail.com wrote:
>
>> Updated this morning to the latest data files; no change, and unfortunately this fix also doesn't work.
>
> Hmm... debugging a problem like this without being able to reproduce it is extremely difficult. :-/
>
>> This error started when we uninstalled imdbpy (leaving all the dependency libs) and started running it without installation. Maybe there is some kind of problem with hidden unicode dependencies? Maybe you can try to run it without installation, just from source?
>
> Have you some very good reason to do so? :-)

We have Debian Linux on our server and our sysadmin allows only stable packages. However, the latest version of imdbpy has the md5 checksums that are quite important in our situation; that is why I have to run it from source.

> Can't you try to purge every reference to IMDbPY left on the system (search for the scripts in /usr/bin/ and /usr/local/bin/ and be sure that "import imdb" fails at the python prompt) and see if the problem is solved after IMDbPY 4.7 is reinstalled?

Unfortunately, right now I can't do a reinstall, only run it from source. However, if this is the reason and there is no other way to fix it, I'll try to convince the sysadmin to install this version from unofficial Debian packages.

> If you have problems locating the IMDbPY package, just open the Python prompt and:
> import imdb
> print imdb
>
>> Also, every time I start the script I receive two warnings:
>> 2011-04-17 11:13:37,398 WARNING [imdbpy.parser.sql.aux] /data/web/imdb/imdbpy4.7-159671/imdb/parser/sql/__init__.py:125: Unable to import the cutils.ratcliff function. Searching names and titles using the sql data access system will be slower.
>
> This will force IMDbPY to use some pure-python fall-back functions. It's entirely possible that there are some bugs in these functions, even if a run without cutils.so works fine for me (so far).
>
>> IMPORTING psyco... FAILED (not a big deal, everything is alright...)
>
> That's not a problem for sure.
>
> Right now, my first guess is that somewhere, after the *.list files are read and turned into utf-8 encoded strings, the imdbpy2sql.py script does Something Very Wrong(tm) to a string (like cutting it at a certain place, ending up cutting a single utf-8 encoded char in two: this could explain the error).
>
>> I've tried the conversion suggested by Petite Abeille, and it works fine.
>
> Please, could you cut a small piece (a few kilobytes) of the actors.list file, and attach it (no cut-and-paste)? It goes without saying that you should choose a portion where you see (or guess there are) the strange chars :-)

I attached a small part of the actors.list file, right at the place with the broken characters (I used the unix sed command to cut the problematic lines out).

> Thanks!
>
> --
> Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/

actors.list.small
Description: Binary data
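The guess discussed above, that a string is cut in the middle of a multi-byte UTF-8 character, is easy to demonstrate in isolation. The following Python 3 sketch only illustrates the failure mode; it does not show where (or whether) imdbpy2sql.py actually does this, and the helper name is_valid_utf8 is made up.

```python
text = "Dai akut\u00f4"         # "Dai akutô", a title from the filmography in this thread
encoded = text.encode("utf-8")  # the final "ô" becomes two bytes: 0xC3 0xB4
truncated = encoded[:-1]        # cut in the middle of that last character


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(encoded))
print(is_valid_utf8(truncated))
```

A byte string like the truncated one, ending in a lone lead byte, is exactly the kind of input a UTF-8 database is entitled to reject as an invalid byte sequence. Truncating the decoded text instead of the encoded bytes avoids the problem.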
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Mon, Apr 18, 2011 at 08:53, darklow dark...@gmail.com wrote:

> We have Debian linux on our server and our sysadmin allows only stable packs. However latest version of imdbpy has these md5 checksum that are quite important in our situation, that is why i have to run it from source.

Ehhh... what about a virtual machine or - even easier - virtualenv [0]

Thanks for the file, I hope to look at it within a day or two.

+++
[0] http://pypi.python.org/pypi/virtualenv

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Apr 13, 2011, at 8:46 AM, darklow wrote:

Analyzed the error a bit more. Mostly these errors occur in Japanese actors (actors.list); strange characters appear in the filmography:

Sounds like a character set encoding issue.

Originally, something like actors.list is ISO-8859-1 encoded. IMDbPY converts it to UTF-8 internally:

http://imdbpy.sourceforge.net/docs/README.utf8.txt

You can check if actors.list is properly encoded by converting it to UTF-8 outside of IMDbPY. For example, using iconv:

    iconv -f ISO-8859-1 -t UTF-8 actors.list > actors.list.txt

This should result in a proper UTF-8 encoded file. If anything goes wrong, iconv should point out the issue.

For example, the entries for Hayakawa, Yuzo should look like the following:

    A, zerosen (1965) [Tokunaga]
    Abunai Deka ritaanzu (1996)
    Akumyo ichidai (1967)
    Aru joshi kôkôi no kiroku: shisshin (1969)
    Chijin no ai (1967) [Namikawa]
    Dai akutô (1968)
    Daikaijû kettô: Gamera tai Barugon (1966) [Kawajiri] 3
    Dorodarake no junjô (1977) [Det. Seki]
    Furin (1965) [Saruoka] 6
    Genkai yûkyôden: Yabure kabure (1970) [Yanagawa]
    Haru kôrô no hana no en (1958) [Sata]
    Hiroshima (1995) (TV) [Koshiro Oikawa] 70
    Jet F-104 dassyutsu seyo (1968)
    Kaidan otoshiana (1968) [Sakabe]
    Kawaki (1958) 4
    Kimimachi-bune (1954) (as Yûji Hayakawa) [Tomii]
    Konki (1961)
    Malenkiy beglets (1966)
    Mi wa jukushitari (1959) [Chef at Mizumi]
    Mushukunin Mikogami no Jôkichi: Kiba wa hikisaita (1972) 9
    Nagasugita haru (1957) [Student]
    Nihonkai daikaisen: Umi yukaba (1983) [Kataoka]
    Nippon chinbotsu (1974) [SDF General]
    Nobi (1959) (as Yuji Hayakawa)
    Obi o toku Natsuko (1965) [Kwashima] 6
    Okoto to Sasuke (1961) (as Yûzô Hayakawa)
    Onna ga aishite nikumu toki (1963) [Iwashita]
    Onna tobakushi (1967)
    Rikugun Nakano gakko (1966) [Colonel Iwakura] 6
    Rikugun Nakano gakko: Ryu-sango shirei (1967)
    Sakura no ki no shita de (1989)
    Salary man donto bushi - Kiraku na kagyô to kita monda (1962) (as Yûzô Hayakawa) [Shibayama]
    Satsujinsha (1966)
    Seisaku no tsuma (1965) [Sergeant]
    Sekkusu chekku: Daini no sei (1968) [Sasanuma] 5
    Shiroi Kyotou (1966) 14
    Shuntou (1989) (TV) 15
    Tokyo no josei (1960)
    Tokyo onigiri musume (1961) (as Yûzô Hayakawa) [Draper]
    Uchu kaijû Gamera (1980) [Policeman]
    Yoru no wana (1967) [Fumikichi Hayashi]
    Zatôichi rôyaburi (1967)
    Zoku sex doctor no kiroku (1968) 5
    Kôya no surônin (1972) 11
    Sukeban Deka (1985) {Nerawareta atakkâ (#1.10)} (as Yûzô Hayakawa) 14
    Zoku zoku jiken: Tsuki no keshiki (1980) {(#1.2)} [Dr. Arai] 11

There shouldn't be any strange characters in sight :)
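For anyone without iconv at hand, the same ISO-8859-1 to UTF-8 conversion can be sketched in Python. This is only an illustration, assuming Python 3; the function names latin1_bytes_to_utf8 and latin1_file_to_utf8 are made up for this example and are not part of IMDbPY.

```python
import os
import tempfile


def latin1_bytes_to_utf8(data):
    """Re-encode a byte string from ISO-8859-1 to UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def latin1_file_to_utf8(src_path, dst_path, chunk_size=1 << 20):
    """Convert a whole file chunk by chunk.

    ISO-8859-1 is a single-byte encoding, so a chunk boundary can never
    fall in the middle of a character.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(latin1_bytes_to_utf8(chunk))


if __name__ == "__main__":
    # Small self-contained demo on a temporary file (hypothetical sample data).
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "sample.list")
    dst = os.path.join(workdir, "sample.list.txt")
    with open(src, "wb") as handle:
        handle.write(b"Hayakawa, Y\xfbz\xf4\n")
    latin1_file_to_utf8(src, dst)
    with open(dst, "rb") as handle:
        print(handle.read().decode("utf-8"))
```

One caveat: Python's ISO-8859-1 codec accepts every possible byte value, so a conversion that finishes without an exception does not prove the input really was ISO-8859-1. Looking at the accented names in the output remains the real check.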
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Sun, Apr 17, 2011 at 14:04, darklow dark...@gmail.com wrote:

Updated this morning to the latest data files, no change, and unfortunately this fix also doesn't work.

Hmm... debugging a problem like this without being able to reproduce it is extremely difficult. :-/

This error started when we uninstalled imdbpy (left all the dependency libs) and started to run it without installation. Maybe there is some kind of problem with some hidden unicode dependencies? Maybe you can try to run without installation, just from source?

Do you have some very good reason to do so? :-)
Can't you try to purge every reference to IMDbPY left on the system (search for the scripts in /usr/bin/ and /usr/local/bin/ and be sure that "import imdb" fails at the Python prompt) and see if the problem is solved after IMDbPY 4.7 is reinstalled?
If you have problems locating the IMDbPY package, just open the Python prompt and:

    import imdb
    print imdb

Also every time I start the script I receive two warnings:

2011-04-17 11:13:37,398 WARNING [imdbpy.parser.sql.aux] /data/web/imdb/imdbpy4.7-159671/imdb/parser/sql/__init__.py:125: Unable to import the cutils.ratcliff function. Searching names and titles using the sql data access system will be slower.

This will force IMDbPY to use some pure-Python fall-back functions. It's entirely possible that there is some bug in these functions, even if a run without cutils.so is going fine for me (so far).

IMPORTING psyco... FAILED (not a big deal, everything is alright...)

That's not a problem for sure.

Right now, my first guess is that somewhere, after the *.list files are read and turned into utf-8 encoded strings, the imdbpy2sql.py script does Something Very Wrong(tm) to a string (like cutting it at a certain place, ending up cutting a single utf-8 encoded char in two: this could explain the error). I've tried the conversion suggested by Petite Abeille, and it works fine.

Please, could you cut a small piece (a few kilobytes) of the actors.list file and attach it (no cut-and-paste)? It goes without saying that you should choose a portion where you see (or guess there are) the strange chars :-)

Thanks!

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
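The "cutting a single utf-8 encoded char in two" guess can be illustrated with a short sketch. It assumes Python 3, and both function names are hypothetical; it only shows how such a cut would produce the byte pair C3 20, and says nothing about where (or whether) imdbpy2sql.py actually does this.

```python
def truncate_bytes_naive(text, max_bytes):
    """Cut the UTF-8 encoding at a byte offset; may split a character."""
    return text.encode("utf-8")[:max_bytes]


def truncate_bytes_safe(text, max_bytes):
    """Cut at a byte offset, then drop an incomplete trailing character.

    The input is valid UTF-8 before the cut, so the only bytes that
    errors="ignore" can discard are the leftovers of the split character.
    """
    cut = text.encode("utf-8")[:max_bytes]
    return cut.decode("utf-8", errors="ignore").encode("utf-8")


if __name__ == "__main__":
    name = "Jir\u00ed"  # the final letter encodes to the two bytes C3 AD
    broken = truncate_bytes_naive(name, 4) + b" "
    print(broken)  # b'Jir\xc3 '
    print(truncate_bytes_safe(name, 4))  # b'Jir'
```

In the naive version the lead byte C3 survives without its continuation byte, and whatever follows it (here a space, 0x20) makes the sequence invalid.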
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Wed, Apr 13, 2011 at 08:46, darklow dark...@gmail.com wrote:

Maybe someone knows some fast dirty fix, at least how to skip such invalid byte sequence strings while there is no official fix, so I can finish the import? Can we detect invalid byte characters?

Hi again,
actually my problem is that I'm unable to reproduce this bug. :-)
Using PostgreSQL and SQLObject, my run goes on smoothly. I downloaded the 'actors.list.gz' file today, so it's possible that some garbage was removed.

Anyway, the previously proposed solution was obviously flawed, since the problem was in _character_ names. So, let's edit the imdbpy2sql.py file again and change the lines around 1540 so that they become:

    movieid = CACHE_MID.addUnique(title)
    if role is not None:
        roles = filter(None, [x.strip() for x in role.split('/')])
        for role in roles:
            role = role.replace('\xec\x8c\xa0', '') # TEMPORARY FIX
            cid = CACHE_CID.addUnique(role)
            sqldata.add((pid, movieid, cid, note, order))

Maybe this will help... who knows? :-)

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
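On the question "Can we detect invalid byte characters?": a generic way to locate them is to try decoding the bytes and record where the decoder complains. The sketch below assumes Python 3 and uses made-up helper names (find_invalid_utf8, scrub_utf8); it is not code from imdbpy2sql.py, just a debugging aid one could run over suspicious strings.

```python
def find_invalid_utf8(data):
    """Return a list of (offset, bad_bytes) for each undecodable run in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is clean
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded.
            start = pos + exc.start
            end = pos + exc.end
            problems.append((start, data[start:end]))
            pos = end
    return problems


def scrub_utf8(data):
    """Replace every undecodable byte run with U+FFFD and re-encode."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


if __name__ == "__main__":
    sample = b"Jir\xc3 Havel"  # hypothetical sample with a stray lead byte
    print(find_invalid_utf8(sample))  # [(3, b'\xc3')]
    print(scrub_utf8(sample))
```

Re-slicing the data after every error is wasteful on large inputs, but for short strings such as names and roles it keeps the code simple. Scrubbing hides the symptom so an import can finish; it does not recover the original character.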
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Unfortunately, adding this line

    k = k.replace('\xec\x8c\xa0', '')

in the place you mentioned won't help. Still the same error in the same place :(

    SCANNING actor: Havel, Jir?
    * FLUSHING CharactersCache...
    Traceback (most recent call last):
    .
      self.flush()
      File "./imdbpy2sql.py", line 1195, in _toDB
        CURS.executemany(self.sqlstr, self.converter(l))
    psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320

On Wed, Apr 13, 2011 at 11:56 PM, Davide Alberani davide.alber...@gmail.com wrote:

On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:

      File "./imdbpy2sql.py", line 1194, in _toDB
        CURS.executemany(self.sqlstr, self.converter(l))
    psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
    HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by client_encoding.

Hi all,
I'm writing regarding the recent 0xc320 problem with IMDbPY.

The above notice is extremely interesting, and should be investigated: how can it be that 0xc320 is not UTF8 encodable? It should work; from the Python prompt:

    >>> unichr(0xc320).encode('utf8')
    '\xec\x8c\xa0'

Anyway, as a very fast and dirty fix (the main problem is probably some crap in the data files), try this: after line 1181 of imdbpy2sql.py, add:

    k = k.replace('\xec\x8c\xa0', '')

So that the nearby lines will become:

    try:
        k = k.replace('\xec\x8c\xa0', '')
        t = analyze_name(k)
    except IMDbParserError:

Please be aware that this fix was not tested at all, but I'm almost sure that, at the above point, 'k' is a string encoded in utf8.

Anyway, besides the garbage theory, I have another idea about the source of the error, but I have to verify it later...

Bye, and let me know if it works!

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Since I am not familiar with Python, maybe you could suggest some fast fix so that the script doesn't hang?

Maybe this helps: in PHP we get exactly the same encoding error when importing some wrongly decoded data. When we have no control over the data and can't run utf8_encode every time, since it could encode a string twice, I use this function to bypass the error; at least it prevents the PostgreSQL error:

    function fix_encoding($in_str) {
        $cur_encoding = mb_detect_encoding($in_str);
        if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8")) {
            return $in_str;
        } else {
            return utf8_encode($in_str);
        }
    }

Maybe you can help to adapt this function to Python, if similar functions are available, so we can use it as a quick fix?

Thanks a lot.

On Mon, Apr 11, 2011 at 10:46 PM, Davide Alberani davide.alber...@gmail.com wrote:

On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:

      File "./imdbpy2sql.py", line 1194, in _toDB
        CURS.executemany(self.sqlstr, self.converter(l))
    psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
    HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by client_encoding.

Any suggestions? I found a similar topic, but there were also no solutions.

Yes, I've had other reports about this bug. Seems to be related to some garbage in the actors.list.gz file. I hope to have time to investigate the problem within a week or two.

Thanks for the bug report!

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
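A rough Python counterpart of the PHP fix_encoding function quoted above could look like the following. It is a sketch assuming Python 3 and byte-string input, and it mirrors utf8_encode by treating anything that is not valid UTF-8 as ISO-8859-1; nothing like it is confirmed to exist in IMDbPY.

```python
def fix_encoding(raw):
    """Return raw unchanged if it is valid UTF-8, else re-encode it.

    Invalid input is assumed to be ISO-8859-1, which is the same
    assumption PHP's utf8_encode() makes.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    already_utf8 = "Y\u00fbz\u00f4".encode("utf-8")
    latin1 = b"Y\xfbz\xf4"
    print(fix_encoding(already_utf8) == already_utf8)  # True
    print(fix_encoding(latin1).decode("utf-8"))        # Yûzô
```

Like the PHP version, this always yields bytes the database will accept, but if the bad input was a damaged UTF-8 string rather than real ISO-8859-1 text, the stored result will be mojibake instead of the intended character.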
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Maybe someone knows some fast dirty fix, at least how to skip such invalid byte sequence strings while there is no official fix, so I can finish the import? Can we detect invalid byte characters? Maybe we can somehow replace or get rid of the *0xc320* character, which is the one that mostly appears. Or skip these rows.

Analyzed the error a bit more. Mostly these errors occur in Japanese actors (actors.list); strange characters appear in the filmography:

    Hayakawa, Yuzo Burai hij*8)* * *

Tried to delete these rows manually, but there are too many of them :/

Thank you.

On Wed, Apr 13, 2011 at 9:45 AM, darklow dark...@gmail.com wrote:

Since I am not familiar with Python, maybe you could suggest some fast fix so that the script doesn't hang?

Maybe this helps: in PHP we get exactly the same encoding error when importing some wrongly decoded data. When we have no control over the data and can't run utf8_encode every time, since it could encode a string twice, I use this function to bypass the error; at least it prevents the PostgreSQL error:

    function fix_encoding($in_str) {
        $cur_encoding = mb_detect_encoding($in_str);
        if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8")) {
            return $in_str;
        } else {
            return utf8_encode($in_str);
        }
    }

Maybe you can help to adapt this function to Python, if similar functions are available, so we can use it as a quick fix?

Thanks a lot.

On Mon, Apr 11, 2011 at 10:46 PM, Davide Alberani davide.alber...@gmail.com wrote:

On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:

      File "./imdbpy2sql.py", line 1194, in _toDB
        CURS.executemany(self.sqlstr, self.converter(l))
    psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
    HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by client_encoding.

Any suggestions? I found a similar topic, but there were also no solutions.

Yes, I've had other reports about this bug. Seems to be related to some garbage in the actors.list.gz file. I hope to have time to investigate the problem within a week or two.

Thanks for the bug report!

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:

      File "./imdbpy2sql.py", line 1194, in _toDB
        CURS.executemany(self.sqlstr, self.converter(l))
    psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
    HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by client_encoding.

Hi all,
I'm writing regarding the recent 0xc320 problem with IMDbPY.

The above notice is extremely interesting, and should be investigated: how can it be that 0xc320 is not UTF8 encodable? It should work; from the Python prompt:

    >>> unichr(0xc320).encode('utf8')
    '\xec\x8c\xa0'

Anyway, as a very fast and dirty fix (the main problem is probably some crap in the data files), try this: after line 1181 of imdbpy2sql.py, add:

    k = k.replace('\xec\x8c\xa0', '')

So that the nearby lines will become:

    try:
        k = k.replace('\xec\x8c\xa0', '')
        t = analyze_name(k)
    except IMDbParserError:

Please be aware that this fix was not tested at all, but I'm almost sure that, at the above point, 'k' is a string encoded in utf8.

Anyway, besides the garbage theory, I have another idea about the source of the error, but I have to verify it later...

Bye, and let me know if it works!

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
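One possible reading of the error message, offered as an assumption rather than something established in this thread: PostgreSQL prints the offending bytes in hexadecimal, so 0xc320 would be the two raw bytes C3 20 (a UTF-8 lead byte followed by a space), not the code point U+C320. The sketch below, assuming Python 3 and a made-up helper is_valid_utf8, shows the difference between the two interpretations.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Interpretation 1: U+C320 as a code point. It encodes to three bytes,
    # the same ones shown at the Python prompt in the message above.
    as_code_point = chr(0xC320).encode("utf-8")
    print(as_code_point)                 # b'\xec\x8c\xa0'
    print(is_valid_utf8(as_code_point))  # True

    # Interpretation 2: C3 and 20 as two raw bytes. C3 announces a two-byte
    # character, but 0x20 (a space) is not a valid continuation byte.
    raw_pair = bytes([0xC3, 0x20])
    print(is_valid_utf8(raw_pair))       # False
```

If the second interpretation is the right one, a replace() on '\xec\x8c\xa0' would never match anything in the data, which would be consistent with the report that the workaround changed nothing.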
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:

      File "./imdbpy2sql.py", line 1194, in _toDB
        CURS.executemany(self.sqlstr, self.converter(l))
    psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
    HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by client_encoding.

Any suggestions? I found a similar topic, but there were also no solutions.

Yes, I've had other reports about this bug. Seems to be related to some garbage in the actors.list.gz file. I hope to have time to investigate the problem within a week or two.

Thanks for the bug report!

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/