
[Imdbpy-help] Fwd: imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-09-19 Thread darklow

Hi,

This fix worked for a few months, but unfortunately a similar encoding
error appears with the latest data files (16 Sep 2011).
I am using the latest dev version in a virtualenv: IMDbPY==4.8dev-20110822 and
Python 2.6.
This configuration worked perfectly with the previous data files, so there
must be some kind of garbage again in the actors and characters files.
Here is the full log for error:

SCANNING actor: Ribeiro, Freddy
SCANNING actor: Richard, Darryl
 * FLUSHING SQLData...
SCANNING actor: Richardson, Ian (I)
SCANNING actor: Richter, Friedrich
 * FLUSHING SQLData...
SCANNING actor: Riebisi, Romeo
SCANNING actor: Rignault, Alexandre
 * FLUSHING CharactersCache...
Traceback (most recent call last):
  File "./bin/imdbpy2sql.py", line 5, in <module>
    pkg_resources.run_script('IMDbPY==4.8dev-20110822', 'imdbpy2sql.py')
  File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg/pkg_resources.py", line 489, in run_script
  File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg/pkg_resources.py", line 1207, in run_script
  File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 2959, in <module>
    run()
  File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 2820, in run
    castLists(_charIDsList=characters_imdbIDs)
  File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 1584, in castLists
    doCast(f, roleid, rolename)
  File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 1543, in doCast
    cid = CACHE_CID.addUnique(role)
  File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 966, in addUnique
    else: return self.add(key, miscData)
  File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 959, in add
    self[key] = c
  File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 869, in __setitem__
    self.flush()
  File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 892, in flush
    self._toDB(quiet)
  File "/usr/share/nginx/store/imdb/virtualenv/lib/python2.6/site-packages/IMDbPY-4.8dev_20110822-py2.6-linux-x86_64.egg/EGG-INFO/scripts/imdbpy2sql.py", line 1194, in _toDB
    CURS.executemany(self.sqlstr, self.converter(l))
*psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320*


Any ideas?
Thanks.
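
One way to narrow this down without waiting for a full import is to check each byte string for valid UTF-8 before it is handed to the database. The following is only a minimal sketch in Python 3 syntax (the thread itself uses Python 2.6); `find_invalid_utf8` and the sample batch are hypothetical and are not part of imdbpy2sql.py.

```python
def find_invalid_utf8(rows):
    """Return (index, raw bytes, offset of first bad byte) for each
    byte string in rows that is not valid UTF-8."""
    bad = []
    for i, raw in enumerate(rows):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            bad.append((i, raw, exc.start))
    return bad


# Hypothetical sample batch: the second entry contains the byte 0xc3
# followed by a space (0x20), the pair named in the error message.
batch = [
    b"Richter, Friedrich",
    b"Burai hij\xc3 ",
    "Ji\u0159\u00ed".encode("utf-8"),
]
invalid = find_invalid_utf8(batch)
for index, raw, offset in invalid:
    print("row %d: bad byte at offset %d in %r" % (index, offset, raw))
```

Printing the offending rows just before the `executemany` call would show which names or characters carry the bad bytes.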


On Mon, May 2, 2011 at 9:32 PM, Davide Alberani
davide.alber...@gmail.com wrote:

 On Mon, May 2, 2011 at 08:47, darklow dark...@gmail.com wrote:
 
  Thank you for your patience and for guiding me through the tests. I'm really
  glad we finally found the problem and fixed it.

 Yep, even if it took a little too long. :-)

  Just curious: why did only I and one other user encounter this problem,
  while you didn't see the error when you ran the same tests? :)

 It may have something to do with the Python library used to connect to
 Postgres.  Maybe some libraries handle this kind of error gracefully; I have
 to check more carefully the versions installed on my system and in the
 virtualenv I've used to reproduce the bug.
 In fact, the right thing to do in such cases is to raise an exception (as in
 our case); other databases, or libraries used to connect to them, such as
 MySQL, simply ignore these errors with a warning (not a great idea).

 --
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/

--
All the data continuously generated in your IT infrastructure contains a
definitive record of customers, application performance, security
threats, fraudulent activity and more. Splunk takes this data and makes
sense of it. Business sense. IT sense. Common sense.
http://p.sf.net/sfu/splunk-d2dcopy1___
Imdbpy-help mailing list
Imdbpy-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-help

Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-04-26 Thread darklow

Thanks, let me know if you have any ideas on how to fix the problem...
About virtualenv: I was also quite pedantic about ignoring the virtualenv
solution. I am a programmer, not a system administrator, and I am not familiar
with Python. I understand the code logic, but I haven't written any Python
application so far, just one test parser to diagnose the error.
I looked at the virtualenv documentation, but I didn't understand how to use
it. The problem is my limited knowledge of Python and its components; I think
you have to be more familiar with Python, its libraries, and the way they are
installed and configured before installing and configuring virtualenv.
Also, our sysadmin is quite a pain in the a.. It is hard to justify the need
for one new tool or another. If it has a stable Debian package, then
it is easier. But for all other packages it is almost impossible. Also, I am
not sure I want to intrude on the sysadmin's environment and do installs
myself, even if they don't require root access.


On Mon, Apr 25, 2011 at 1:19 AM, Davide Alberani
davide.alber...@gmail.com wrote:

 On Sun, Apr 24, 2011 at 22:44, darklow dark...@gmail.com wrote:
  Yes, I can confirm: script version 4.6 works perfectly on the same server
  with the same files.
  And I think this brings us closer to a solution.

 Excellent!  (well, it still baffles me why I'm absolutely unable to
 reproduce the problem on my system, but that's another story...)

  Maybe this helps to identify the problem; this is what we did on our server.
  (Remember, we are doing this copying because only stable Debian versions
  are allowed on the server, but we need the md5 hashes from version 4.7.)

 I'll look at your setup tomorrow.  I'll surely sound pedantic, but...
 seriously:
 why don't you use a virtualenv environment?  It's easy to install and
 doesn't require root privileges.



 --
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/


Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-04-24 Thread darklow

There have never been any issues with our PostgreSQL database; we have always
used UTF-8 and are using it this time too.
I have tried plenty of scripts and workarounds so far, including many
decode().encode() attempts, but nothing helps; I just get different errors.
I also tried adding the following lines, to be sure everything is fine with
the connection to the database:

import psycopg2
import psycopg2.extensions
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

import codecs
sys.setdefaultencoding('utf-8')

CURS.execute("SET NAMES 'utf8'")
CURS.execute("SET CLIENT_ENCODING TO 'utf8'")


But still nothing helps.
I tried reinstalling all installed dependencies and running from clean sources,
but no luck.
I tried running the scripts with SQLAlchemy instead of SQLObject, but got the
same error, so the problem is not there.

I would like to ask you one thing.
Every test takes about 1 hour, because the error occurs in the actors cast list.
Can you please tell me the exact sequence of calls that converts
a line from the file into SQL?
Then I could create a new script that takes a small version of actors.list with
only the problematic lines, runs a few unicode() and decode() calls in the
correct order, and tries to insert these lines into a test table in the
database. That way I could test much faster and not wait 1 hour for every try...

What I have already tried is to open the actors.list file with PHP, read every
line, convert the string to UTF-8 using iconv, and insert it into the PostgreSQL
database, and everything worked fine. It makes me think that the problem might
be somewhere in cutting the line into pieces: maybe it does something wrong and
cuts some good Unicode character into pieces, so that an invalid byte sequence
appears. If I had the correct list of functions for Python, I could run more
tests.

PS. I am just running a test with version 4.6, to see if it still works;
then we could diagnose more easily by looking at the file changes.
I'll post the results.

Thank you.
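
The "character cut into pieces" idea can be checked in isolation. The sketch below (Python 3 syntax; the sample name and the helpers `is_valid_utf8` and `truncate_utf8` are made up for illustration and are not taken from imdbpy2sql.py) shows that slicing UTF-8 bytes in the middle of a two-byte character leaves a dangling 0xc3, and that a following space then gives exactly the 0xc3 0x20 pair. Whether this is what actually happens in the script is still only a guess.

```python
def is_valid_utf8(raw):
    """True if raw (bytes) decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def truncate_utf8(raw, limit):
    """Cut valid UTF-8 bytes to at most `limit` bytes without leaving
    half of a multi-byte character at the end."""
    return raw[:limit].decode("utf-8", errors="ignore").encode("utf-8")


encoded = "Jos\u00e9".encode("utf-8")   # b'Jos\xc3\xa9', 5 bytes
cut = encoded[:4]                       # b'Jos\xc3': the e-acute is split
joined = cut + b" (I)"                  # now contains 0xc3 0x20

print(is_valid_utf8(encoded), is_valid_utf8(cut), is_valid_utf8(joined))
print(truncate_utf8(encoded, 4))
```

Slicing the unicode string before encoding it, rather than slicing the encoded bytes, avoids the problem altogether.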

On Sat, Apr 23, 2011 at 3:23 PM, Davide Alberani
davide.alber...@gmail.com wrote:

 On Wed, Apr 20, 2011 at 14:08, darklow dark...@gmail.com wrote:
  Still no luck :/ maybe the problem is in some environmental variables or
  settings, which on installed version are present, but running from source
  are missing or incorrect?

 Seems unlikely to me.

  What about this, i printed out some variables:
  print sys.stdout.encoding - UTF-8
  print sys.stdin.encoding   - UTF-8
  print sys.getdefaultencoding(); - ascii
  Is it ok that  sys.getdefaultencoding(); == ascii ?

 These are fine.

 I've reproduced, to the best of my ability, your environment:
 - no IMDbPY installed in the system.
 - IMDbPY from source (the latest version in the Mercurial repository),
  setting the PYTHONPATH environment variable to point to the
  source directory.
 - the cutils C module was not compiled.
 - the last actors.list.gz file.
 - postgres 8.4; my database was created with these settings:
  CREATE DATABASE imdb
WITH OWNER = postgres
   ENCODING = 'UTF8'
   TABLESPACE = pg_default
   LC_COLLATE = 'it_IT.utf8'
   LC_CTYPE = 'it_IT.utf8'
   CONNECTION LIMIT = -1;

 I've run it with your and other portions of the actors.list.gz file, and
 everything went fine.

 Now... if I were you, I'd:
 - create a virtualenv environment with:
virtualenv --no-site-packages
 - install IMDbPY in it, using easy_install or pip (the executable in
  your virtualenv, I mean) so that you'll have all the correct dependencies
  available.
 - run the imdbpy2sql.py within your virtualenv.

 If it still fails:
 - check your postgres settings.
 - try using SQLite (just for a test) - see notes in README.sqldb


 HTH,
 --
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/


Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-04-24 Thread darklow

Yes, I can confirm: script version 4.6 works perfectly on the same server with
the same files.
And I think this brings us closer to a solution.
Maybe this helps to identify the problem; this is what we did on our server.
(Remember, we are doing this copying because only stable Debian versions
are allowed on the server, but we need the md5 hashes from version 4.7.)

1. We installed imdbpy 4.6 with all the dependencies
(python-psycopg2, python-dns, python-formencode, python-pkg-resources,
python-sqlobject).
2. I downloaded version 4.7 and overwrote the following directories with files
from the 4.7 source:

cp -r imdbpy4.7/docs/* /usr/share/doc/python-imdb/
cp -r imdbpy4.7/imdb/* /usr/share/pyshared/imdb/


3. Now I run imdbpy2sql.py from the version 4.7 source like before and it fails
with the invalid byte sequence error.
4. I copied the version 4.6 files back to the mentioned directories and the
import with version 4.6 works again.

Looking at the install log, I didn't see any other relevant files that I
should overwrite. So the problem might be in the dependencies.
Do you have any idea where the problem could be and what else we should
overwrite or update so that v4.7 works?
Thank you.



Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-04-20 Thread darklow

Still no luck :/ Maybe the problem is in some environment variables or
settings which are present in the installed version, but missing or incorrect
when running from source?

What about this, i printed out some variables:

print sys.stdout.encoding - UTF-8
print sys.stdin.encoding   - UTF-8
print sys.getdefaultencoding(); - ascii

Is it ok that  sys.getdefaultencoding(); == ascii ?

Maybe there are some more variables i should check?


On Tue, Apr 19, 2011 at 11:11 PM, Davide Alberani davide.alber...@gmail.com
 wrote:

 On Mon, Apr 18, 2011 at 09:30, Davide Alberani
 davide.alber...@gmail.com wrote:
 
  Thanks for the file, I hope to look at it within a day or two.

 Ok: the file is correctly encoded in iso8859-1, as expected, and contains
 no garbage.

 Using it as the only input for imdbpy2sql.py (putting the attached file in
 a directory by itself), I can run the script with no errors (besides
 the expected warnings about missing files).

 I'm using the version from the Mercurial repository, without the cutils.so
 library.

 Please, if you can't install IMDbPY on your system, consider using
 virtualenv.
 Once you have tried that, I have to recommend that you double-check the
 settings of your PostgreSQL server for any inconsistencies
 in encodings and collations.

 HTH,
 --
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/


Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-04-18 Thread darklow

On Sun, Apr 17, 2011 at 5:13 PM, Davide Alberani
davide.alber...@gmail.com wrote:

 On Sun, Apr 17, 2011 at 14:04, darklow dark...@gmail.com wrote:
  Updated this morning to the latest data files: no change, and unfortunately
  this fix also doesn't work.

 Hmm...  to debug a problem like this without being able to reproduce,
 is extremely difficult. :-/

  This error started when we uninstalled imdbpy (leaving all the dependency
  libs) and started running it without installation. Maybe there is some kind
  of problem with hidden Unicode dependencies? Maybe you can try to run it
  without installation, just from source?

 Have you some very good reason to do so? :-)


We have Debian Linux on our server and our sysadmin allows only stable
packages. However, the latest version of imdbpy has the md5 checksums that are
quite important in our situation; that is why I have to run it from source.


 Can't you try to purge every reference to IMDbPY left on the
 system (search for the scripts in /usr/bin/ and /usr/local/bin/ and
 be sure that "import imdb" fails at the Python prompt) and see
 if the problem is solved after IMDbPY 4.7 is reinstalled?


Unfortunately, right now I can't reinstall it; I can only run it from source.
However, if this is the cause and there is no other way to fix it, I'll try
to convince the sysadmin to install this version from unofficial Debian
packages.


 If you have problems locating the IMDbPY package, just open
 the Python prompt and:
  >>> import imdb
  >>> print imdb

  Also, every time I start the script I receive two warnings:
  2011-04-17 11:13:37,398 WARNING [imdbpy.parser.sql.aux]
  /data/web/imdb/imdbpy4.7-159671/imdb/parser/sql/__init__.py:125: Unable to
  import the cutils.ratcliff function.  Searching names and titles using the
  sql data access system will be slower.

 This will force IMDbPY to use some pure-Python fall-back functions.
 It's entirely possible that there is some bug in these functions, even
 if a run without cutils.so works fine for me (so far).

  IMPORTING psyco... FAILED (not a big deal, everything is alright...)

 That's not a problem for sure.

 Right now, my first guess is that somewhere, after the *.list files are
 read and turned into utf-8 encoded strings, the imdbpy2sql.py
 script does Something Very Wrong(tm) to a string (like cutting it at a
 certain place, ending up cutting a single utf-8 encoded char in two: this
 could explain the error).

 I've tried the conversion suggested by Petite Abeille, and it works fine.

 Please, could you cut a small piece (a few kilobytes) of the actors.list
 file and attach it (no cut-and-paste)?
 It goes without saying that you should choose a portion where you see
 (or guess there are) the strange chars :-)


I attached a small part of the actors.list file, right at the place with the
broken characters (I used the Unix sed command to cut the problematic lines
out).



 Thanks!

 --
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/



actors.list.small
Description: Binary data

Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-04-14 Thread darklow

Unfortunately, adding this line
k = k.replace('\xec\x8c\xa0', '') in the place you mentioned doesn't help.

Still the same error in the same place :(

SCANNING actor: Havel, Jir?
 * FLUSHING CharactersCache...
Traceback (most recent call last):
 .
    self.flush()
  File "./imdbpy2sql.py", line 1195, in _toDB
    CURS.executemany(self.sqlstr, self.converter(l))
psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
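
A possible reason the replace() has no effect: as far as I can tell, PostgreSQL prints the offending bytes in hex, so 0xc320 most likely means the two bytes 0xc3 0x20 (a UTF-8 lead byte followed by a space) and not the code point U+C320. The sketch below, in Python 3 syntax where `chr()` plays the role of `unichr()`, only illustrates that difference; the `sample` string is made up.

```python
# The code point U+C320 encodes to three bytes, as shown at the Python prompt.
as_utf8 = chr(0xC320).encode("utf-8")       # b'\xec\x8c\xa0'

# The two raw bytes 0xc3 0x20 are something else entirely: 0xc3 announces a
# two-byte character, but 0x20 (space) is not a valid continuation byte.
raw_pair = bytes([0xC3, 0x20])
try:
    raw_pair.decode("utf-8")
    pair_is_valid = True
except UnicodeDecodeError:
    pair_is_valid = False

# Hypothetical damaged value: replacing the three-byte sequence changes
# nothing, because those bytes never occur in it.
sample = b"Burai hij\xc3 "
unchanged = sample.replace(as_utf8, b"") == sample

print(as_utf8, pair_is_valid, unchanged)
```

If this reading is right, the value to look for in the data is a lone 0xc3 byte, not the sequence '\xec\x8c\xa0'.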

On Wed, Apr 13, 2011 at 11:56 PM, Davide Alberani davide.alber...@gmail.com
 wrote:

 On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:
 
    File "./imdbpy2sql.py", line 1194, in _toDB
      CURS.executemany(self.sqlstr, self.converter(l))
  psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
  HINT:  This error can also happen if the byte sequence does not match the
  encoding expected by the server, which is controlled by client_encoding.

 Hi all,
 I'm writing regarding the recent 0xc320 problem with IMDbPY.
 The above notice is extremely interesting, and should be investigated:
 how can it be that 0xc320 is not UTF8 encodable?
 It should work; from the Python prompt:
   >>> unichr(0xc320).encode('utf8')
   '\xec\x8c\xa0'

 Anyway, as a very fast and dirty fix (the main problem is probably some
 crap in the data files), try this: after line 1181 of imdbpy2sql.py, add:
  k = k.replace('\xec\x8c\xa0', '')

 So that the nearby lines will become:
     try:
         k = k.replace('\xec\x8c\xa0', '')
         t = analyze_name(k)
     except IMDbParserError:

 Please be aware that this fix was not tested at all, but I'm
 almost sure that, at the above point, 'k' is a string encoded in utf8.

 Anyway, beside the garbage theory, I have another idea
 about the source of the error, but I have to verify it later...

 Bye, and let me know if it works!

 --
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/


Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-04-13 Thread darklow

Since I am not familiar with Python, maybe you could suggest some quick fix
so that the script doesn't stop?
Maybe this helps: in PHP we get exactly the same encoding error when
importing some wrongly encoded data. When we have no control over the data, we
can't always call utf8_encode, since it could encode a string twice. To
bypass this error I use the following function, which at least prevents the
PostgreSQL error:

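function fix_encoding($in_str) {
    $cur_encoding = mb_detect_encoding($in_str);
    // Leave strings that are already valid UTF-8 untouched.
    if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8")) {
        return $in_str;
    } else {
        // Otherwise treat the input as ISO-8859-1 and convert it.
        return utf8_encode($in_str);
    }
}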

Maybe you can help adapt this function to Python, if similar functions are
available, so we can use it as a quick fix?
Thanks a lot.
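
For reference, a rough Python counterpart of the PHP function could look like the sketch below. It is written in Python 3 syntax and is untested against imdbpy2sql.py; the function name is simply carried over from the PHP version, and the ISO-8859-1 fallback is an assumption that mirrors what utf8_encode does.

```python
def fix_encoding(raw, fallback="iso-8859-1"):
    """Return raw (bytes) unchanged if it is already valid UTF-8;
    otherwise reinterpret it as `fallback` and re-encode it as UTF-8."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Every byte is a valid ISO-8859-1 character, so this cannot fail.
        return raw.decode(fallback).encode("utf-8")


print(fix_encoding(b"Jos\xc3\xa9"))    # already UTF-8, returned as-is
print(fix_encoding(b"Jos\xe9"))        # ISO-8859-1 input, converted
print(fix_encoding(b"hij\xc3 "))       # broken UTF-8, 0xc3 becomes a two-byte char
```

Like the PHP version, this only guarantees that the database accepts the bytes; a value that was damaged earlier still ends up with a wrong character in it.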

On Mon, Apr 11, 2011 at 10:46 PM, Davide Alberani davide.alber...@gmail.com
 wrote:

 On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:
 
    File "./imdbpy2sql.py", line 1194, in _toDB
      CURS.executemany(self.sqlstr, self.converter(l))
  psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
  HINT:  This error can also happen if the byte sequence does not match the
  encoding expected by the server, which is controlled by client_encoding.
 
  Any suggestions? I found similar topic, but there were also no solutions.

 Yes, I've had other reports about this bug.
 Seems to be related to some garbage in the actors.list.gz file.
 I hope to have time to investigate the problem within a week or two.

 Thanks for the bug report!

 --
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/


Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-04-13 Thread darklow

Maybe someone knows a quick and dirty fix, at least a way to skip strings with
such invalid byte sequences while there is no official fix, so I can finish the
import?
Can we detect invalid bytes? Maybe we can somehow replace or get
rid of the *0xc320* sequence, which is the one that appears most often. Or skip
these rows.

I analyzed the error a bit more. These errors mostly occur for Japanese actors
(actors.list); strange characters appear in the filmography:

Hayakawa, Yuzo

Burai hij*8)*
*
*

I tried to delete these rows manually, but there are too many of them :/
Thank you.
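
Both options, replacing the bad bytes or skipping the affected rows, can be sketched in a few lines of Python. This is Python 3 syntax, and `scrub` and `keep_valid` are made-up helper names; where exactly such a filter would have to be hooked into imdbpy2sql.py is not something this sketch answers.

```python
def scrub(raw):
    """Replace every invalid byte with U+FFFD so the result is valid UTF-8."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


def keep_valid(rows):
    """Split rows (byte strings) into (valid, skipped) lists."""
    valid, skipped = [], []
    for raw in rows:
        try:
            raw.decode("utf-8")
            valid.append(raw)
        except UnicodeDecodeError:
            skipped.append(raw)
    return valid, skipped


# Hypothetical rows; the second one carries the 0xc3 0x20 pair.
rows = [b"Hayakawa, Yuzo", b"Burai hij\xc3 "]
valid, skipped = keep_valid(rows)
print(valid, skipped)
print(scrub(rows[1]))
```

Skipping loses the whole row, while scrubbing keeps it with a replacement character in place of the damaged one.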





