Re: [ADMIN] invalid byte sequence for encoding UTF8: 0xf481 - how could this happen?

2012-04-19 Thread Albe Laurenz
Rural Hunter wrote:
 Do you get the error if you try Chinese settings without nlpbamboo?

 How can I do this? The Chinese processing is provided by nlpbamboo.

Er, sorry, forget what I said.  I have no experience with Chinese.

I guess you should try to ask the nlpbamboo people.
Is there anything about encodings in the documentation?

Yours,
Laurenz Albe



-- 
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin


Re: [ADMIN] invalid byte sequence for encoding UTF8: 0xf481 - how could this happen?

2012-04-18 Thread Rural Hunter

于 2012/4/17 18:06, Albe Laurenz 写道:

Rural Hunter wrote:
That's probably the problem - it seems to emit something that is
not proper UTF-8 sometimes.

Do you get the error if you try Chinese settings without nlpbamboo?

How can I do this? The Chinese processing is provided by nlpbamboo.

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin


Re: [ADMIN] invalid byte sequence for encoding UTF8: 0xf481 - how could this happen?

2012-04-17 Thread Albe Laurenz
Rural Hunter wrote:
 My db is in utf-8, I have a row in my table say tmp_article and I wanted
 to generate ts_vector from the article content:
 select to_tsvector(content) from tmp_article;
 But I got this error:
 ERROR:  invalid byte sequence for encoding UTF8: 0xf481

 Do you use any nonstandard text search parsers or dictionaries?
 What is the text search configuration you use (parameter
 default_text_search_config)?

 Yes, I'm using a Chinese tsearch2 plugin called nlpbamboo and the
 default default_text_search_config is 'chinesecfg'. I tested select
 to_tsvector('english',content) from tmp_article and this works. So it's
 the problem of nlpbamboo?

That's probably the problem - it seems to emit something that is
not proper UTF-8 sometimes.

Do you get the error if you try Chinese settings without nlpbamboo?

  btw, what's the meaning of Please don't top post.? I
 don't understand but sorry if I caused any problem.

http://en.wikipedia.org/wiki/Posting_style

Yours,
Laurenz Albe

-- 
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin


Re: [ADMIN] invalid byte sequence for encoding UTF8: 0xf481 - how could this happen?

2012-04-16 Thread Albe Laurenz
Rural Hunter wrote:
 My db is in utf-8, I have a row in my table say tmp_article and I
wanted
 to generate ts_vector from the article content:
 select to_tsvector(content) from tmp_article;
 But I got this error:
 ERROR:  invalid byte sequence for encoding UTF8: 0xf481
 
 I am wondering how this could happen. I think if there was invalid
UTF8
 bytes in the content, it shouldn't have been able to inserted into the
 tmp_article table as I sometimes see similar errors when inserting
 records to tmp_article. Am I right?

You are right in theory.  A lot depends on your PostgreSQL version,
because
the efforts to prevent invalid strings from entering the database have
led to changes over the versions.  Older versions are more permissive.

To test the theory that the contents of the table are bad, you can
test if the same happens if you

SELECT convert_to(content, 'UTF8') FROM tmp_article;

Yours,
Laurenz Albe

-- 
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin


Re: [ADMIN] invalid byte sequence for encoding UTF8: 0xf481 - how could this happen?

2012-04-16 Thread Rural Hunter

=SELECT convert_to(content, 'UTF8') FROM tmp_article;
This works. My pg is at latest 9.1.3 on ubuntu 10.04 server. We have 
millions of data in the db but this is the only one we met the problem. 
The bad data is inserted in recent days and we upgraded to 9.1.3 right 
after it was released.


于 2012/4/16 16:31, Albe Laurenz 写道:

Rural Hunter wrote:

My db is in utf-8, I have a row in my table say tmp_article and I

wanted

to generate ts_vector from the article content:
select to_tsvector(content) from tmp_article;
But I got this error:
ERROR:  invalid byte sequence for encoding UTF8: 0xf481

I am wondering how this could happen. I think if there was invalid

UTF8

bytes in the content, it shouldn't have been able to inserted into the
tmp_article table as I sometimes see similar errors when inserting
records to tmp_article. Am I right?

You are right in theory.  A lot depends on your PostgreSQL version,
because
the efforts to prevent invalid strings from entering the database have
led to changes over the versions.  Older versions are more permissive.

To test the theory that the contents of the table are bad, you can
test if the same happens if you

SELECT convert_to(content, 'UTF8') FROM tmp_article;

Yours,
Laurenz Albe




--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin


Re: [ADMIN] invalid byte sequence for encoding UTF8: 0xf481 - how could this happen?

2012-04-16 Thread Albe Laurenz
Please don't top post.

Rural Hunter wrote:
 My db is in utf-8, I have a row in my table say tmp_article and I wanted
 to generate ts_vector from the article content:
 select to_tsvector(content) from tmp_article;
 But I got this error:
 ERROR:  invalid byte sequence for encoding UTF8: 0xf481

 I am wondering how this could happen. I think if there was invalid UTF8
 bytes in the content, it shouldn't have been able to inserted into the
 tmp_article table as I sometimes see similar errors when inserting
 records to tmp_article. Am I right?

 You are right in theory.  A lot depends on your PostgreSQL version,
 because
 the efforts to prevent invalid strings from entering the database have
 led to changes over the versions.  Older versions are more permissive.

 To test the theory that the contents of the table are bad, you can
 test if the same happens if you

 SELECT convert_to(content, 'UTF8') FROM tmp_article;

 =SELECT convert_to(content, 'UTF8') FROM tmp_article;
 This works. My pg is at latest 9.1.3 on ubuntu 10.04 server. We have
 millions of data in the db but this is the only one we met the problem.
 The bad data is inserted in recent days and we upgraded to 9.1.3 right
 after it was released.

Hmm, that is strange.
Could you find the row that causes the problem and post the
result of CAST (content AS bytea) for this row?

Do you use any nonstandard text search parsers or dictionaries?
What is the text search configuration you use (parameter
default_text_search_config)?

Yours,
Laurenz Albe

-- 
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin


Re: [ADMIN] invalid byte sequence for encoding UTF8: 0xf481 - how could this happen?

2012-04-16 Thread Rural Hunter

Hi Albe,
Yes, I'm using a Chinese tsearch2 plugin called nlpbamboo and the 
default default_text_search_config is 'chinesecfg'. I tested select 
to_tsvector('english',content) from tmp_article and this works. So it's 
the problem of nlpbamboo? The result of 'CAST (content AS bytea)'  is 
shown below. btw, what's the meaning of Please don't top post.? I 
don't understand but sorry if I caused any problem.


\xe6a0b8e5bf83e68f90e7a4baefbc9ae69caae69da5e4b889e5b9b4e5b086e698afe694bfe5ba9ce99c80e6b182e9878ae694beefbc8ce8a18ce4b89ae9ab98e9809fe68890e995bfe79a84e697b6e69c9fefbc8ce585ac
e58fb8e79baee5898de582a8e5a487e9a1b9e79baee4bc97e5a49aefbc8ce4b89ae7bba9e5ad98e59ca8e59091e4b88ae8b685e9a284e69c9fe79a84e58fafe883bdefbc8ce5bd93e5898de79a84e4bcb0e580bce6b0b4e5b
9b3e4b88de9ab98efbc8ce68c89e785a73230313220e5b9b43235efbd9e333020e5808de79a84502f45efbc8ce585ace58fb8e4bcb0e580bce58cbae997b4e59ca832322e3439efbd9e32362e393920e58583e58cbae997b4
efbc8ce8be83e79baee5898de882a1e4bbb7e7baa6e69c893425efbd9e323525e79a84e4b88ae58d87e7a9bae997b4efbc8ce7bbb4e68c81e2809ce68ea8e88d90e2809de79a84e68a95e8b584e8af84e7baa7e380820ae8b
083e7a094e58685e5aeb90ae8bf91e697a5efbc8ce58f82e58aa0e4ba86e585ace58fb8e7bb84e7bb87e79a84e68a95e8b584e88085e68ea5e5be85e6b4bbe58aa8efbc8ce4b88ee585ace58fb8e9ab98e7aea1e8bf9be8a1
8ce4ba86e4baa4e6b581e380820ae4b8bbe8a681e8a782e782b90ae2809ce58d81e4ba8cc2b7e4ba94e2809de69c9fe997b4e694bfe5ba9ce99c80e6b182e6988ee698bee5a29ee58aa0efbc8ce585ace58fb8e58aa0e9809
fe8b791e9a9ace59c88e59cb0e380820ae7bb8fe8bf87e58c97e4baace38081e4b88ae6b5b7e38081e69dade5b79ee38081e5ae81e6b3a2e7ad89e58588e8bf9be59f8ee5b882e794b5e5ad90e694bfe58aa1e5bbbae8aebe
e79a84e7a4bae88c83e4bd9ce794a8efbc8ce59084e59cb0e694bfe5ba9ce68488e58f91e58585e58886e8aea4e8af86e588b0e695b0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe5afb9e694bfe5ba9ce59
f8ee5b882e7aea1e79086e6b0b4e5b9b3e68f90e58d87e68980e8b5b7e588b0e79a84e9878de5a4a7e4bd9ce794a8efbc8ce8808ce99a8fe79d80e68890e58a9fe6a188e4be8be79a84e4b88de696ade7a7afe7b4afefbc8c
e694bfe5ba9ce887aae8baabe5928ce585ace58fb8e5afb9e99c80e6b182e79a84e79086e8a7a3e4b99fe983bde69bb4e58aa0e6b7b1e585a5e38082e7bb8fe8bf873230313020e5b9b4e5928c3230313120e5b9b4e79a84e
694bfe5ba9ce68da2e5b18ae4b88ee8a784e58892e588b6e5ae9aefbc8c32303132efbd9e3230313420e5b9b4e698afe59cb0e696b9e694bfe5ba9ce2809ce58d81e4ba8cc2b7e4ba94e2809de5bbbae8aebee585b7e4bd93
e5ae9ee696bde5b9b4efbc8ce79baee5898de59bbde58685e7baa63236303020e4b8aae58ebfe7baa7e68896e4bba5e4b88ae79a84e5b882efbc88e58ebfefbc89e4b8adefbc8ce5b7b2e7bb8fe5bbbae8aebee4ba86e695b
0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe79a84e4bb8533303020e69da5e5aeb6efbc8ce794b1e4ba8ee59cb0e696b9e694bfe5ba9ce59ca8e5bbbae8aebee696b9e99da2e79a84e68a95e585a5e585b7
e69c89e78886e58f91e5bc8fe79a84e789b9e782b9efbc8ce88083e89991e588b0e585ace58fb83230313220e5b9b4e4b880e5ada3e5baa6e4b89ae7bba9e79a84e5a4a7e5b985e5a29ee995bfefbc8ce79bb8e4bfa1e4bb8
ee4bb8ae5b9b4e5bc80e5a78be79a843320e5b9b4efbc8ce5b086e698afe695b4e4b8aae8a18ce4b89ae5bfabe9809fe58f91e5b195e79a84e697b6e69c9fe38082e585ace58fb8e698afe8a18ce4b89ae4b8ade79a84e9be
99e5a4b4e4bc81e4b89aefbc8ce5b882e59cbae58da0e69c89e78e87e59ca8363025e4bba5e4b88ae38082e8808ce4b894e585ace58fb8e59ca8e59084e4bba3e8a1a8e680a7e59f8ee5b882e79a84e9a1b9e79baee585b7e
69c89e6a0b7e69dbfe69588e5ba94efbc8ce58fafe4bba5e4b8bae585ace58fb8e5b8a6e69da5e69bb4e5a49ae79a84e694bfe5ba9ce5aea2e688b7efbc8ce5b9b6e4b894e5ae9ee78eb0e4ba86e99c80e6b182e79086e8a7
a32de99c80e6b182e5ae9ee78eb02de99c80e6b182e5bc95e5afbce79a84e889afe680a7e5beaae78eafe38082e59ba0e6ada4efbc8ce8aea4e4b8bae59ca8e69c80e8bf91e79a84e4b880e4b8a4e5b9b4efbc8ce585ace58
fb8e5b086e4bba5e9ab98e4ba8ee8a18ce4b89ae6b0b4e5b9b3e79a84e5a29ee9809fe8bf85e78c9be58f91e5b195efbc8ce4b99fe58db3e8bf9be585a5e8b791e9a9ace59c88e59cb0efbc8ce5b7a9e59bbae5b882e59cba
e58da0e69c89e78e87e79a84e697b6e69c9fe38082e6ada4e5908eefbc8ce99a8fe79d80e694bfe5ba9ce5908ee7bbade68a95e585a5e58fafe883bde587bae78eb0e79a84e591a8e69c9fe680a7e6b3a2e58aa8efbc8ce58
5ace58fb8e9809ae8bf87e59084e9a1b9e4b89ae58aa1e7bb93e69e84e79a84e8b083e88a82e5ae9ee78eb0e5b9b3e7a8b3e8be83e5bfabe9809fe79a84e5a29ee995bfefbc8ce59ca8e8bf99e4b880e998b6e6aeb5efbc8c
e9a284e8aea1e5a29ee9809fe887b3e5b091e4b99fe59ca8323025efbd9e333025e58cbae997b4e38082e5a29ee58aa0e7b3bbe7bb9fe99b86e68890e4b89ae58aa1e9878fefbc8ce689a9e5a4a7e694b6e585a5e8a784e6a
8a1e5b9b6e58aa0e5bcbae5aea2e688b7e7b298e680a7e7b297e795a5e4bcb0e7ae97efbc8ce79bb4e8be96e5b882e38081e79c81e4bc9ae7baa7e59f8ee5b882e38081e59cb0e7baa7e5b882efbc88e4b88de590abe79c81
e4bc9ae59f8ee5b882efbc89e38081e5b882e8be96e58cbae38081e58ebfe7baa7e5b882efbc88e58ebfefbc89e695b0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe79a84e5bbbae8aebee8b4b9e794a8e7b
aa6e4b8ba3530303020e4b887e58583e380813330303020e4b887e58583e380813235303020e4b887e58583e380813130303020e4b887e58583e5928c38303020e4b887e58583e38082e68c89e785a7e79baee5898de68891

Re: [ADMIN] invalid byte sequence for encoding UTF8: 0xf481 - how could this happen?

2012-04-14 Thread raghu ram
On Sat, Apr 14, 2012 at 9:31 AM, Rural Hunter ruralhun...@gmail.com wrote:

  doesn't work either.

 db=# show client_encoding;
  client_encoding
 -
  UTF8
 (1 row)

 db=# set client_encoding='LATIN1';
 SET
 db=# show client_encoding;
  client_encoding
 -
  LATIN1
 (1 row)

 db=# select to_tsvector(content) from tmp_article;

 ERROR:  invalid byte sequence for encoding UTF8: 0xf481


Try to set client_encoding='SQL_ASCII'

and then execute

select to_tsvector(content) from tmp_article;


-- 

Thanks  Regards,

Raghu Ram

EnterpriseDB: http://www.enterprisedb.com


Re: [ADMIN] invalid byte sequence for encoding UTF8: 0xf481 - how could this happen?

2012-04-14 Thread Rural Hunter

  
  
still doesn't work.
db=# set client_encoding='SQL_ASCII' ;
SET
db=# select to_tsvector(content) from tmp_article;
ERROR: invalid byte sequence for encoding "UTF8": 0xf481


 2012/4/15 1:38, raghu ram :

  
  On Sat, Apr 14, 2012 at 9:31 AM, Rural
Hunter ruralhun...@gmail.com
wrote:

   doesn't work either.

db=# show client_encoding;
client_encoding 
-
UTF8
(1 row)

db=# set client_encoding='LATIN1';
SET
db=# show client_encoding;
client_encoding 
-
LATIN1
(1 row)

db=# select to_tsvector(content) from tmp_article;

  ERROR: invalid byte sequence for encoding "UTF8": 0xf481

  




  

  Try to set client_encoding='SQL_ASCII'
  
  
  and then execute



select to_tsvector(content) from tmp_article;
  



  
  -- 
  Thanks  Regards,
  Raghu Ram
  EnterpriseDB: http://www.enterprisedb.com
  


  



Re: [ADMIN] invalid byte sequence for encoding UTF8: 0xf481 - how could this happen?

2012-04-13 Thread raghu ram
2012/4/14 Rural Hunter ruralhun...@gmail.com

 My db is in utf-8, I have a row in my table say tmp_article and I wanted
 to generate ts_vector from the article content:
 select to_tsvector(content) from tmp_article;
 But I got this error:
 ERROR:  invalid byte sequence for encoding UTF8: 0xf481

 I am wondering how this could happen. I think if there was invalid UTF8
 bytes in the content, it shouldn't have been able to inserted into the
 tmp_article table as I sometimes see similar errors when inserting records
 to tmp_article. Am I right?


This error can also happen if the byte sequence does not match the
encodingexpected by the server, which is controlled by
client_encoding.
Try to set client_encoding='LATIN1'

and then execute

select to_tsvector(content) from tmp_article;

-- 

Thanks  Regards,

Raghu Ram

EnterpriseDB: http://www.enterprisedb.com


Re: [ADMIN] invalid byte sequence for encoding UTF8: 0xf481 - how could this happen?

2012-04-13 Thread Rural Hunter

  
  
doesn't work either.

db=# show client_encoding;
client_encoding 
-
UTF8
(1 row)

db=# set client_encoding='LATIN1';
SET
db=# show client_encoding;
client_encoding 
-
LATIN1
(1 row)

db=# select to_tsvector(content) from tmp_article;
ERROR: invalid byte sequence for encoding "UTF8": 0xf481

 2012/4/14 10:15, raghu ram :

  
  2012/4/14 Rural Hunter ruralhun...@gmail.com

  My db is in utf-8, I have a row in my table say tmp_article
  and I wanted to generate ts_vector from the article content:
  select to_tsvector(content) from tmp_article;
  But I got this error:
  ERROR: invalid byte sequence for encoding "UTF8": 0xf481
  
  I am wondering how this could happen. I think if there was
  invalid UTF8 bytes in the content, it shouldn't have been able
  to inserted into the tmp_article table as I sometimes see
  similar errors when inserting records to tmp_article. Am I
  right?
  




  This error can also happen if the byte sequence
does not match the encoding
expected by the server, which is controlled by
"client_encoding".

Try to set client_encoding='LATIN1'


and then execute
  
  
  
  select to_tsvector(content) from tmp_article;
  
  
  -- 
  Thanks  Regards,
  Raghu Ram
  EnterpriseDB: http://www.enterprisedb.com