Re: [Wikitech-l] correct way to import SQL dumps into MySQL database in terms of character encoding

2012-04-01 Thread Svip
On 1 April 2012 16:04, Piotr Jagielski piotr.jagiel...@op.pl wrote:

> mysql --user root --password=root wiki <
> C:\Path\plwiki-20111227-categorylinks.sql --default-character-set=utf8

It's -p, not --password=root and it will prompt you for the password.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] correct way to import SQL dumps into MySQL database in terms of character encoding

2012-04-01 Thread Piotr Jagielski
These options should be equivalent. It does load the data using the 
below command. It just incorrectly handles non-English characters.


Regards,
Piotr

On 2012-04-01 16:31, Svip wrote:

> On 1 April 2012 16:04, Piotr Jagielski piotr.jagiel...@op.pl wrote:
>
>> mysql --user root --password=root wiki <
>> C:\Path\plwiki-20111227-categorylinks.sql --default-character-set=utf8
>
> It's -p, not --password=root and it will prompt you for the password.



Re: [Wikitech-l] correct way to import SQL dumps into MySQL database in terms of character encoding

2012-04-01 Thread Platonides
On 01/04/12 17:05, Piotr Jagielski wrote:
> These options should be equivalent. It does load the data using the
> below command. It just incorrectly handles non-English characters.
>
> Regards,
> Piotr

Do you have $wgDBmysql5 set in your LocalSettings.php?





Re: [Wikitech-l] correct way to import SQL dumps into MySQL database in terms of character encoding

2012-04-01 Thread Piotr Jagielski
I don't have MediaWiki installed. I'm just trying to import the dump 
into a standalone database so I can do some batch processing on the data.


Regards,
Piotr



Re: [Wikitech-l] correct way to import SQL dumps into MySQL database in terms of character encoding

2012-04-01 Thread Marcin Cieslak
Piotr Jagielski piotr.jagiel...@op.pl wrote:
> Hello,
>
> set my data source URL to the following in my Java code:
> jdbc:mysql://localhost/plwiki?useUnicode=true&characterEncoding=UTF-8

Please note that you have plwiki here but imported into wiki.
Assuming your .my.cnf is not making things difficult, I ran a small
Jython script to test:

$ jython
Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
[OpenJDK 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0
Type "help", "copyright", "credits" or "license" for more information.
>>> from com.ziclix.python.sql import zxJDBC
>>> d, u, p, v = "jdbc:mysql://localhost/wiki", "root", None, "org.gjt.mm.mysql.Driver"
>>> db = zxJDBC.connect(d, u, p, v, CHARSET="utf8")
>>> c = db.cursor()
>>> c.execute("select cl_from, cl_to from categorylinks where cl_from=61 limit 10")
>>> c.fetchone()
(61, array('b', [65, 110, 100, 111, 114, 97]))
>>> (a, b) = c.fetchone()
>>> print b
array('b', [67, 122, -59, -126, 111, 110, 107, 111, 119, 105, 101, 95, 79, 114,
103, 97, 110, 105, 122, 97, 99, 106, 105, 95, 78, 97, 114, 111, 100, -61, -77,
119, 95, 90, 106, 101, 100, 110, 111, 99, 122, 111, 110, 121, 99, 104])
>>> for x in b:
...     try:
...         print chr(x),
...     except ValueError:
...         print "%02x" % x,
...
C z -3b -7e o n k o w i e _ O r g a n i z a c j i _ N a r o d -3d -4d w _ Z j e
d n o c z o n y c h

array('b', [ ... ]) in Jython means that the SQL driver returns an array of bytes.

It seems to me that the array of bytes contains raw UTF-8, so you need to decode
it into the proper Unicode that Java uses in strings.

I think this behaviour is described in

http://bugs.mysql.com/bug.php?id=25528

You probably need to play with getBytes() on the result object
to get what you want.
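
A minimal sketch of that decoding step, written in plain Python 3 rather than
Jython (an assumption; the session above runs Jython 2.5). It takes the signed
byte values printed in the session, maps them to unsigned bytes, and decodes
them as UTF-8. The names signed_bytes, raw and title are illustrative only.

```python
from array import array

# Signed byte values exactly as printed in the Jython session above.
signed_bytes = array('b', [
    67, 122, -59, -126, 111, 110, 107, 111, 119, 105, 101, 95, 79, 114,
    103, 97, 110, 105, 122, 97, 99, 106, 105, 95, 78, 97, 114, 111, 100,
    -61, -77, 119, 95, 90, 106, 101, 100, 110, 111, 99, 122, 111, 110,
    121, 99, 104,
])

# Java bytes are signed (-128..127); masking with 0xFF gives the
# unsigned value (0..255) that the byte actually represents.
raw = bytes(b & 0xFF for b in signed_bytes)

# Interpret the raw bytes as UTF-8 to obtain a proper Unicode string.
title = raw.decode('utf-8')
print(title)
```

In Java, the corresponding step would presumably be to build a String from
the byte array with an explicit "UTF-8" charset instead of relying on the
driver's conversion.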

//Saper




Re: [Wikitech-l] correct way to import SQL dumps into MySQL database in terms of character encoding

2012-04-01 Thread Piotr Jagielski
Sorry, I made a mistake in the e-mail. I had the database set to the
same name in both places.


My problem is actually the opposite: I don't get any results when I
use a UTF-8 string as input to the query. But I verified that I don't
get correct results with the query you provided either. The link
to the MySQL bug report might help in resolving the problem, so
thanks for providing it.
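
One possible cause of the empty result (an assumption, not something
confirmed in this thread): if cl_to is a binary column, as I believe it is
in the Wikimedia dumps, the comparison is byte-for-byte, so the query string
only matches when the client sends the same UTF-8 bytes that were imported.
The Python 3 sketch below, with illustrative names, compares the bytes for
one plwiki category title under two connection encodings.

```python
# A category title of the kind stored in cl_to (hypothetical query input).
title = "Języki_skryptowe"

# Bytes stored in the table if the dump was imported as UTF-8.
stored = title.encode("utf-8")

# Bytes a client might send if its connection charset were latin1:
# "ę" has no latin1 code point, so here it degrades to "?".
sent_latin1 = title.encode("latin-1", errors="replace")

# Bytes sent when the connection charset really is UTF-8.
sent_utf8 = title.encode("utf-8")

print(stored)
print(sent_latin1)
print("latin1 matches:", stored == sent_latin1)
print("utf8 matches:  ", stored == sent_utf8)
```

If the two byte strings differ like this, a WHERE clause on cl_to would find
nothing even though the row is stored correctly.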


Piotr



Re: [Wikitech-l] correct way to import SQL dumps into MySQL database in terms of character encoding

2012-04-01 Thread Platonides
On 01/04/12 17:37, Piotr Jagielski wrote:
> I don't have MediaWiki installed. I'm just trying to import the dump
> into a standalone database so I can do some batch processing on the data.
>
> Regards,
> Piotr

It inserts the data fine for me. I suspect your Java code is failing to
read it appropriately. Try reading the table with a different tool,
such as phpMyAdmin.

mysql> select * from categorylinks limit 20;
+---------+-------------------------------------+-------------------------------------+---------------------+-------------------+--------------+---------+
| cl_from | cl_to                               | cl_sortkey                          | cl_timestamp        | cl_sortkey_prefix | cl_collation | cl_type |
+---------+-------------------------------------+-------------------------------------+---------------------+-------------------+--------------+---------+
|       0 | Ekspresowe_kasowanko                | Golembiovski Andzey                 | 2009-07-09 21:01:30 |                   |              | page    |
|       2 | Języki_skryptowe                    | AWK
AWK                             | 2011-01-18 01:11:23 | Awk               | uppercase    | page    |
|       4 | Specjalności_lekarskie              | ALERGOLOGIA                         | 2008-04-25 10:31:22 |                   | uppercase    | page    |
|       6 | Formaty_plików_komputerowych        | ASCII                               | 2011-09-23 11:01:05 |                   | uppercase    | page    |
|       6 | Kodowania_znaków                    | ASCII                               | 2011-09-23 11:01:05 |                   | uppercase    | page    |
|       7 | Artykuły_na_medal                   | ATOM                                | 2010-12-01 16:40:37 |                   | uppercase    | page    |
|       7 | Artykuły_wymagające_dopracowania    | ATOM                                | 2011-08-16 15:53:43 |                   | uppercase    | page    |
|       7 | Atomy                               |  
ATOM                              | 2011-08-09 00:56:39 |                   | uppercase    | page    |
|       8 | Logika_matematyczna                 | AKSJOMAT                            | 2007-11-10 08:18:06 |                   | uppercase    | page    |
|      10 | Arytmetyka                          |  
ARYTMETYKA                        | 2011-10-17 02:36:39 |                   | uppercase    | page    |
|      11 | Artykuły_pod_opieką_Projektu_Chemia | AMINOKWASY                          | 2011-08-19 02:48:21 |                   | uppercase    | page    |
|      12 | Alkeny                              | *
ALKENY                            | 2006-08-07 17:23:22 | *                 | uppercase    | page    |
|      13 | Multimedia                          | ACTIVEX                             | 2007-05-24 20:20:15 |                   | uppercase    | page    |
|      13 | Windows                             | ACTIVEX                             | 2007-05-24 20:20:15 |                   | uppercase    | page    |
|      14 | Interfejsy_programistyczne          | !
APPLICATION PROGRAMMING INTERFACE | 2011-04-27 11:33:17 | !                 | uppercase    | page    |
|      15 | Amiga                               | AMIGAOS                             | 2007-09-09 17:19:11 |                   | uppercase    | page    |
|      15 | Systemy_operacyjne                  | AMIGAOS                             | 2007-09-09 17:19:11 |                   | uppercase    | page    |
|      16 | Organizacje_międzynarodowe          | ASSOCIATION FOR COMPUTING MACHINERY | 2011-10-19 15:52:28 |                   | uppercase    | page    |
|      18 | Funkcje_boolowskie                  | ALTERNATYWA                         | 2007-03-23 17:43:05 |                   | uppercase    | page    |
|      19 | Logika_matematyczna                 | AKSJOMAT INDUKCJI                   | 2007-08-31 22:54:55 |                   | uppercase    | page    |
+---------+-------------------------------------+-------------------------------------+---------------------+-------------------+--------------+---------+
20 rows in set (0.00 sec)

