Re: [Labs-l] Scripts which adds template to articles created by ContentTranslation tool do not work on the grid

2017-06-17 Thread Martin Urbanec
Thank you all! I've added 'export LC_ALL=en_US.UTF-8' to my bash launch
script, and everything now works correctly.

Best,
Martin


Re: [Labs-l] Scripts which adds template to articles created by ContentTranslation tool do not work on the grid

2017-06-17 Thread Merlijn van Deen (valhallasw)
Hi all,

This is a combination of a Python 3 design choice (PEP 383 [1]) and T60784
[2]. What happens is the following:

1) The locale is set to an encoding that cannot decode certain bytes -- for
example, ASCII, which can only decode bytes < 128.
2) Python is started with a command line parameter that contains a byte >=
128 (\x80). For example, "ř", when UTF-8 encoded, is represented by two
bytes: \xc5\x99. Both of these are > \x80, and can therefore not be
interpreted as ASCII.
3) Python 3 needs to somehow decode these bytes into a text string. But
there is no valid way to do so! Instead of complaining loudly with a
UnicodeDecodeError, Python 3 embeds the bytes as 'fake characters' (lone
surrogate code points) in the string -- as described in PEP 383.
\xc5\x99 is therefore now suddenly decoded as '\udcc5\udc99' instead of
"ř".
4) Pywikibot tries to encode these characters using utf-8, but they are
fake characters, and the .encode step blows up.

A simple way to reproduce this is the following:

valhallasw@tools-bastion-03:~/ucm$ cat test.py
import sys
encoded = sys.argv[1].encode('utf-8')

valhallasw@tools-bastion-03:~/ucm$ LC_ALL=C python3 test.py řeklad
Traceback (most recent call last):
  File "test.py", line 2, in <module>
    encoded = sys.argv[1].encode('utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 0: surrogates not allowed
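
The same mechanism can be shown without changing the locale at all. The following is only a sketch: it assumes that decoding with the 'ascii' codec and the 'surrogateescape' error handler is a fair stand-in for what Python 3 does to argv under LC_ALL=C, and the variable names are made up for illustration.

```python
# Bytes the shell passes for the argument "řeklad" when it is UTF-8 encoded.
raw = b'\xc5\x99eklad'

# PEP 383: bytes that cannot be decoded become lone surrogates U+DC80..U+DCFF.
mangled = raw.decode('ascii', 'surrogateescape')
print(ascii(mangled))

# A strict UTF-8 encode of those surrogates fails, as in the traceback above.
try:
    mangled.encode('utf-8')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc.reason)

# The same error handler reverses the mapping and restores the original bytes.
restored = mangled.encode('ascii', 'surrogateescape')
print(restored == raw)
```

Nothing is lost in the round trip; the bytes are only carried around in a form that a strict encoder refuses to write.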

This should be fixed in future Python versions (likely 3.7), when PEP 540
[3] is implemented.

As for the current situation, the simplest solution is to add 'export
LC_ALL=en_US.UTF-8' to your script, before the 'python ...' line.
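
If the launch script cannot be changed, the damage can in principle also be undone inside Python. This is a sketch of that alternative, not something from this thread or from pywikibot: recover_utf8 is a hypothetical helper, and it assumes the bytes originally passed on the command line were UTF-8.

```python
def recover_utf8(text):
    """Undo PEP 383 surrogate escapes, assuming the original bytes were UTF-8."""
    # Lone surrogates turn back into their original bytes; ordinary
    # characters are encoded as UTF-8 as usual.
    raw = text.encode('utf-8', 'surrogateescape')
    # Replace anything that still is not valid UTF-8 rather than raising.
    return raw.decode('utf-8', 'replace')


# Simulate an argument that was decoded under an ASCII locale.
mangled = b'\xc5\x99eklad'.decode('ascii', 'surrogateescape')
fixed = recover_utf8(mangled)
print(ascii(fixed))
```

A script could apply this to each element of sys.argv before handing the arguments on; strings that were decoded correctly pass through unchanged.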

Best,
Merlijn

[1] https://www.python.org/dev/peps/pep-0383/
[2] https://phabricator.wikimedia.org/T60784
[3] https://www.python.org/dev/peps/pep-0540/



Re: [Labs-l] Scripts which adds template to articles created by ContentTranslation tool do not work on the grid

2017-06-16 Thread Bryan Davis
On Fri, Jun 16, 2017 at 10:15 AM, Martin Urbanec wrote:
> Hello,
>
> I have a script which should add a template to articles which are created by
> the ContentTranslation tool (the template has parameters which depend on
> the language and revision that were used as the source; this is the reason
> why I use a separate script). It may be found at
> https://github.com/urbanecm/addPrekladCT/blob/master/addmissing.py. The
> script works perfectly on my local PC and on the bastion host, but I can't
> get it to work on the grid.
>
> The script itself is run by python3 addmissing.py -always -file:pages.txt
> -search:'-insource:/\{\{[Pp]řeklad/' and requires the pages.txt file and the
> preklads.txt file at
> https://tools.wmflabs.org/urbanecmbot/test/preklads.txt. The first contains
> the pages that should be processed and acts as the generator; the second is
> something like a database with the exact templates which should be inserted.
> Both files are attached as examples.
>
> When I try to run it on the toollabs bastion, all works as it should. When I
> send the script to the grid, it does not work (see sample output below). Why?
> Can somebody help me with it?
>
> Thank you in advance,
> Martin Urbanec / Urbanecm
>
> ; Output
>
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ cat test.sh
> python3 addmissing.py -always -file:pages.txt -search:'-insource:/\{\{[Pp]řeklad/'
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ jsub bash test.sh
> Your job 6201363 ("bash") has been submitted
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ qstat
> job-ID   prior  name  user      state  submit/start at      queue                           slots  ja-task-ID
> ------------------------------------------------------------------------------------------------------------
> 6201363  0.3    bash  urbanecm  r      06/16/2017 18:14:42  t...@tools-exec-1404.eqiad.wmf  1
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ ls ~/bash.*
> /home/urbanecm/bash.err  /home/urbanecm/bash.out
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $ cat ~/bash.*
> Traceback (most recent call last):
>   File "addmissing.py", line 223, in <module>
>     main()
>   File "addmissing.py", line 183, in main
>     local_args = pywikibot.handle_args(args)
>   File "/shared/pywikipedia/core/pywikibot/bot.py", line 954, in handle_args
>     writeToCommandLogFile()
>   File "/shared/pywikipedia/core/pywikibot/bot.py", line 1128, in writeToCommandLogFile
>     command_log_file.write(s + os.linesep)
>   File "/usr/lib/python3.4/codecs.py", line 711, in write
>     return self.writer.write(data)
>   File "/usr/lib/python3.4/codecs.py", line 368, in write
>     data, consumed = self.encode(object, self.errors)
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 67: surrogates not allowed
> CRITICAL: Closing network session.
>
> urbanecm@tools-bastion-02 ~/Documents/cswiki/addPrekladCT
> $

Zhuyifei1999 saw your email and noted on irc that it looks to be a
case of the known bug that I just retitled as "Shell LOCALE neither
consistent nor sane across grid engine nodes". The current best
workaround for that bug is to launch the job as a shell script that sets
either LANG=C.UTF-8 or PYTHONIOENCODING=utf-8.
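
To see whether the bastion and a grid node really differ, it may help to print the relevant settings from inside Python in both places. The sketch below is only a diagnostic aid; encoding_report is a made-up name, and the choice of what to report is an assumption about what matters here.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the settings that decide how Python decodes argv and encodes output."""
    return {
        'LANG': os.environ.get('LANG'),
        'LC_ALL': os.environ.get('LC_ALL'),
        'PYTHONIOENCODING': os.environ.get('PYTHONIOENCODING'),
        'preferred': locale.getpreferredencoding(False),
        'filesystem': sys.getfilesystemencoding(),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


for name, value in sorted(encoding_report().items()):
    print('{}: {}'.format(name, value))
```

Running it once interactively and once through jsub, then comparing the two outputs, should show which of the values changes on the grid.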

If setting the job to run with the same locale you are using in your
interactive tests does not work to fix the problem, you may also be
hitting a deeper Python3 Unicode issue related to surrogate codepoints.
This is hinted at by the "position 67: surrogates not allowed" error
message.

I can actually reproduce your error message in an interactive python
session on tools-dev from a starting state of LANG=en_US.UTF-8:

  $ python3
  Python 3.4.0 (default, Jun 19 2015, 14:20:21)
  [GCC 4.8.2] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> print('\udcc5')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 0: surrogates not allowed
  >>>
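
A string can be checked for such characters before it is written anywhere. The helper below is a hypothetical sketch (escaped_byte_positions is not part of Python or pywikibot); it relies only on the fact that PEP 383 maps an undecodable byte 0xNN to the code point U+DCNN.

```python
def escaped_byte_positions(text):
    """Return (index, byte value) for every PEP 383 surrogate escape in text."""
    return [(i, ord(ch) - 0xDC00)
            for i, ch in enumerate(text)
            if 0xDC80 <= ord(ch) <= 0xDCFF]


sample = 'P\udcc5\udc99eklad'
hits = escaped_byte_positions(sample)
for index, byte in hits:
    print('index {}: byte 0x{:02x}'.format(index, byte))
```

An empty result means the string contains no escaped bytes and a strict encode will not fail for this reason.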

Explicitly encoding using 'surrogateescape' does work:
  >>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
  b'\xc5'

It looks like the error could be dealt with in pywikibot by patching
writeToCommandLogFile() to open the codec used for output with any
value other than the default errors='strict'.

  $ python3
  Python 3.4.0 (default, Jun 19 2015, 14:20:21)
  [GCC 4.8.2] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> print('\udcc5'.encode('utf-8', 'ignore'))
  b''
  >>> print('\udcc5'.encode('utf-8', 'replace'))
  b'?'
  >>> print('\udcc5'.encode('utf-8', 'xmlcharrefreplace'))
  b'&#56517;'
  >>> print('\udcc5'.encode('utf-8', 'backslashreplace'))
  b'\\udcc5'
  >>> print('\udcc5'.encode('utf-8', 'surrogateescape'))
  b'\xc5'
  >>> print('\udcc5'.enc
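
For the log file case specifically, the idea might look roughly like the sketch below. It is not pywikibot's code: append_log_line and the file name are invented for illustration, and the built-in open() stands in for whatever pywikibot uses to open its command log.

```python
import os
import tempfile


def append_log_line(path, line):
    """Append one line to a UTF-8 log file without failing on lone surrogates."""
    # errors='backslashreplace' writes an unencodable character as a literal
    # escape such as \udcc5 instead of raising UnicodeEncodeError.
    with open(path, 'a', encoding='utf-8', errors='backslashreplace') as log:
        log.write(line + '\n')


log_path = os.path.join(tempfile.mkdtemp(), 'commands.log')
append_log_line(log_path, "-search:'\udcc5\udc99eklad'")

with open(log_path, 'rb') as log:
    written = log.read()
print(written)
```

'backslashreplace' is chosen here because it keeps the problem visible in the log; 'surrogateescape' would write the original bytes back instead.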

Re: [Labs-l] Scripts which adds template to articles created by ContentTranslation tool do not work on the grid

2017-06-16 Thread Martin Urbanec
I can't, the script is Python3 only.

Martin

On Fri, 16 Jun 2017 at 19:04, MarcoAurelio wrote:

> It's been some time since I was last on the list, and I've surely missed
> important announcements, but last time I checked Python 3 was still not
> supported by Labs. I mean, the shared pywikibot files were still in Python
> 2. Not sure if this is still the case, but could you please try to run it on
> the grid with python instead of python3 and see if that solves the issue?
> Regards.
___
Labs-l mailing list
Labs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/labs-l


Re: [Labs-l] Scripts which adds template to articles created by ContentTranslation tool do not work on the grid

2017-06-16 Thread MarcoAurelio
It's been some time since I was last on the list, and I've surely missed
important announcements, but last time I checked Python 3 was still not
supported by Labs. I mean, the shared pywikibot files were still in Python
2. Not sure if this is still the case, but could you please try to run it on
the grid with python instead of python3 and see if that solves the issue?
Regards.

-- 
M. A.