Re: [Wikitech-l] garbage characters show up when fetching wikimedia api

2016-05-07 Thread MZMcBride
MZMcBride wrote:
>The error you're getting generally means that the JSON was malformed for
>some reason. It seems unlikely that MediaWiki's api.php is outputting
>invalid JSON, but I suppose it's possible.

I left a note on the Phabricator task that Marius linked to:
<https://phabricator.wikimedia.org/T133866>.

It seems api.php end-points really are outputting garbage characters in
some cases, though it remains unclear which layer is to blame. :-/

MZMcBride



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] garbage characters show up when fetching wikimedia api

2016-05-06 Thread Marius Hoch

Hi,

that sounds like https://phabricator.wikimedia.org/T133866.

Cheers

Marius

On 05.05.2016 21:56, Trung Dinh wrote:

Hi all,
I have an issue when trying to parse data fetched from the Wikipedia API.
This is the piece of code that I am using:

api_url = 'http://en.wikipedia.org/w/api.php'
api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart=20160504022715'

f = urllib2.Request(api_url, api_params)
print ('requesting ' + api_url + '?' + api_params)
source = urllib2.urlopen(f, None, 300).read()
source = json.loads(source)

json.loads(source) raised the following exception: "Expecting , delimiter:
line 1 column 817105 (char 817104)"

I tried to use source.encode('utf-8') and some other encodings but none of
them helped.
Do we have any workaround for this issue? Thanks :)




Re: [Wikitech-l] garbage characters show up when fetching wikimedia api

2016-05-05 Thread Trung Dinh
Guys, 

Thanks so much for your prompt feedback.
Basically, what I am doing is to keep sending requests based on date and
time until we reach another day.
Specifically, what I have is something like:

api_url = 'http://en.wikipedia.org/w/api.php'
date = '20160504022715'

while True:
    api_params = ('action=query&list=recentchanges&rclimit=5000&rctype=edit'
                  '&rcnamespace=0&rcdir=newer&format=json'
                  '&rcstart={date}'.format(date=date))
    f = urllib2.Request(api_url, api_params)
    source = urllib2.urlopen(f, None, 300).read()
    source = json.loads(source)
    # increase date

Given the above code, I am encountering a weird situation. In the query,
if I set rclimit to 500 then it runs normally. However, if I set rclimit
to 5000 as in my previous email, I see the error. I know that for
recentchanges rclimit should be set to 500. But is there anything
particular about the value of rclimit that could lead to the broken
JSON?
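For what it's worth, the API's recommended way to page through a long result set is to echo back the continuation token it returns, rather than advancing the timestamp by hand. Here is a rough sketch of that loop in Python 3; the fake_fetch stub and its canned responses are invented for illustration (a real run would use the urllib2/urllib request instead), and it assumes the API's flat "continue" response format:

```python
import json

def fake_fetch(params):
    # Stand-in for the real HTTP request; returns canned JSON shaped
    # like action=query&list=recentchanges responses.
    if 'rccontinue' not in params:
        return json.dumps({
            'continue': {'rccontinue': '20160504022715|1', 'continue': '-||'},
            'query': {'recentchanges': [{'title': 'Page A'}]},
        })
    return json.dumps({
        'query': {'recentchanges': [{'title': 'Page B'}]},
    })

def fetch_all():
    params = {'action': 'query', 'list': 'recentchanges',
              'rclimit': 500, 'format': 'json'}
    changes = []
    while True:
        data = json.loads(fake_fetch(params))
        changes.extend(data['query']['recentchanges'])
        if 'continue' not in data:
            break  # no more pages
        params.update(data['continue'])  # carry rccontinue into the next request
    return changes

print([c['title'] for c in fetch_all()])
```

With continuation, an rclimit of 500 per request still covers the full range, which sidesteps the 500-vs-5000 question entirely.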

On 5/5/16, 11:16 PM, "Wikitech-l on behalf of MZMcBride"

wrote:

>Trung Dinh wrote:
>>Hi all,
>>I have an issue when trying to parse data fetched from the Wikipedia API.
>>This is the piece of code that I am using:
>>api_url = 'http://en.wikipedia.org/w/api.php'
>>api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart=20160504022715'
>>
>>f = urllib2.Request(api_url, api_params)
>>print ('requesting ' + api_url + '?' + api_params)
>>source = urllib2.urlopen(f, None, 300).read()
>>source = json.loads(source)
>>
>>json.loads(source) raised the following exception: "Expecting ,
>>delimiter: line 1 column 817105 (char 817104)"
>>
>>I tried to use source.encode('utf-8') and some other encodings but none
>>of them helped.
>>Do we have any workaround for this issue? Thanks :)
>
>Hi.
>
>Weird, I can't reproduce this error. I had to import the "json" and
>"urllib2" modules, but after doing so, executing the code you provided
>here worked fine for me.
>
>You probably want to use 'https://en.wikipedia.org/w/api.php' as your
>end-point (HTTPS, not HTTP).
>
>As far as I know, JSON is always encoded as UTF-8, so you shouldn't need
>to encode or decode the data explicitly.
>
>The error you're getting generally means that the JSON was malformed for
>some reason. It seems unlikely that MediaWiki's api.php is outputting
>invalid JSON, but I suppose it's possible.
>
>Since you're coding in Python, you may be interested in a framework such
>as .
>
>MZMcBride
>
>
>



Re: [Wikitech-l] garbage characters show up when fetching wikimedia api

2016-05-05 Thread Antoine Musso
On 05/05/2016 at 21:56, Trung Dinh wrote:
> I have an issue when trying to parse data fetched from the Wikipedia API.
> This is the piece of code that I am using:
> api_url = 'http://en.wikipedia.org/w/api.php'
> api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart=20160504022715'
> 
> f = urllib2.Request(api_url, api_params)
> print ('requesting ' + api_url + '?' + api_params)
> source = urllib2.urlopen(f, None, 300).read()
> source = json.loads(source)
> 
> json.loads(source) raised the following exception: "Expecting , delimiter:
> line 1 column 817105 (char 817104)"
> 
> I tried to use source.encode('utf-8') and some other encodings but none of
> them helped.
> Do we have any workaround for this issue? Thanks :)

The error is due to the response not being valid JSON.

Can you have your script write the failing content to a file and share
it somewhere? For example via https://phabricator.wikimedia.org/file/upload/
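A minimal sketch of how to capture that, in Python 3 (the helper name and dump filename here are mine, purely illustrative):

```python
import json

def parse_or_dump(raw_bytes, dump_path='bad_response.json'):
    """Try to parse an API response; on failure, save the raw bytes
    to disk so the broken payload can be inspected or uploaded."""
    try:
        return json.loads(raw_bytes.decode('utf-8'))
    except (UnicodeDecodeError, ValueError) as exc:
        with open(dump_path, 'wb') as f:
            f.write(raw_bytes)  # keep the exact bytes, not a re-encoding
        raise RuntimeError('unparseable response saved to ' + dump_path) from exc

print(parse_or_dump(b'{"ok": true}'))
```

Dumping the exact bytes (rather than a decoded string) matters here, since the suspected corruption may not survive a decode/re-encode round trip.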

There is a very thin chance that the servers/caches actually garble the
tail of some content.  I have seen some related discussion about this
earlier this week.


-- 
Antoine "hashar" Musso



Re: [Wikitech-l] garbage characters show up when fetching wikimedia api

2016-05-05 Thread Brad Jorsch (Anomie)
On Thu, May 5, 2016 at 6:16 PM, MZMcBride  wrote:

> The error you're getting generally means that the JSON was malformed for
> some reason. It seems unlikely that MediaWiki's api.php is outputting
> invalid JSON, but I suppose it's possible.
>

There is https://phabricator.wikimedia.org/T132159 along those lines,
although it's not an API issue.

I note that the reported issue is with list=recentchanges, the output of
which (even at a constant timestamp offset) could easily change with page
deletion or revdel.


-- 
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation

Re: [Wikitech-l] garbage characters show up when fetching wikimedia api

2016-05-05 Thread MZMcBride
Trung Dinh wrote:
>Hi all,
>I have an issue when trying to parse data fetched from the Wikipedia API.
>This is the piece of code that I am using:
>api_url = 'http://en.wikipedia.org/w/api.php'
>api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart=20160504022715'
>
>f = urllib2.Request(api_url, api_params)
>print ('requesting ' + api_url + '?' + api_params)
>source = urllib2.urlopen(f, None, 300).read()
>source = json.loads(source)
>
>json.loads(source) raised the following exception: "Expecting ,
>delimiter: line 1 column 817105 (char 817104)"
>
>I tried to use source.encode('utf-8') and some other encodings but none
>of them helped.
>Do we have any workaround for this issue? Thanks :)

Hi.

Weird, I can't reproduce this error. I had to import the "json" and
"urllib2" modules, but after doing so, executing the code you provided
here worked fine for me.

You probably want to use 'https://en.wikipedia.org/w/api.php' as your
end-point (HTTPS, not HTTP).

As far as I know, JSON is always encoded as UTF-8, so you shouldn't need
to encode or decode the data explicitly.

The error you're getting generally means that the JSON was malformed for
some reason. It seems unlikely that MediaWiki's api.php is outputting
invalid JSON, but I suppose it's possible.

Since you're coding in Python, you may be interested in a framework such
as .

MZMcBride




[Wikitech-l] garbage characters show up when fetching wikimedia api

2016-05-05 Thread Trung Dinh
Hi all,
I have an issue when trying to parse data fetched from the Wikipedia API.
This is the piece of code that I am using:

api_url = 'http://en.wikipedia.org/w/api.php'
api_params = 'action=query&list=recentchanges&rclimit=5000&rctype=edit&rcnamespace=0&rcdir=newer&format=json&rcstart=20160504022715'

f = urllib2.Request(api_url, api_params)
print ('requesting ' + api_url + '?' + api_params)
source = urllib2.urlopen(f, None, 300).read()
source = json.loads(source)

json.loads(source) raised the following exception: "Expecting , delimiter:
line 1 column 817105 (char 817104)"

I tried to use source.encode('utf-8') and some other encodings but none of
them helped.
Do we have any workaround for this issue? Thanks :)
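[Archive note: for later readers, a Python 3 translation of the snippet above, using urllib.parse.urlencode to build the query string instead of concatenating it by hand, and the HTTPS end-point suggested later in the thread. The parameter names come from the query string; the network call itself is left commented out so the sketch stands alone.]

```python
import urllib.parse

# Build the recentchanges query string safely.
api_url = 'https://en.wikipedia.org/w/api.php'
api_params = urllib.parse.urlencode({
    'action': 'query',
    'list': 'recentchanges',
    'rclimit': 5000,
    'rctype': 'edit',
    'rcnamespace': 0,
    'rcdir': 'newer',
    'format': 'json',
    'rcstart': '20160504022715',
})
url = api_url + '?' + api_params
print(url)

# The fetch itself would then be:
# import json, urllib.request
# source = json.loads(urllib.request.urlopen(url, timeout=300).read().decode('utf-8'))
```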