[Bug 67117] Performance review of PubSubHubbub extension

2014-09-17 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Dimitris Kontokostas jimk...@gmail.com changed:

   What|Removed |Added

 CC||jimk...@gmail.com

--- Comment #8 from Dimitris Kontokostas jimk...@gmail.com ---
Without looking at the code we had a few comments on this extension that also
relate to performance [1] [2].

The basic idea is whether the client (wiki mirror) can delay the poll (e.g. by
5-10 minutes) and aggregate multiple edits of the same page into one. OAI had
native support for this.

The reason is that in DBpedia Live we had a lot of scaling issues when the
database was under constant update (even with edits aggregated over a 5-10
minute poll delay) while using the MW API in parallel to get data for
extraction.

I don't know if this is supported by the PubSubHubbub spec, but the client
could do some aggregation manually.
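The manual aggregation described above could be sketched on the client side roughly as follows. This is a minimal, hypothetical Python sketch (the field names are illustrative, not any real API): collect notifications over the poll window, then keep only the latest revision per title so repeated edits of the same page collapse into one fetch.

```python
from collections import OrderedDict

def aggregate_edits(edits):
    """Collapse multiple edits of the same page into one, keeping only the
    latest revision per title. `edits` is assumed ordered oldest -> newest,
    e.g. everything gathered during a 5-10 minute poll delay."""
    latest = OrderedDict()
    for edit in edits:
        latest[edit["title"]] = edit  # a later edit overwrites an earlier one
    return list(latest.values())

# Three notifications, two of them for the same page:
batch = [
    {"title": "Berlin", "rev": 100},
    {"title": "Paris", "rev": 101},
    {"title": "Berlin", "rev": 102},
]
merged = aggregate_edits(batch)  # one entry per title, latest rev wins
```

Only `merged` (two titles instead of three notifications) would then be fed to the MW API for extraction.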

We also have a question about hub data persistence, but we are not sure which
thread is more appropriate for it.

[1] https://lists.wikimedia.org/pipermail/wikidata-tech/2014-July/000533.html
[2] https://lists.wikimedia.org/pipermail/wikidata-tech/2014-July/000539.html

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 67117] Performance review of PubSubHubbub extension

2014-09-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Aaron Schulz aschulz4...@gmail.com changed:

   What|Removed |Added

   Assignee|aschulz4...@gmail.com   |wikibugs-l@lists.wikimedia.
   ||org



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #6 from Brad Jorsch bjor...@wikimedia.org ---
I tested it before posting comment 4, just to be sure. It does.

You can now test it yourself. Go to
http://tools.wmflabs.org/oauth-hello-world/index.php?action=authorize, then go
to http://tools.wmflabs.org/oauth-hello-world/index.php?action=testspecial
which will dump the response received when hitting Special:MyPage.



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #7 from Daniel Kinzler daniel.kinz...@wikimedia.de ---
The updates use the XML dump format. That's not really compatible with the way
the API structures XML output, and if JSON is requested from the API, the dump
would have to be encoded as a single string value. That's really annoying.

Also, API modules generally shouldn't mess with HTTP headers, which PuSH
requires, as far as I understand.

A special page seems to fit the bill much better on the technical level, though
it's a bit awkward conceptually. There are several special pages that generate
non-HTML output, though.

Hm, perhaps the best way to implement this would be a custom action?...



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-15 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #5 from Aaron Schulz aschulz4...@gmail.com ---
I don't think it works on special pages. I asked Chris about that a few days
back.



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-14 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #4 from Brad Jorsch bjor...@wikimedia.org ---
(In reply to Aaron Schulz from comment #3)
 * It's not a huge deal, but it would be nice to use the API (which also
 works with things like OAuth if we wanted hubs with more access). Brad might
 have an opinion on special page vs API. I don't feel strongly, but we should
 get it right given the Link header and cache interaction.

OAuth works with special pages too.

It sounds like the general idea is that the extension could be configured for
general users to use? Then an API endpoint wouldn't necessarily be out of
place. On the other hand, is there anything wrong with the existing special
page implementation? If not, is there a reason not to just stay with that?



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-11 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Aaron Schulz aschulz4...@gmail.com changed:

   What|Removed |Added

 CC||bjor...@wikimedia.org

--- Comment #3 from Aaron Schulz aschulz4...@gmail.com ---
* The delay could in theory be based on the current max slave lag...but a
simple approach would be to just pick a value. 10 seconds would be very safe
under non-broken circumstances.
* It's not a huge deal, but it would be nice to use the API (which also works
with things like OAuth if we wanted hubs with more access). Brad might have an
opinion on special page vs API. I don't feel strongly, but we should get it
right given the Link header and cache interaction.
* You could still use the recent changes table to reduce the number of jobs.
There could be at most one de-duplicated job per hub that would grab all title
changes since the last run and send them to the hub. When the job succeeds (no
HTTP errors posting the changed URIs), it could bump the timestamp, which could
live in a simple DB table. The job could be delayed (maybe by 5 minutes) to
give it time to cover a larger range. The range could be from the last run to
the present (or a smaller range to limit the number of items, especially for
the new-hub case). It could hopefully batch many titles into one or a few HTTP
requests (or use pipelining, or at least some curl_multi).
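The time-cursor logic in the last bullet might look roughly like this. A hedged Python sketch with hypothetical helpers (`fetch_changes` and `post_batch` stand in for the recentchanges range query and the HTTP POST to the hub); the cursor only advances when the POST succeeds, so a failed push is retried over the same range on the next run.

```python
import time

def run_hub_job(hub, cursors, fetch_changes, post_batch):
    """One de-duplicated job per hub: push all changes since the stored
    cursor in a single batch, and bump the cursor only on success."""
    since = cursors.get(hub, 0.0)        # last successfully pushed position
    now = time.time()
    changes = fetch_changes(since, now)  # e.g. a recentchanges range query
    if post_batch(hub, changes):         # True iff no HTTP errors occurred
        cursors[hub] = now               # advance the cursor
    return changes

# Toy run with fakes: one hub accepts the batch, one rejects it.
cursors = {}
fetch = lambda since, until: ["Title_A", "Title_B", "Title_C"]
run_hub_job("good-hub", cursors, fetch, lambda hub, batch: True)
run_hub_job("bad-hub", cursors, fetch, lambda hub, batch: False)
# "good-hub" now has a cursor; "bad-hub" will retry the same range next run.
```

Supporting another hub then only means another cursor row, not N more jobs per title.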



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #2 from Alexander Lehmann 
alexander.lehm...@student.hpi.uni-potsdam.de ---
Hi Aaron.

Thanks for your notes. Some questions about them are inline.

(In reply to Aaron Schulz from comment #1)
 A few things:
 * This seems to be missing an ArticleRevisionVisibilitySet handler
 * http_post() needs to handle $wgHTTPProxy
 * The maximum jobs attempts for the queue will have to be set very high to
 avoid update losses (maxTries)
 * NewRevisionFromEditComplete and other hooks trigger before COMMIT so the
 jobs should probably be delayed (using 'jobReleaseTimestamp'). Moving them
 post-COMMIT is not an option since the network partition could cause nothing
 to be enqueued (the reverse, a job and no COMMIT, is wasteful but harmless).

I'm not sure I understood the problem correctly. How should I set the length
of the delay? Should the delay adjust dynamically, or should we set a fixed
period of time?

 * We tend to run lots of jobs for one wiki at a time. http_post() could
 benefit from some sort of singleton on the curl handle instead of closing it
 each time. See
 http://stackoverflow.com/questions/972925/persistent-keepalive-http-with-the-
 php-curl-library.
 * createLastChangesOutput() should use LIMIT+DISTINCT instead of a break
 statement. Also, I'm not sure how well that works. There can only be one job
 for hitting the URL that returns this result in the queue, but it only does
 the last 60 seconds of changes. Also, it selects rc_timestamp but does not
 use it now. Is it OK if the Hub misses a bunch of changes from this (e.g.
 are the per-Title jobs good enough)?
 * It's curious that the hub is supposed to talk back to a special page, why
 not an API page instead? 

The extension provides data in the MediaWiki XML export format; I don't know if
the API is intended for that format. Does it make a real difference compared to
using a special page?

 * The Link headers also go there. What is the use of these? Also, since
 they'd take 30 days to apply to all pages (the varnish cache TTL), it would
 be a pain to change them. They definitely need to be stable.
 

They need to be there according to the PubSubHubbub specification [1],
but they are stable.
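For reference, the discovery headers the PubSubHubbub 0.4 specification requires a publisher to expose on the topic URL look like the following (the URLs here are placeholders, not the extension's actual endpoints):

```
Link: <https://hub.example.org/>; rel="hub"
Link: <https://wiki.example.org/wiki/Special:PubSubHubbub/Some_Page>; rel="self"
```

Since subscribers resubscribe against whatever rel="self" advertises, these headers indeed need to stay stable.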

 Come to think of it, it seems like we need to send the following events to
 the hub:
 * New revisions
 * Page (un)deletions
 * Revision (un)deletions
 * Page moves
 All of the above leave either edit or log entries in recent changes. Page
 moves only leave one at the old title...though rc_params can be inspected to
 get the new title. I wonder if instead of a job per title if there can
 instead be a single job that sends all changes since the last update time
 and updates the last update time on success. The advantages would be:
 a) Far fewer jobs needed
 b) All updates would be batched
 c) Supporting more hubs is easier since only another job and time position
 is needed (rather than N jobs for each hub for each title)
 Of course I may have missed something.

Unfortunately, there is no way to identify requests from the hubs.
Theoretically, anyone can call the PubSubHubbub export interface, so we do not
know the last update time. The hub cannot provide the time, because the
resource is identified only by its exact URL.



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Andre Klapper aklap...@wikimedia.org changed:

   What|Removed |Added

 CC||alexander.lehmann@student.h
   ||pi.uni-potsdam.de,
   ||sebastian.brueckner@student
   ||.hpi.uni-potsdam.de
  Component|WikidataRepo|PubSubHubbub



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-08 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

--- Comment #1 from Aaron Schulz aschulz4...@gmail.com ---
A few things:
* This seems to be missing an ArticleRevisionVisibilitySet handler
* http_post() needs to handle $wgHTTPProxy
* The maximum jobs attempts for the queue will have to be set very high to
avoid update losses (maxTries)
* NewRevisionFromEditComplete and other hooks trigger before COMMIT so the jobs
should probably be delayed (using 'jobReleaseTimestamp'). Moving them
post-COMMIT is not an option since the network partition could cause nothing to
be enqueued (the reverse, a job and no COMMIT, is wasteful but harmless).
* We tend to run lots of jobs for one wiki at a time. http_post() could benefit
from some sort of singleton on the curl handle instead of closing it each time.
See
http://stackoverflow.com/questions/972925/persistent-keepalive-http-with-the-php-curl-library.
* createLastChangesOutput() should use LIMIT+DISTINCT instead of a break
statement. Also, I'm not sure how well that works. There can only be one job
for hitting the URL that returns this result in the queue, but it only does the
last 60 seconds of changes. Also, it selects rc_timestamp but does not use it
now. Is it OK if the Hub misses a bunch of changes from this (e.g. are the
per-Title jobs good enough)?
* It's curious that the hub is supposed to talk back to a special page; why not
an API page instead?
* The Link headers also go there. What is the use of these? Also, since they'd
take 30 days to apply to all pages (the varnish cache TTL), it would be a pain
to change them. They definitely need to be stable.
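The handle-reuse idea from the http_post() bullet, sketched in Python rather than PHP for brevity: cache one keep-alive connection per host and hand it to every job, instead of opening and closing a connection per POST. This is a hypothetical helper, not the extension's actual code.

```python
import http.client

_connections = {}  # host -> cached keep-alive connection (the "singleton")

def get_connection(host):
    """Return a reusable HTTPS connection for `host`, creating it at most
    once per process; analogous to keeping one curl handle open."""
    if host not in _connections:
        _connections[host] = http.client.HTTPSConnection(host, timeout=10)
    return _connections[host]

# Every POST to the same hub now shares one handle:
conn_a = get_connection("hub.example.org")
conn_b = get_connection("hub.example.org")
assert conn_a is conn_b  # same handle, no reconnect per request
```

With many jobs running per wiki, avoiding a TCP (and TLS) handshake per POST is where the win comes from.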

Come to think of it, it seems like we need to send the following events to the
hub:
* New revisions
* Page (un)deletions
* Revision (un)deletions
* Page moves
All of the above leave either edit or log entries in recent changes. Page moves
only leave one at the old title...though rc_params can be inspected to get the
new title. I wonder if instead of a job per title if there can instead be a
single job that sends all changes since the last update time and updates the
last update time on success. The advantages would be:
a) Far fewer jobs needed
b) All updates would be batched
c) Supporting more hubs is easier since only another job and time position is
needed (rather than N jobs for each hub for each title)
Of course I may have missed something.
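A sketch of how such a single job might classify recent-changes rows into the four event kinds listed above. Python with illustrative field names (not the actual recentchanges schema); the one concrete constraint from the comment, that a move row sits at the old title with the new title in its params, is reflected in the move branch.

```python
def classify(row):
    """Map an RC-style row to an event for the hub (illustrative fields)."""
    kind = row.get("log_type")
    if kind is None:
        return ("new_revision", row["title"])      # plain edit entry
    if kind == "move":
        # the log entry lives at the old title; the new one rides in params
        return ("page_move", row["title"], row["params"]["new_title"])
    if kind in ("delete", "undelete"):
        return ("page_" + kind, row["title"])      # page (un)deletion
    return ("other", row["title"])                 # e.g. revision visibility

# A move shows up once, at the old title:
event = classify({"title": "Old", "log_type": "move",
                  "params": {"new_title": "New"}})
# -> ("page_move", "Old", "New")
```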



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-07 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Greg Grossmeier g...@wikimedia.org changed:

   What|Removed |Added

   Priority|Unprioritized   |Normal
 CC||g...@wikimedia.org



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-07 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Greg Grossmeier g...@wikimedia.org changed:

   What|Removed |Added

 Blocks||67623



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-07 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Sam Reed (reedy) s...@reedyboy.net changed:

   What|Removed |Added

 Blocks|38970   |



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Aaron Schulz aschulz4...@gmail.com changed:

   What|Removed |Added

   Assignee|wikidata-bugs@lists.wikimed |aschulz4...@gmail.com
   |ia.org  |



[Bug 67117] Performance review of PubSubHubbub extension

2014-07-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Sam Reed (reedy) s...@reedyboy.net changed:

   What|Removed |Added

   Keywords|performance |



[Bug 67117] Performance review of PubSubHubbub extension

2014-06-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Sebastian Brückner sebastian.brueck...@student.hpi.uni-potsdam.de changed:

   What|Removed |Added

   See Also||https://bugzilla.wikimedia.
   ||org/show_bug.cgi?id=67118



[Bug 67117] Performance review of PubSubHubbub extension

2014-06-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Sebastian Brückner sebastian.brueck...@student.hpi.uni-potsdam.de changed:

   What|Removed |Added

   Keywords||performance
 CC||daniel.kinz...@wikimedia.de
   ||,
   ||lydia.pintscher@wikimedia.d
   ||e



[Bug 67117] Performance review of PubSubHubbub extension

2014-06-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=67117

Lydia Pintscher lydia.pintsc...@wikimedia.de changed:

   What|Removed |Added

 Blocks||38970
