[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes

2014-12-02 Thread daniel
daniel added a comment.

@aude: thanks for the benchmark! I'm surprised that loading the full entities 
does not have a bigger impact on memory usage. How many different entities 
were hit, and how big are these entities?

TASK DETAIL
  https://phabricator.wikimedia.org/T74309

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

To: daniel
Cc: Tobi_WMDE_SW, Liuxinyu970226, Wikidata-bugs, daniel, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes

2014-12-01 Thread daniel
daniel added a comment.

@aude: "memcached retrieval of entities is faster than sql queries of term 
table" -- I'm not sure this is true, especially for large items. I have sent 
mail to Springle and Ori asking for input. Will do some benchmarking.

Of course, pre-fetching/batching plus in-process caching is the Right Thing; 
the question at hand is whether we want to switch to table lookups NOW, to 
save memory and perhaps also time.

To: daniel
Cc: wikidata-bugs, Tobi_WMDE_SW, Liuxinyu970226, Wikidata-bugs, daniel, aude





[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes

2014-12-01 Thread aude
aude added a comment.

only one data point comparison on my dev wiki and obviously can't reproduce all 
production conditions, but:

recent changes with EntityTermLookup (30 days, 60 items) - TermSqlIndex:

65955512 memory
28786 backend response time

recent changes with EntityRetrievingTermLookup (30 days, 60 items), few cache 
misses:

70434344 memory
20670 backend response time

recent changes with EntityRetrievingTermLookup (30 days, 60 items), have 
restarted memcached so many cache misses:

70634728 memory
33539 backend response time

neither is really a good choice. TermSqlIndex queries are somewhat better 
with regard to memory usage, while EntityRetrievingTermLookup seems faster 
if cache misses are few enough.

these results are also consistent with comparisons i had made on the EntityView 
(before we had EntityInfoTermLookup).

if we can somehow profile in production (with reasonable effort), that would be 
good and interesting.

and overall, we should continue to work on batched lookup and caching in as 
many places as possible.
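
To put the three runs side by side, here is a small Python sketch that expresses each figure as a percentage change against the TermSqlIndex run. The numbers are copied from the runs above; the units are not stated in the thread, so only ratios are computed. The dictionary layout and the `percent_change` helper are illustrative names, not part of Wikibase.

```python
# Figures copied from the three runs above (units as reported, not stated).
runs = {
    "TermSqlIndex": {"memory": 65955512, "time": 28786},
    "EntityRetrievingTermLookup, few misses": {"memory": 70434344, "time": 20670},
    "EntityRetrievingTermLookup, many misses": {"memory": 70634728, "time": 33539},
}


def percent_change(value, baseline):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


baseline = runs["TermSqlIndex"]
changes = {
    name: {metric: percent_change(run[metric], baseline[metric]) for metric in run}
    for name, run in runs.items()
}

for name, change in changes.items():
    print(f"{name}: memory {change['memory']:+.1f}%, time {change['time']:+.1f}%")
```

On this single data point, entity retrieval costs roughly 7% more memory in both cases, while the response time swings from about 28% lower (few misses) to about 17% higher (many misses).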

To: aude
Cc: Tobi_WMDE_SW, Liuxinyu970226, Wikidata-bugs, daniel, aude





[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes

2014-11-30 Thread gerritbot
gerritbot added a comment.

Change 176378 abandoned by Daniel Kinzler:
Introducing RecentChangesRowsForDisplay hook.

Reason:
ChangesListInitRows exists and should do

[[https://gerrit.wikimedia.org/r/176378]]

To: gerritbot
Cc: wikidata-bugs, Tobi_WMDE_SW, Liuxinyu970226, Wikidata-bugs, daniel





[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes

2014-11-30 Thread daniel
daniel added a comment.

As Katie pointed out, the existing ChangesListInitRows hook should fit the bill 
already.
The rest of the investigation should be to outline which services/interfaces we 
need to make the pre-fetching work.

To: daniel
Cc: wikidata-bugs, Tobi_WMDE_SW, Liuxinyu970226, Wikidata-bugs, daniel





[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes

2014-11-30 Thread daniel
daniel added a comment.

Here's a rough outline of the pre-fetching infrastructure for labels and other 
Terms:

We need a TermCache service like this:

```
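interface TermCache {

    /**
     * Update terms for the given entity. Any old terms associated with the
     * entity are discarded.
     *
     * @param EntityId $entityId
     * @param Fingerprint $terms
     */
    public function updateTerms( EntityId $entityId, Fingerprint $terms );

    /**
     * Loads a set of terms into memory, for later use by getTerms().
     *
     * @param EntityId[] $entityIds
     * @param string[]|null $termTypes
     * @param string[]|null $languages
     */
    public function prefetchTerms( array $entityIds, $termTypes, $languages );

    /**
     * Get terms of the given types for the given entities, in the given
     * languages (or all languages).
     *
     * @param EntityId[] $entityIds
     * @param string[]|null $termTypes
     * @param string[]|null $languages
     */
    public function getTerms( array $entityIds, $termTypes, $languages );
}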
```

This interface is somewhat similar to the TermIndex interface. We should 
consider cleaning up TermIndex, and using that.

The TermCache makes use of a persistent cache (memcached, optionally shared 
between wikis) and local in-process caching. It would be used as follows:

  - ChangeHandler calls updateTerms() when it is notified of an Entity being 
updated. If the cache is shared, the repo instead does this immediately when 
the terms associated with an Entity are modified. In both cases, terms from the 
Fingerprint are placed into memcached using setMulti(), one entry per term. The 
cache key will contain the entity ID, term type, and language (plus possibly a 
wiki ID, version ID, etc.). Multi-value terms (aliases) are stored as a list of 
values. A special key that does not include the type or language parts is used 
to store a list of all the keys used for a given entity, so that these keys can 
be purged when updateTerms() is called again for the same entity.
  - A hook like ChangesListInitRows is used to trigger prefetchTerms(); 
prefetchTerms() uses getMulti() to fetch all the desired terms from memcached, 
and stores them locally in a hash. PROBLEM: if some terms are missing, we do 
not know whether they are uncached or do not exist. Negative caching and/or 
checking the key list should be used. TBD: Decide whether TermCache should know 
how to fetch uncached terms.
  - A hook like LinkBegin may use a LabelLookup that is based upon a TermCache 
and TermLookup to get terms associated with item pages. If the desired label 
was previously fetched via prefetchTerms(), this should be very quick.
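
The three steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Wikibase code: `FakeMemcached` is a dict-backed stand-in for a memcached client, the key format `term:<entity>:<type>:<language>` is made up for the example, and a plain dict keyed by (type, language) stands in for a Fingerprint. It uses the key-list variant to tell "uncached" apart from "does not exist".

```python
MISSING = object()  # in-process marker: term is known not to exist


class FakeMemcached:
    """Dict-backed stand-in for a memcached client (hypothetical)."""

    def __init__(self):
        self.data = {}

    def set_multi(self, mapping):
        self.data.update(mapping)

    def get_multi(self, keys):
        return {key: self.data[key] for key in keys if key in self.data}

    def delete_multi(self, keys):
        for key in keys:
            self.data.pop(key, None)


def term_key(entity_id, term_type, language):
    return f"term:{entity_id}:{term_type}:{language}"


def key_list_key(entity_id):
    # Special key without type or language: lists all term keys of the entity.
    return f"term:{entity_id}"


class TermCache:
    def __init__(self, backend):
        self.backend = backend
        self.local = {}  # in-process cache filled by prefetch_terms()

    def update_terms(self, entity_id, terms):
        """Replace all cached terms; `terms` maps (type, language) to a value."""
        list_key = key_list_key(entity_id)
        old_keys = self.backend.get_multi([list_key]).get(list_key, [])
        self.backend.delete_multi(old_keys)
        # Forget everything held in-process for this entity.
        prefix = list_key + ":"
        self.local = {k: v for k, v in self.local.items() if not k.startswith(prefix)}
        entries = {term_key(entity_id, t, lang): v for (t, lang), v in terms.items()}
        term_keys = sorted(entries)
        entries[list_key] = term_keys
        self.backend.set_multi(entries)

    def prefetch_terms(self, entity_ids, term_types, languages):
        """Fetch the wanted terms in one get_multi() and keep them in-process."""
        wanted = {
            entity_id: [term_key(entity_id, t, lang) for t in term_types for lang in languages]
            for entity_id in entity_ids
        }
        keys = [key for term_keys in wanted.values() for key in term_keys]
        keys += [key_list_key(entity_id) for entity_id in entity_ids]
        found = self.backend.get_multi(keys)
        for entity_id, term_keys in wanted.items():
            known_keys = found.get(key_list_key(entity_id))  # None: entity uncached
            for key in term_keys:
                if key in found:
                    self.local[key] = found[key]
                elif known_keys is not None and key not in known_keys:
                    self.local[key] = MISSING  # cached entity has no such term
                # otherwise the miss stays unresolved (uncached or evicted)

    def get_terms(self, entity_ids, term_types, languages):
        """Return prefetched terms keyed by (entity_id, type, language)."""
        result = {}
        for entity_id in entity_ids:
            for t in term_types:
                for lang in languages:
                    value = self.local.get(term_key(entity_id, t, lang), MISSING)
                    if value is not MISSING:
                        result[(entity_id, t, lang)] = value
        return result


cache = TermCache(FakeMemcached())
cache.update_terms("Q1", {("label", "en"): "Universe", ("alias", "en"): ["cosmos"]})
cache.prefetch_terms(["Q1", "Q2"], ["label"], ["en", "de"])
print(cache.get_terms(["Q1", "Q2"], ["label"], ["en", "de"]))
```

In this sketch the German label of Q1 is recorded as non-existent because Q1's key list is cached and does not mention it, while nothing is recorded for Q2, which is simply uncached. Who resolves such unresolved misses is left open here, as in the outline.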

The main issue that remains to be decided is when and where cache misses are 
resolved (by looking at the wb_terms table); one complication is that it's 
unclear whether we should load all terms of an entity in such a case, or just 
the one we currently need.
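
To make the trade-off concrete, here is a hedged Python sketch that counts queries and rows for both options. The `ROWS` list is a made-up stand-in for the wb_terms table, and `MissResolver`, `load_one` and `load_all` are hypothetical names used only for this illustration.

```python
# Stand-in for rows of the wb_terms table: (entity_id, term_type, language, text).
ROWS = [
    ("Q1", "label", "en", "Universe"),
    ("Q1", "label", "de", "Universum"),
    ("Q1", "description", "en", "totality of space and time"),
    ("Q2", "label", "en", "Earth"),
]


def load_one(entity_id, term_type, language):
    """Option A: fetch only the term that missed."""
    return [row for row in ROWS if row[:3] == (entity_id, term_type, language)]


def load_all(entity_id):
    """Option B: fetch every term of the entity."""
    return [row for row in ROWS if row[0] == entity_id]


class MissResolver:
    """Resolves cache misses and counts the cost of doing so."""

    def __init__(self, load_whole_entity):
        self.load_whole_entity = load_whole_entity
        self.cache = {}
        self.queries = 0
        self.rows_loaded = 0

    def get(self, entity_id, term_type, language):
        key = (entity_id, term_type, language)
        if key not in self.cache:
            self.queries += 1
            rows = load_all(entity_id) if self.load_whole_entity else load_one(*key)
            self.rows_loaded += len(rows)
            for e, t, lang, text in rows:
                self.cache[(e, t, lang)] = text
            self.cache.setdefault(key, None)  # negative entry for absent terms
        return self.cache[key]


for whole in (False, True):
    resolver = MissResolver(load_whole_entity=whole)
    for term_type, language in [("label", "en"), ("label", "de"), ("description", "en")]:
        resolver.get("Q1", term_type, language)
    print("whole entity" if whole else "single term", resolver.queries, resolver.rows_loaded)
```

When several terms of one entity are needed, loading the whole entity saves queries; when only one label is needed, it loads rows that are never used. Which case dominates on watchlists and recent changes is exactly what remains to be measured.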

This architecture is intended to minimize I/O volume as well as round trips to 
memcached. It still means putting a large number of small entries into 
memcached with no expiration, possibly swamping it and pushing out high-usage 
entries in favor of low-usage labels. A randomized put strategy could be used 
to mitigate this issue by writing only every Kth entry to the cache (on 
average), giving frequently used labels a higher chance of being cached.
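
One way to read the randomized put strategy is "on a miss, write the entry with probability 1/K". Under that assumption, a label requested n times ends up cached with probability 1 - (1 - 1/K)^n. The Python sketch below simulates this; `maybe_put` and `simulate` are illustrative names, and a plain dict stands in for memcached.

```python
import random


def maybe_put(cache, key, value, k, rng):
    """Write to the cache with probability 1/k; return True if written."""
    if rng.random() < 1.0 / k:
        cache[key] = value
        return True
    return False


def simulate(requests_per_label, labels, k, seed=0):
    """Fraction of labels cached after each was requested the given number of times."""
    rng = random.Random(seed)
    cache = {}
    for label in range(labels):
        for _ in range(requests_per_label):
            if label not in cache:  # a miss: resolved elsewhere, then maybe cached
                maybe_put(cache, label, "text", k, rng)
    return len(cache) / labels


hot = simulate(requests_per_label=50, labels=1000, k=10)
cold = simulate(requests_per_label=1, labels=1000, k=10)
print(f"requested 50 times: {hot:.2f} cached; requested once: {cold:.2f} cached")
```

With K = 10, nearly all labels requested 50 times get cached, while only about one in ten of the labels requested once does. The price is that a hot label may miss several times before it is written.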

To: daniel
Cc: wikidata-bugs, Tobi_WMDE_SW, Liuxinyu970226, Wikidata-bugs, daniel





[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes

2014-11-30 Thread daniel
daniel added a comment.

Jan and Thiemo brought up an idea for pre-fetching labels accessed via Lua or 
{{#property}} with arbitrary access enabled: we pre-fetch based on the usage 
tracking info, so that when parsing a new revision of a page, we have fast 
access to all labels/terms used by the previous revision. The idea is that in 
most cases, the labels used will change little, if at all.
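
A minimal Python sketch of that idea, assuming the usage tracking info boils down to a set of entity IDs per page. `LABELS`, `fetch_batch`, `fetch_one` and `parse_revision` are hypothetical stand-ins, not Wikibase APIs.

```python
# Hypothetical label store standing in for the term lookup backend.
LABELS = {"Q1": "Universe", "Q2": "Earth", "Q3": "life", "Q4": "Berlin"}


def fetch_batch(entity_ids):
    """One batched lookup for many labels (a single round trip)."""
    return {entity_id: LABELS.get(entity_id) for entity_id in entity_ids}


def fetch_one(entity_id):
    """One lookup for a single label (one round trip each)."""
    return LABELS.get(entity_id)


def parse_revision(previous_usage, needed):
    """Resolve the labels a new revision needs, prefetching the old usage first.

    Returns the labels, the number of single lookups, and the new usage set.
    """
    local = fetch_batch(sorted(previous_usage))
    single_lookups = 0
    labels = {}
    for entity_id in needed:
        if entity_id not in local:
            local[entity_id] = fetch_one(entity_id)
            single_lookups += 1
        labels[entity_id] = local[entity_id]
    return labels, single_lookups, set(needed)


labels, single_lookups, new_usage = parse_revision({"Q1", "Q2", "Q3"}, ["Q1", "Q2", "Q4"])
print(labels, single_lookups)
```

Here the previous revision used Q1, Q2 and Q3 and the new one uses Q1, Q2 and Q4, so only Q4 needs a separate lookup. The label for Q3 is fetched but unused, which is the cost of guessing from the previous revision.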

To: daniel
Cc: wikidata-bugs, Tobi_WMDE_SW, Liuxinyu970226, Wikidata-bugs, daniel





[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes

2014-11-30 Thread aude
aude added a comment.

if we use memcached, then i suggest we cache entities that are in the recent 
changes table (maybe the last x days) plus perhaps the most used entities as 
determined by client usage tracking.
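
As a rough Python sketch of that selection, assuming recent changes are available as (entity ID, timestamp) pairs and usage tracking as a count per entity. The function name, its parameters and the sample data are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def entities_to_cache(recent_changes, usage_counts, now, days, top_n):
    """Union of entities changed in the last `days` days and the `top_n` most used."""
    cutoff = now - timedelta(days=days)
    recent = {entity_id for entity_id, timestamp in recent_changes if timestamp >= cutoff}
    most_used = {entity_id for entity_id, _ in Counter(usage_counts).most_common(top_n)}
    return recent | most_used


now = datetime(2014, 11, 30)
recent_changes = [("Q1", datetime(2014, 11, 29)), ("Q2", datetime(2014, 10, 1))]
usage_counts = {"Q2": 3, "Q5": 120, "Q7": 45}
print(sorted(entities_to_cache(recent_changes, usage_counts, now, days=7, top_n=2)))
```

Q2 is left out in this example: its change is older than the cutoff and it is not among the two most used entities.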

To: aude
Cc: wikidata-bugs, Tobi_WMDE_SW, Liuxinyu970226, Wikidata-bugs, daniel, aude




