[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes
daniel added a comment. @aude: thanks for the benchmark! I'm surprised that loading the full entities does not have a bigger impact on memory usage. How many different entities were hit, and how big are these entities? TASK DETAIL https://phabricator.wikimedia.org/T74309 REPLY HANDLER ACTIONS Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign username. To: daniel Cc: Tobi_WMDE_SW, Liuxinyu970226, Wikidata-bugs, daniel, aude ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes
daniel added a comment. @aude: "memcached retrieval of entities is faster than sql queries of term table": I'm not sure this is true, especially for large items. I have sent mail to Springle and Ori asking for input, and will do some benchmarking. Of course, pre-fetching/batching plus in-process caching is the Right Thing; the question at hand is whether we want to switch to table lookups NOW, to save memory and perhaps also time.
[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes
aude added a comment. only one data point comparison on my dev wiki, and obviously i can't reproduce all production conditions, but:

- recent changes with EntityTermLookup (30 days, 60 items), TermSqlIndex: 65955512 memory, 28786 backend response time
- recent changes with EntityRetrievingTermLookup (30 days, 60 items), few cache misses: 70434344 memory, 20670 backend response time
- recent changes with EntityRetrievingTermLookup (30 days, 60 items), memcached restarted so many cache misses: 70634728 memory, 33539 backend response time

neither is really a good choice. TermSqlIndex queries are somewhat better with regard to memory usage, while EntityRetrievingTermLookup seems better if cache misses are few enough. these results are also consistent with comparisons i had made on the EntityView (before we had EntityInfoTermLookup). if we can somehow profile in production (with reasonable effort), that would be good and interesting. overall, we should continue to work on batched lookup and caching in as many places as possible.
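To make the trade-off in these figures easier to read, here is a small Python sketch that expresses each run relative to the TermSqlIndex run. The numbers are copied from the benchmark above; the units are not stated in the thread, so only ratios are computed. The names `runs` and `relative_change` are made up for this illustration.

```python
# Figures copied from the benchmark above: (memory, backend response time).
# Units are not given in the thread, so only relative changes are computed.
runs = {
    "TermSqlIndex": (65955512, 28786),
    "EntityRetrievingTermLookup, few cache misses": (70434344, 20670),
    "EntityRetrievingTermLookup, many cache misses": (70634728, 33539),
}


def relative_change(value, baseline):
    """Return the change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


base_mem, base_resp = runs["TermSqlIndex"]
for name, (mem, resp) in runs.items():
    print(
        f"{name}: memory {relative_change(mem, base_mem):+.1f}%, "
        f"response time {relative_change(resp, base_resp):+.1f}%"
    )
```

On this single data point, the entity-based lookup uses roughly 7% more memory in both cases, while its response time is about 28% lower with a warm cache and about 17% higher with a cold one.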
[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes
gerritbot added a comment. Change 176378 abandoned by Daniel Kinzler: Introducing RecentChangesRowsForDisplay hook. Reason: ChangesListInitRows exists and should do. https://gerrit.wikimedia.org/r/176378
[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes
daniel added a comment. As Katie pointed out, the existing ChangesListInitRows hook should fit the bill already. The rest of the investigation should be to outline which services/interfaces we need to make the pre-fetching work.
[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes
daniel added a comment. Here's a rough outline of the pre-fetching infrastructure for labels and other Terms:

We need a TermCache service like this:

```
interface TermCache {

	/**
	 * Update terms for the given entity. Any old terms associated with the entity are discarded.
	 */
	public function updateTerms( EntityId $entityId, Fingerprint $terms );

	/**
	 * Loads a set of terms into memory, for later use by getTerms().
	 */
	public function prefetchTerms( EntityId $entityId, Fingerprint $terms );

	/**
	 * Get terms of the given types for the given entities, in the given languages (or all languages).
	 *
	 * @param EntityId[] $entityIds
	 */
	public function getTerms( array $entityIds, $termTypes, $languages );

}
```

This interface is somewhat similar to the TermIndex interface. We should consider cleaning up TermIndex, and using that.

The TermCache makes use of a persistent cache (optionally shared between wikis), i.e. memcached, and local in-process caching. It would be used as follows:

- ChangeHandler calls updateTerms() when it is notified of an Entity being updated. If the cache is shared, the repo does this immediately when the terms associated with an Entity are modified. In both cases, terms from the Fingerprint are placed into memcached using setMulti(), one entry per term. The cache key will contain the entity Id, term type, and language (plus possibly a wiki id, version id, etc). Multi-value terms (aliases) are stored as a list of values. A special key that does not include the type or language parts is used to store a list of all the keys used for a given entity, to allow these keys to be purged when updateTerms() is called again for the same entity.
- A hook like ChangesListInitRows is used to trigger prefetchTerms(); prefetchTerms() uses getMulti() to fetch all the desired terms from memcached, and stores them locally in a hash. PROBLEM: if some terms are missing, we do not know whether they are uncached, or do not exist. Negative caching and/or checking the key list should be used. TBD: Decide whether TermCache should know how to fetch uncached terms.
- A hook like LinkBegin may use a LabelLookup that is based upon a TermCache and TermLookup to get terms associated with item pages. If the desired label was previously fetched via prefetchTerms(), this should be very quick.

The main issue that remains to be decided is when and where cache misses are resolved (by looking at the wb_terms table); one complication is that it's unclear whether we should load all terms of an entity in such a case, or just the one we currently need.

This architecture is intended to minimize i/o volume as well as round trips to memcached. It still means putting a large number of small entries into memcached with no expiration, possibly swamping it and pushing out high usage entries with low usage labels. A randomized put strategy could be used to mitigate this issue by writing only every Kth entry to the cache, giving frequently used labels a higher chance of being cached.
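The key scheme and the "checking the key list" idea from the outline above can be sketched as follows. This is a minimal illustration in Python, not Wikibase code: `FakeMemcached`, `term_key`, `key_list_key`, the `terms:` key prefix and the snake_case method names are all assumptions made for this example, and prefetching here takes entity ids, term types and languages, which is one possible reading of the outline.

```python
class FakeMemcached:
    """In-memory stand-in for a shared cache with multi-key operations."""

    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def set_multi(self, entries):
        self.round_trips += 1
        self.data.update(entries)

    def get_multi(self, keys):
        self.round_trips += 1
        return {k: self.data[k] for k in keys if k in self.data}

    def delete_multi(self, keys):
        self.round_trips += 1
        for k in keys:
            self.data.pop(k, None)


def term_key(entity_id, term_type, language):
    # Hypothetical key layout: entity id, term type and language.
    return f"terms:{entity_id}:{term_type}:{language}"


def key_list_key(entity_id):
    # Special key without type/language; holds all term keys of the entity.
    return f"terms:{entity_id}"


class TermCache:
    def __init__(self, shared):
        self.shared = shared
        self.local = {}  # key -> value, or None if the term is known not to exist

    def update_terms(self, entity_id, terms):
        """terms maps (term_type, language) to a value; aliases are lists."""
        list_key = key_list_key(entity_id)
        old_keys = self.shared.get_multi([list_key]).get(list_key, [])
        self.shared.delete_multi(old_keys)  # purge terms from the previous update
        for key in old_keys:
            self.local.pop(key, None)
        entries = {term_key(entity_id, t, lang): v for (t, lang), v in terms.items()}
        new_keys = sorted(entries)
        entries[list_key] = new_keys
        self.shared.set_multi(entries)

    def prefetch_terms(self, entity_ids, term_types, languages):
        """Load terms into the local hash; return the requests that are uncached."""
        wanted = [(e, t, lang) for e in entity_ids for t in term_types for lang in languages]
        keys = [term_key(*w) for w in wanted] + [key_list_key(e) for e in entity_ids]
        found = self.shared.get_multi(keys)  # a single round trip
        uncached = []
        for e, t, lang in wanted:
            key = term_key(e, t, lang)
            known_keys = found.get(key_list_key(e))
            if key in found:
                self.local[key] = found[key]
            elif known_keys is not None and key not in known_keys:
                self.local[key] = None  # entity is cached, but has no such term
            else:
                uncached.append((e, t, lang))  # must be resolved elsewhere
        return uncached

    def get_term(self, entity_id, term_type, language):
        return self.local.get(term_key(entity_id, term_type, language))
```

In this sketch the key list doubles as the negative cache: a term that is absent from memcached but also absent from its entity's key list is treated as non-existent, while everything else is reported back as uncached. Where those uncached requests get resolved (for example against wb_terms) is deliberately left open, matching the TBD above.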
[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes
daniel added a comment. Jan and Thiemo brought up an idea for pre-fetching labels accessed via Lua or {{#property}} with arbitrary access enabled: we pre-fetch based on the usage tracking info, so that when parsing a new revision of a page, we have fast access to all labels/terms used by the previous revision. The idea is that in most cases, the labels used will change little, if at all.
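A rough Python sketch of this idea, under the assumption that usage tracking can hand us the entity ids used by the previous revision. `CountingLabelStore`, `PrefetchingLabelLookup` and `parse_revision` are hypothetical names for illustration; the store stands in for whatever slow backend ends up serving labels.

```python
class CountingLabelStore:
    """Hypothetical slow backend; counts batched and single lookups."""

    def __init__(self, labels):
        self.labels = labels
        self.batch_calls = 0
        self.single_calls = 0

    def get_many(self, entity_ids):
        self.batch_calls += 1
        return {e: self.labels.get(e) for e in entity_ids}

    def get_one(self, entity_id):
        self.single_calls += 1
        return self.labels.get(entity_id)


class PrefetchingLabelLookup:
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def prefetch(self, entity_ids):
        # One batched request for everything not yet held in process.
        missing = [e for e in entity_ids if e not in self.cache]
        if missing:
            self.cache.update(self.store.get_many(missing))

    def get_label(self, entity_id):
        # Falls back to a single lookup for entities that were not prefetched.
        if entity_id not in self.cache:
            self.cache[entity_id] = self.store.get_one(entity_id)
        return self.cache[entity_id]


def parse_revision(used_entity_ids, previous_usage, lookup):
    """Prefetch what the previous revision used, then resolve labels as needed."""
    lookup.prefetch(previous_usage)
    return [lookup.get_label(e) for e in used_entity_ids]


store = CountingLabelStore({"Q1": "universe", "Q2": "Earth", "Q3": "life"})
lookup = PrefetchingLabelLookup(store)
labels = parse_revision(["Q1", "Q2", "Q3"], ["Q1", "Q2"], lookup)
print(labels, store.batch_calls, store.single_calls)
```

If the new revision uses the same entities as the previous one, all labels come from the single batched request; only newly introduced entities (Q3 here) cost an extra lookup each.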
[Wikidata-bugs] [Maniphest] [Commented On] T74309: Investigate how prefetching labels would work for watchlist and recent changes
aude added a comment. if we use memcached, then suggest we cache entities that are in the recent changes table (maybe x recent days) plus perhaps most used entities as determined by client usage tracking.
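As a sketch of that selection rule, assuming we can obtain (entity id, timestamp) pairs from recent changes and per-entity usage counts from client usage tracking: the function name `entities_to_cache` and the parameters `max_age_days` and `top_n` are invented for this example, and the thread does not say what values they should take.

```python
from datetime import datetime, timedelta


def entities_to_cache(recent_changes, usage_counts, now, max_age_days, top_n):
    """Union of recently changed entities and the top_n most used entities.

    recent_changes: iterable of (entity_id, timestamp) pairs
    usage_counts: dict mapping entity_id to a usage count
    """
    cutoff = now - timedelta(days=max_age_days)
    recent = {entity_id for entity_id, changed_at in recent_changes if changed_at >= cutoff}
    # Sort by descending count; ties are broken by entity id for determinism.
    most_used = sorted(usage_counts, key=lambda e: (-usage_counts[e], e))[:top_n]
    return recent | set(most_used)


now = datetime(2014, 12, 10)
selected = entities_to_cache(
    recent_changes=[("Q1", now - timedelta(days=1)), ("Q2", now - timedelta(days=40))],
    usage_counts={"Q2": 1, "Q3": 50, "Q4": 7},
    now=now,
    max_age_days=30,
    top_n=1,
)
print(sorted(selected))
```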