Smalyshev added a comment.

Some analysis:

Master run:
https://items-repo.wmflabs.org/xhprof_html/?run=598b8f9c66fab&source=dumpRdf&sort=excl_wt
Patched run:
https://items-repo.wmflabs.org/xhprof_html/?run=598b8d44818ce&source=dumpRdf&sort=excl_wt

We can see that one of the most expensive functions there is Wikibase\DataModel\Entity\EntityId::splitSerialization. In fact, it is super-expensive even in the master run, second only to the actual SQL queries, but in the patched run it is called almost 3x as often.
So, the immediate suggested action would be to see whether we can eliminate the additional calls to splitSerialization - or, even better, eliminate or at least severely reduce all of them.

Most of the calls to splitSerialization, and all of the additional ones, come from:

  • EntityId::getLocalPart
  • EntityId::getRepositoryName

In master, the top callers are:

  • DispatchingEntityIdParser::parse
  • PrefixMappingEntityIdParser::parse

Together, those two account for 60% of the calls.
EntityId::getLocalPart is called mainly from RdfVocabulary::getEntityLName, and EntityId::getRepositoryName from various snak builders.

Surprisingly, even though the result of these should be static for any given ID, both still call child functions on every invocation. I think it should be an easy fix to convert both of those into either public variables or, if that is too offensive to good design, functions returning such variables directly, without additional calls.
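
A minimal sketch of that idea - this is not the real EntityId class, and the split logic below is a simplified stand-in for splitSerialization(), but it shows the split being computed at most once per object so that the two getters become plain reads:

  <?php
  // Sketch only: compute the repository name / local part lazily, at most
  // once per object, and have the getters return the stored values.
  class MemoizedEntityId {
      private $serialization;
      private $repositoryName;
      private $localPart;
      private $isSplit = false;

      public function __construct( $serialization ) {
          $this->serialization = $serialization;
      }

      private function splitOnce() {
          if ( $this->isSplit ) {
              return;
          }
          // Simplified stand-in for splitSerialization():
          // "foo:bar:Q42" -> repository "foo", local part "bar:Q42".
          $pos = strpos( $this->serialization, ':' );
          if ( $pos === false ) {
              $this->repositoryName = '';
              $this->localPart = $this->serialization;
          } else {
              $this->repositoryName = substr( $this->serialization, 0, $pos );
              $this->localPart = substr( $this->serialization, $pos + 1 );
          }
          $this->isSplit = true;
      }

      public function getRepositoryName() {
          $this->splitOnce();
          return $this->repositoryName;
      }

      public function getLocalPart() {
          $this->splitOnce();
          return $this->localPart;
      }
  }

With that, no matter how many times RdfVocabulary::getEntityLName or the snak builders ask for the local part or the repository name, the split happens once per ID object.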

I also note that we construct way more ID objects than there presumably are distinct IDs (e.g. the PropertyId ctor is called 34,892 times even though there are only about 3K properties in Wikidata altogether), so I wonder if caching id->object may be helpful. Caching all IDs may be too much (even though well within range of any of our 128G servers :) but at least an LRU of reasonable size or something? Not sure how easy it would be to add - this cache is probably useless for most web workloads but would be useful for dumps. Maybe we could inject something into WikibaseLib.entitytypes.php? It won't save too much - the ItemId ctor is 3.2% and the PropertyId ctor is 2.8% - but even a 6% gain is not too bad.
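
As a hedged sketch of the id->object caching: something along these lines could work. The decorator below is hypothetical (not an existing Wikibase class), the LRU is a plain array so the example stays self-contained, and a real version would implement the EntityIdParser interface and probably reuse an existing LRU helper:

  <?php
  // Hypothetical caching decorator around an EntityIdParser-like object.
  // The cache is a plain array keyed by serialization; PHP arrays preserve
  // insertion order, so the first key is the least recently used entry.
  class CachingEntityIdParser {
      private $parser;
      private $cache = [];
      private $maxSize;

      public function __construct( $parser, $maxSize = 10000 ) {
          $this->parser = $parser;
          $this->maxSize = $maxSize;
      }

      public function parse( $idSerialization ) {
          if ( isset( $this->cache[$idSerialization] ) ) {
              // Re-insert the hit so it becomes the most recently used entry.
              $id = $this->cache[$idSerialization];
              unset( $this->cache[$idSerialization] );
              $this->cache[$idSerialization] = $id;
              return $id;
          }

          $id = $this->parser->parse( $idSerialization );

          if ( count( $this->cache ) >= $this->maxSize ) {
              // Evict the least recently used entry (the first array key).
              reset( $this->cache );
              unset( $this->cache[key( $this->cache )] );
          }
          $this->cache[$idSerialization] = $id;
          return $id;
      }
  }

Assuming the ID objects are treated as immutable value objects, handing the same instance out repeatedly during the dump should be safe.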

Looking into splitSerialization, we see:

  • Call to assertValidSerialization - we validate the serialization format of IDs we take from the database, 4,665,466 times for 7K IDs. That means we validate each ID over 600 times. Even if there were a case for validating IDs coming from the DB, which I don't see, there is certainly no case for doing it 600+ times. This should be eliminated. Validation takes 1/4 of the time consumed by the function.
  • 4 array functions - explode, array_pop, array_shift, implode. They don't take much time, together about 5%, but I wonder if it could be simpler. Maybe not, but worth spending some thought.
  • normalizeIdSerialization - this takes 10% of the time and, as far as I can see, is completely unnecessary for IDs coming from the DB. Maybe we should have separate code paths for "dirty" IDs - which come from users etc. and get all the validation, transformation and whatnot - and for "clean" IDs which come from a trusted source, so we don't have to spend so many cycles validating clean data (a rough sketch follows this list).
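
A rough sketch of what separate trusted/untrusted paths could look like - the class and function names below are made up for illustration, and the return shape is simplified compared to the real splitSerialization(); it also shows the strpos/substr variant instead of the four array functions:

  <?php
  // Hypothetical split between a validating path for "dirty" input and a
  // fast path for "clean" IDs that already come from a trusted source (DB).
  class EntityIdSplitter {
      // Full path for user-supplied / untrusted serializations.
      public static function splitUntrusted( $serialization ) {
          if ( !is_string( $serialization ) || $serialization === '' ) {
              throw new InvalidArgumentException( 'Expected a non-empty string id' );
          }
          // Stand-in for normalizeIdSerialization(): drop leading colons.
          $serialization = ltrim( $serialization, ':' );
          return self::splitTrusted( $serialization );
      }

      // Fast path: no validation, no normalization, one strpos + two substr
      // calls instead of explode/array_pop/array_shift/implode.
      public static function splitTrusted( $serialization ) {
          $pos = strpos( $serialization, ':' );
          if ( $pos === false ) {
              return [ '', $serialization ];
          }
          return [
              substr( $serialization, 0, $pos ),
              substr( $serialization, $pos + 1 )
          ];
      }
  }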

I am not sure I fully appreciate how IDs work there yet, so we may find further ways to optimize this - let's discuss. But the above are, I think, the most immediate things.


TASK DETAIL
https://phabricator.wikimedia.org/T162371


To: Smalyshev
Cc: hoo, Ladsgroup, PokestarFan, Lucas_Werkmeister_WMDE, Smalyshev, daniel, WMDE-leszek, Aklapper, GoranSMilovanovic, QZanden, Izno, Wikidata-bugs, aude, Mbch331
