dcausse added a comment.

  I did some experiments using one chunk of our dumps, which accounts for 
31,883,361 triples, or ~3‰ of the dump size.
  The journal size using the default //tertiary// strength is 154GB; it grows 
to 174GB using //identical//, which is close to a 13% increase in size. Assuming 
that this increase remains linear we would jump from 886GB to 1TB (a 114GB 
increase) on the current production machine.
  For the benefit (the terms that are no longer conflated): //Identical// 
allows storing 9855953 terms vs 9855878 for //tertiary//, which means that out 
of the 9855953 terms I inspected only **75** are conflated.
  Using collation strength //Identical// does not seem to be the right approach 
to me (cost vs benefit).
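
  As a rough check of the cost vs benefit figures above, here is a minimal 
sketch that redoes the arithmetic. It assumes, as stated above, that the size 
increase stays linear; the variable names are only illustrative.

```python
# Figures quoted in this comment (journal sizes in GB, term counts from the test chunk).
TERTIARY_GB = 154
IDENTICAL_GB = 174
PROD_JOURNAL_GB = 886
TERMS_IDENTICAL = 9_855_953
TERMS_TERTIARY = 9_855_878

# Cost: relative journal growth, extrapolated linearly to the production journal.
growth = (IDENTICAL_GB - TERTIARY_GB) / TERTIARY_GB
projected_gb = PROD_JOURNAL_GB * (1 + growth)

# Benefit: terms that tertiary conflates but identical keeps apart.
conflated = TERMS_IDENTICAL - TERMS_TERTIARY
conflated_share = conflated / TERMS_IDENTICAL

print(f"journal growth: {growth:.1%}")
print(f"projected production journal: {projected_gb:.0f} GB")
print(f"conflated terms: {conflated} ({conflated_share:.5%} of all terms)")
```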
  
  I believe we should at least fix the obvious ICU issues by upgrading the 
version used by blazegraph but concerning the symbols (P13502 
<https://phabricator.wikimedia.org/P13502>) we should try to find an 
alternative at the blazegraph level that does not involve a 13% increase in 
journal size.
  
  I wonder for instance why blazegraph is using collation for building its keys 
here: is the term index used for sorting or doing range queries? If not, maybe 
there would be a way to add a custom key generator that just does NFC 
normalization and uses UTF-8 for the Term2ID index, a bit like what lucene does.
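
  To illustrate the idea only (this is not Blazegraph code, and `term_key` is 
a hypothetical name), a key built from NFC normalization plus UTF-8 encoding 
could look like the sketch below. The sample characters are my own examples, 
not taken from P13502.

```python
import unicodedata


def term_key(term: str) -> bytes:
    """Hypothetical Term2ID key: NFC-normalize the term, then encode it as UTF-8."""
    return unicodedata.normalize("NFC", term).encode("utf-8")


# Canonically equivalent spellings collapse to the same key.
composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent
assert term_key(composed) == term_key(decomposed)

# Distinct symbols always get distinct keys, since no collation rules are applied.
assert term_key("\u2640") != term_key("\u2642")  # female sign vs male sign

# Case and accents are also preserved, unlike with weaker collation strengths.
assert term_key("a") != term_key("A")
```

  Such keys sort by code point rather than by any linguistic order, so this 
would only be an option if the term index is not relied upon for sorted or 
range access, which is the open question above.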
  
  To summarize:
  
  - using //Identical// does not seem to be a viable solution to solve this issue
  - upgrading blazegraph to a newer version of ICU will solve **some** of the 
problems
  - evaluate other approaches for computing the Term2ID keys to stop conflating 
symbols
  
  Given that blazegraph is unmaintained I'm pessimistic about the third point; 
the second point sounds more approachable.

TASK DETAIL
  https://phabricator.wikimedia.org/T233204

To: dcausse
Cc: Unjoanqualsevol, Nikki, CamelCaseNick, Smalyshev, Aklapper, 
Lucas_Werkmeister_WMDE, Igorkim78, Gehel, Lea_Lacroix_WMDE, CBogen, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331