dcausse added a comment.
I ran some experiments on one chunk of our dumps, which accounts for 31,883,361 triples, or ~3‰ of the dump size. The journal size is 154GB with the default //tertiary// strength and grows to 174GB with //identical//, which is close to a 13% increase. Assuming this increase remains linear, we would jump from 886GB to 1TB (114GB increase) on the current production machine.

As for the benefit (the terms that are no longer conflated): //identical// stores 9,855,953 terms vs 9,855,878 for //tertiary//, which means that out of the 9,855,953 terms I inspected, only **75** are conflated under //tertiary//.

Using collation strength //identical// does not seem to be the right approach to me (cost vs benefit). I believe we should at least fix the obvious ICU issues by upgrading the version used by blazegraph, but for the symbols (P13502 <https://phabricator.wikimedia.org/P13502>) we should try to find an alternative at the blazegraph level that does not involve a 13% increase in journal size. I wonder, for instance, why blazegraph uses collation to build its keys here: is the term index used for sorting or for range queries? If not, there might be a way to add a custom key generator that only applies NFC normalization and uses UTF-8 for the Term2ID index, a bit like what lucene does.

To summarize:
- using //identical// does not seem to be a viable solution to this issue
- upgrading blazegraph to a newer version of ICU will solve **some** of the problems
- evaluate other approaches for computing the Term2ID keys to stop conflating symbols

Given that blazegraph is unmaintained, I'm pessimistic about the third point; the second point sounds more approachable.
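
To make the NFC + UTF-8 idea concrete, here is a minimal sketch in Python. It is not blazegraph code and does not use any blazegraph API: the names `term_key` and `count_distinct_keys` are hypothetical, and the sample terms are made up for illustration. It only shows the intended behaviour of such a key generator: canonically equivalent spellings map to the same key, while distinct symbols keep distinct keys.

```python
import unicodedata


def term_key(term: str) -> bytes:
    """Hypothetical Term2ID key: NFC-normalize the term, then encode it as UTF-8."""
    return unicodedata.normalize("NFC", term).encode("utf-8")


def count_distinct_keys(terms) -> int:
    """Number of distinct keys produced for a collection of terms."""
    return len({term_key(t) for t in terms})


# Illustrative terms (assumptions, not taken from the dump):
composed = "\u00e9"       # 'é' as a single precomposed code point
decomposed = "e\u0301"    # 'e' followed by a combining acute accent
spade, club = "\u2660", "\u2663"  # two distinct symbols

# Canonically equivalent strings share a key; the two symbols do not.
same_key = term_key(composed) == term_key(decomposed)
symbols_distinct = term_key(spade) != term_key(club)
distinct = count_distinct_keys([composed, decomposed, spade, club])
```

One caveat with this approach: keys built this way sort in code point order rather than in any locale-aware order, so it would only be suitable if the term index is not relied upon for collation-ordered sorting or range queries, which is the open question above.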
TASK DETAIL
https://phabricator.wikimedia.org/T233204

To: dcausse
Cc: Unjoanqualsevol, Nikki, CamelCaseNick, Smalyshev, Aklapper, Lucas_Werkmeister_WMDE, Igorkim78, Gehel, Lea_Lacroix_WMDE, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331