akosiaris added a comment.
In T301471#7840496 <https://phabricator.wikimedia.org/T301471#7840496>, @Michaelcochez wrote:

> I merged the pull request on GitHub now.
>
> I do not have rights to push to the gerrit repository; it might just be my limited knowledge of how gerrit works.

I've added you to the gerrit `wikidata-propertysuggester-RecommenderServer` group; you should have access now.

> I will look into the helm chart/CI setup soon.
>
>> questions around the index file:
>
> This file is the serialization of the in-memory tree structure used for recommendation. The file is a compressed (gzipped) binary file. For serialization we use https://pkg.go.dev/encoding/gob . Given this, and the fact that changes in the tree structure can have a 'rippling effect', it is not possible (or at least extremely hard) to alter the file. This tree is a specifically crafted type of index; serving its data from an external database would be impossible/detrimental for performance, as it would require //a lot// of roundtrips.

Despite the allure, shipping around serialized memory objects has many drawbacks as an approach. The most obvious are the security ones, and most languages indeed put wording in their respective frameworks to point that out: https://github.com/golang/go/issues/20221 has some hints, and Python's pickle documentation more or less points out the same. Really big hacks that have exfiltrated tons of private data have happened because of serialization vulnerabilities (e.g. the Equifax hack relied on an Apache Struts serialization vulnerability: https://nakedsecurity.sophos.com/2017/09/06/apache-struts-serialisation-vulnerability-what-you-need-to-know/).

There are more drawbacks, of course. For example, how do you do versioning of the dataset? It needs to always match the definitions of the Go struct that it contains. Even simple changes in field names can cause unintended behavior: renaming a field means that its data will be silently dropped when deserializing an older dataset and loading it into memory (see the gob sketch below). Thus the dataset needs to be strongly coupled with the application (that is, they need to be deployed in tandem), which is a bad pattern due to the size constraints I've explained above, not to mention the fact that gerrit currently won't even allow you to upload the file.

> The index file is loaded into memory once when the process starts. It could be loaded from 'anywhere' and does not even have to reside on disk necessarily.

That's the thing: it can't be loaded from 'anywhere', because of the security issues and because of the strong coupling it has with the application itself.

A final question, regarding the external database roundtrips note. Almost all datastores (RDBMS or NoSQL ones) have the ability to batch results, obviating the need for multiple roundtrips. As a result, many ORMs (Hibernate, Django, SQLAlchemy, Gorm) also support this (naming the functionality with various terms, but it's there). In fact, we've seen this before, and in most cases rewriting the queries to fetch hundreds or thousands of entities in one go instead of issuing hundreds or thousands of queries was trivial (a sketch follows below). There are also other well-known, proven, and safer ways of shipping around serialized structured data, e.g. protocol buffers (aka protobufs[1]). Have any of the above been evaluated?
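To make the quoted description concrete, here is a minimal Go sketch of what loading such a gzipped gob index at startup typically looks like. The `RecommendationTree` type, its fields, and the file name are invented for illustration; the real server's tree type will differ.

```
package main

import (
	"compress/gzip"
	"encoding/gob"
	"log"
	"os"
)

// RecommendationTree is a hypothetical stand-in for the server's real tree type.
type RecommendationTree struct {
	Roots map[string][]int // illustrative fields only
}

func loadIndex(path string) (*RecommendationTree, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	// The file is gzipped, so wrap the reader before decoding.
	gz, err := gzip.NewReader(f)
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	// gob deserializes the stream straight into the in-memory struct,
	// which is exactly why dataset and struct definitions must stay in sync.
	var tree RecommendationTree
	if err := gob.NewDecoder(gz).Decode(&tree); err != nil {
		return nil, err
	}
	return &tree, nil
}

func main() {
	tree, err := loadIndex("index.gob.gz")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("loaded %d roots", len(tree.Roots))
}
```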
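The field-rename pitfall mentioned above is easy to demonstrate with `encoding/gob` itself: gob matches struct fields by name, and fields present in the stream but absent from the receiver are silently ignored. The `ItemV1`/`ItemV2` types below are hypothetical.

```
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"log"
)

// ItemV1 is the struct as the dataset was originally written.
type ItemV1 struct {
	Label string
	Count int
}

// ItemV2 renamed Label to Title.
type ItemV2 struct {
	Title string
	Count int
}

func main() {
	// Encode with the old definition.
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(ItemV1{Label: "Q42", Count: 7}); err != nil {
		log.Fatal(err)
	}

	// Decode with the new definition: no error is returned, but Label's
	// data is silently dropped and Title stays at its zero value "".
	var item ItemV2
	if err := gob.NewDecoder(&buf).Decode(&item); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%+v\n", item) // prints: {Title: Count:7}
}
```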
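And to illustrate the batching point: a hedged sketch using the standard `database/sql` package, fetching many entities in a single round trip via an `IN (...)` clause instead of one query per entity. The table and column names (`items`, `id`, `label`) and the Postgres driver are assumptions for the example only.

```
package main

import (
	"database/sql"
	"fmt"
	"strings"

	_ "github.com/lib/pq" // assumed Postgres driver
)

// fetchLabels retrieves labels for all ids in one query/round trip.
func fetchLabels(db *sql.DB, ids []int) (map[int]string, error) {
	if len(ids) == 0 {
		return map[int]string{}, nil
	}

	// Build $1,$2,... placeholders so all ids travel in a single query.
	placeholders := make([]string, len(ids))
	args := make([]interface{}, len(ids))
	for i, id := range ids {
		placeholders[i] = fmt.Sprintf("$%d", i+1)
		args[i] = id
	}
	query := fmt.Sprintf("SELECT id, label FROM items WHERE id IN (%s)",
		strings.Join(placeholders, ","))

	rows, err := db.Query(query, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	labels := make(map[int]string, len(ids))
	for rows.Next() {
		var id int
		var label string
		if err := rows.Scan(&id, &label); err != nil {
			return nil, err
		}
		labels[id] = label
	}
	return labels, rows.Err()
}
```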
[1] https://developers.google.com/protocol-buffers
