akosiaris added a comment.
In T301471#7840496 <https://phabricator.wikimedia.org/T301471#7840496>, @Michaelcochez wrote:

> I merged the pull request on GitHub now.
>
> I do not have rights to push to the gerrit repository; it might just be my limited knowledge of how gerrit works.

I've added you to the gerrit `wikidata-propertysuggester-RecommenderServer` group; you should have access now.

> I will look into the helm chart/CI setup soon.
>
>> questions around the index file:
>
> This file is the serialization of the in-memory tree structure used for recommendation. The file is a compressed (gzipped) binary file. For serialization we use https://pkg.go.dev/encoding/gob . Given this, and the fact that changes in the tree structure can have a 'rippling effect', it is not possible (or at least extremely hard) to alter the file. This tree is a specifically crafted type of index; serving its data from an external database would be impossible/detrimental for performance, as it would require //a lot// of roundtrips.

Despite the allure, shipping around serialized memory objects has many drawbacks as an approach. The most obvious are the security ones, and most languages indeed put wording in their respective frameworks to point that out: https://github.com/golang/go/issues/20221 has some hints, and Python's pickle documentation more or less points out the same. Really big hacks that have exfiltrated tons of private data have happened because of serialization vulnerabilities (e.g. the Equifax hack relied on an Apache Struts serialization vulnerability: https://nakedsecurity.sophos.com/2017/09/06/apache-struts-serialisation-vulnerability-what-you-need-to-know/).

There are more drawbacks, of course. For example, how do you do versioning of the dataset? It needs to always match the definitions of the Go struct that it contains. Even simple changes in field names can cause unintended behavior: renaming a field means that its data will be silently dropped when deserializing an older dataset and loading it into memory (see the gob sketch below). Thus the dataset needs to be strongly coupled with the application (that is, they need to be deployed in tandem), which is a bad pattern due to the size constraints I've explained above, not to mention the fact that gerrit currently won't even allow you to upload the file.

> The index file is loaded into memory once when the process starts. It could be loaded from 'anywhere' and does not even have to reside on disk necessarily.

That's the thing: it can't be loaded from 'anywhere', because of the security issues and because of the strong coupling it has with the application itself.

A final question, regarding the external database roundtrips note. Almost all datastores (RDBMS or NoSQL ones) have the ability to batch results, obviating the need for multiple roundtrips. As a result, many ORMs (Hibernate, Django, SQLAlchemy, Gorm) also support this (naming the functionality with various terms, but it's there). In fact, we've seen this before, and in most cases rewriting the queries to fetch hundreds or thousands of entities in one go instead of issuing hundreds or thousands of queries was trivial (a sketch follows below). There are also other well-known, proven, and safer ways of shipping around serialized structured data, e.g. protocol buffers (aka protobufs[1]). Have any of the above been evaluated?
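To make the quoted description concrete, here is a minimal Go sketch of what loading such a gzipped gob index at startup typically looks like. The `RecommendationTree` type, its fields, and the file name are invented for illustration; the real server's tree type will differ.

```
package main

import (
	"compress/gzip"
	"encoding/gob"
	"log"
	"os"
)

// RecommendationTree is a hypothetical stand-in for the server's real tree type.
type RecommendationTree struct {
	Roots map[string][]int // illustrative fields only
}

func loadIndex(path string) (*RecommendationTree, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	// The file is gzipped, so wrap the reader before decoding.
	gz, err := gzip.NewReader(f)
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	// gob deserializes the stream straight into the in-memory struct,
	// which is exactly why dataset and struct definitions must stay in sync.
	var tree RecommendationTree
	if err := gob.NewDecoder(gz).Decode(&tree); err != nil {
		return nil, err
	}
	return &tree, nil
}

func main() {
	tree, err := loadIndex("index.gob.gz")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("loaded %d roots", len(tree.Roots))
}
```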
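The field-rename pitfall mentioned above is easy to demonstrate with `encoding/gob` itself: gob matches struct fields by name, and fields present in the stream but absent from the receiver are silently ignored. The `ItemV1`/`ItemV2` types below are hypothetical.

```
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"log"
)

// ItemV1 is the struct as the dataset was originally written.
type ItemV1 struct {
	Label string
	Count int
}

// ItemV2 renamed Label to Title.
type ItemV2 struct {
	Title string
	Count int
}

func main() {
	// Encode with the old definition.
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(ItemV1{Label: "Q42", Count: 7}); err != nil {
		log.Fatal(err)
	}

	// Decode with the new definition: no error is returned, but Label's
	// data is silently dropped and Title stays at its zero value "".
	var item ItemV2
	if err := gob.NewDecoder(&buf).Decode(&item); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%+v\n", item) // prints: {Title: Count:7}
}
```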
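And to illustrate the batching point: a hedged sketch using the standard `database/sql` package, fetching many entities in a single round trip via an `IN (...)` clause instead of one query per entity. The table and column names (`items`, `id`, `label`) and the Postgres driver are assumptions for the example only.

```
package main

import (
	"database/sql"
	"fmt"
	"strings"

	_ "github.com/lib/pq" // assumed Postgres driver
)

// fetchLabels retrieves labels for all ids in one query/round trip.
func fetchLabels(db *sql.DB, ids []int) (map[int]string, error) {
	if len(ids) == 0 {
		return map[int]string{}, nil
	}

	// Build $1,$2,... placeholders so all ids travel in a single query.
	placeholders := make([]string, len(ids))
	args := make([]interface{}, len(ids))
	for i, id := range ids {
		placeholders[i] = fmt.Sprintf("$%d", i+1)
		args[i] = id
	}
	query := fmt.Sprintf("SELECT id, label FROM items WHERE id IN (%s)",
		strings.Join(placeholders, ","))

	rows, err := db.Query(query, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	labels := make(map[int]string, len(ids))
	for rows.Next() {
		var id int
		var label string
		if err := rows.Scan(&id, &label); err != nil {
			return nil, err
		}
		labels[id] = label
	}
	return labels, rows.Err()
}
```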
[1] https://developers.google.com/protocol-buffers
