On Thu, Dec 1, 2011 at 8:49 PM, bawolff <[email protected]> wrote:
> Thus, just storing a table of key/value pairs is kind of problematic -
> how do you store an "array" value. Additionally you have to consider
> finding info. You probably want to efficiently be able to search
> through lang values in a specific language, or for a specific property
> and not caring for the language.

Two easiest things based on my previous experience:
1) separate values with \x00, making them easy to split after extracting a row
2) store multiple entries with an index field, making it easy to query for potentially multiples

> Also consider how big a metadata field can get. Theoretically it's not
> really limited, well I don't expect it to be huge, > 255 bytes of
> utf-8 seems a totally reasonable size for a value of a metadata field.
>
> Last of all, you have to keep in mind all sorts of stuff is stored in
> the img_metadata. This includes things like the text layer of Djvu
> files (although arguably that shouldn't be stored there...) and other
> handler specific things (OggHandler stores some very complex
> structures in img_metadata). Of course, we could just keep the
> img_metadata blob there, and simply stop using it for "exif-like"
> data, but continue using it for handler specific ugly metadata that's
> generally invisible to user [probably a good idea. The two types of
> data are actually quite different].

On text: DjVu and PDF files can optionally contain flattened searchable
text, which we extract so it can be used for things like
Extension:ProofreadPage and, potentially, search indexing:

https://bugzilla.wikimedia.org/showdependencytree.cgi?id=21061&hide_resolved=1

Currently this gets stuffed into the metadata blob along with the EXIF
data etc., and can make metadata blobs *very* large if there are
hundreds of pages of text.
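For what it's worth, the two multi-value layouts above can be sketched roughly like this. This is just an illustration, not the actual MediaWiki schema -- the table and column names (img_metadata_kv, mdk_*) are made up for the example:

```python
import sqlite3

# Approach 1: pack multiple values into one field, separated by NUL
# (\x00), which can't occur inside a valid UTF-8 metadata value.
def pack_values(values):
    return "\x00".join(values)

def unpack_values(field):
    return field.split("\x00")

packed = pack_values(["Beschreibung", "Description"])
assert unpack_values(packed) == ["Beschreibung", "Description"]

# Approach 2: one row per value with an explicit index column, so SQL
# can match any single value (or all values of a property) directly.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE img_metadata_kv ("
    " mdk_image TEXT, mdk_prop TEXT, mdk_lang TEXT,"
    " mdk_index INTEGER, mdk_value TEXT)"
)
db.executemany(
    "INSERT INTO img_metadata_kv VALUES (?, ?, ?, ?, ?)",
    [
        ("Foo.jpg", "ImageDescription", "de", 0, "Beschreibung"),
        ("Foo.jpg", "ImageDescription", "en", 0, "Description"),
        ("Foo.jpg", "Keywords", "en", 0, "cat"),
        ("Foo.jpg", "Keywords", "en", 1, "whiskers"),
    ],
)
# Search by property without caring about the language:
rows = db.execute(
    "SELECT mdk_lang, mdk_value FROM img_metadata_kv"
    " WHERE mdk_image = ? AND mdk_prop = ? ORDER BY mdk_lang",
    ("Foo.jpg", "ImageDescription"),
).fetchall()
print(rows)  # [('de', 'Beschreibung'), ('en', 'Description')]
```

Approach 1 keeps the row count down but makes the individual values invisible to SQL; approach 2 costs more rows but is the one that lets you index and query per-language or per-property, per bawolff's point above.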
If extracted page text is stored in a better key-value store, we should
make sure it doesn't get pulled into backwards-compatible metadata blobs
(if we keep 'em around as they are now) -- but it should still be
accessible through some API.

> > One issue to consider is the file archive. Should we replicate the
> > metadata table for file archive? Or serialize the data and store it
> > in a new table (something like fa_metadata)?
>
> Honestly, I wouldn't worry about that, especially in the beginning. As
> far as i know, the only place fa_metadata/oi_metadata is used, is that
> you can request it via api (I suppose it's copied over during file
> reverts as well). I don't think anyone uses that field on archived
> images really. (maybe one day bug 26741 will be fixed and this would
> be less of a concern).

That reminds me: ForeignAPIRepo (InstantCommons) wants to be able to
transfer the metadata at least for current versions; API formats should
remain compatible if possible in order for data to continue to be
transferred to clients running old versions.

-- brion

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
