On Thu, Dec 1, 2011 at 8:49 PM, bawolff <[email protected]> wrote:

> Thus, just storing a table of key/value pairs is kind of problematic -
> how do you store an "array" value? Additionally you have to consider
> finding info. You probably want to be able to efficiently search
> through values in a specific language, or for a specific property
> without caring about the language.
>

The two easiest approaches, based on my previous experience:
1) separate values with \x00, making them easy to split after extracting a
row
2) store multiple entries with an index field, making it easy to query when
there are potentially multiple values
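A rough sketch of both approaches, in Python for illustration (the column
and function names here are made up for the example, not MediaWiki's actual
schema):

```python
# Hypothetical sketch of two ways to store a multi-valued metadata
# property in a key/value table. Names are illustrative only.

NUL = "\x00"

# Approach 1: pack multiple values into one field, separated by \x00,
# and split them back out after fetching the row.
def pack_values(values):
    for v in values:
        assert NUL not in v, "values must not contain the separator"
    return NUL.join(values)

def unpack_values(blob):
    return blob.split(NUL) if blob else []

# Approach 2: store one row per value with an explicit index column,
# so one property key maps to an ordered list of rows.
def to_rows(prop, values):
    return [
        {"meta_prop": prop, "meta_index": i, "meta_value": v}
        for i, v in enumerate(values)
    ]

authors = ["Alice", "Bob"]
assert unpack_values(pack_values(authors)) == ["Alice", "Bob"]
assert to_rows("Artist", authors)[1] == {
    "meta_prop": "Artist", "meta_index": 1, "meta_value": "Bob",
}
```

The second approach costs more rows but lets the database index and query
individual values directly, which the packed-field approach can't do.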



> Also consider how big a metadata field can get. Theoretically it's not
> really limited; while I don't expect it to be huge, >255 bytes of
> utf-8 seems a totally reasonable size for a value of a metadata field.
>
> Last of all, you have to keep in mind all sorts of stuff is stored in
> the img_metadata. This includes things like the text layer of Djvu
> files (although arguably that shouldn't be stored there...) and other
> handler specific things (OggHandler stores some very complex
> structures in img_metadata). Of course, we could just keep the
> img_metadata blob there, and simply stop using it for "exif-like"
> data, but continue using it for handler specific ugly metadata that's
> generally invisible to the user [probably a good idea. The two types of
> data are actually quite different].
>

On text: DjVu and PDF files can optionally contain flattened searchable
text, which we extract so it can be used for things like
Extension:ProofreadPage and, potentially, search indexing:

https://bugzilla.wikimedia.org/showdependencytree.cgi?id=21061&hide_resolved=1

Currently this gets stuffed into the metadata blob along with the exif data
etc, and can make metadata blobs *very* large if there are hundreds of
pages of text.

If extracted page text is stored in a better key-value store, we should
make sure it doesn't get pulled into backwards-compatible metadata blobs
(if we keep them around as they are now) -- but it should still be
accessible through some API.
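As a sketch of that separation (the key names like "text" are assumptions
for illustration, not the actual handler keys), the compat blob could be
built by filtering out the bulky entries while keeping them reachable on
their own:

```python
# Hypothetical sketch: when building a backwards-compatible metadata
# blob, drop bulky handler-specific entries (e.g. an extracted text
# layer) and return them separately so an API can still serve them.
BULKY_KEYS = {"text", "xml"}  # assumed names for page text / raw XML

def split_metadata(metadata):
    """Split a metadata dict into (compat blob, bulky entries)."""
    compat = {k: v for k, v in metadata.items() if k not in BULKY_KEYS}
    bulky = {k: v for k, v in metadata.items() if k in BULKY_KEYS}
    return compat, bulky

compat, bulky = split_metadata(
    {"Artist": "Alice", "text": "hundreds of pages of DjVu text..."}
)
assert compat == {"Artist": "Alice"}
assert "text" in bulky
```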

> > One issue to consider is the file archive. Should we replicate the
> > metadata table for file archive? Or serialize the data and store it
> > in a new table (something like fa_metadata)?
>
> Honestly, I wouldn't worry about that, especially in the beginning. As
> far as I know, the only place fa_metadata/oi_metadata is used is that
> you can request it via the api (I suppose it's copied over during file
> reverts as well). I don't think anyone uses that field on archived
> images really. (maybe one day bug 26741 will be fixed and this would
> be less of a concern).
>

That reminds me: ForeignAPIRepo (InstantCommons) wants to be able to
transfer the metadata at least for current versions; API formats should
remain compatible if possible in order for data to continue to be
transferred to clients running old versions.

-- brion
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l