Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

Michael Dale Fri, 28 May 2010 11:13:55 -0700

More important than having file_metadata and page asset metadata share 
the same db table backend is that you can query and export all 
the properties in the same way.


Within SMW you already have some "special" properties like pagelinks, 
langlinks, category properties etc, that are not stored the same as the 
other SMW page properties ...  The SMW system should name-space all 
these file_metadata properties along with all the other structured data 
available and enable universal querying / RDF exporting all the 
structured wiki data. This way file_metadata would just be one more 
special data type with its own independent tables. ...

SMW should abstract the data store so it works with the existing 
structured tables. I know this was already done for categories, correct?  
Was enabling this for all the other links and usage tables explored?
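
A rough sketch of what such an abstraction could look like, in Python 
rather than PHP for brevity. Everything here is hypothetical: the "mw" and 
"file" prefixes, the reader functions, and the in-memory lists standing in 
for the independent tables are illustrative assumptions, not SMW's actual 
API. The point is only that each store keeps its own table while queries 
and exports go through one name-spaced interface.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical in-memory stand-ins for independent backing tables.
CATEGORYLINKS = [(1, "Sunsets"), (2, "Maps")]        # (page_id, category)
FILE_METADATA = [(1, "exif:Software", "GIMP")]       # (page_id, name, value)

Triple = Tuple[int, str, str]  # (page_id, name-spaced property, value)


def read_categories() -> Iterable[Triple]:
    """Expose the category table as name-spaced triples."""
    for page_id, category in CATEGORYLINKS:
        yield (page_id, "mw:category", category)


def read_file_metadata() -> Iterable[Triple]:
    """Expose the file metadata table as name-spaced triples."""
    for page_id, name, value in FILE_METADATA:
        yield (page_id, "file:" + name, value)


# Registry: name-space prefix -> reader for the table behind it.
STORES: Dict[str, Callable[[], Iterable[Triple]]] = {
    "mw": read_categories,
    "file": read_file_metadata,
}


def query(prop: str) -> List[Triple]:
    """Look up one property, dispatching on its name-space prefix."""
    prefix = prop.split(":", 1)[0]
    return [t for t in STORES[prefix]() if t[1] == prop]


def export_all() -> List[Triple]:
    """Dump every store through the same interface (e.g. for RDF export)."""
    return [t for reader in STORES.values() for t in reader()]
```

With this shape, adding file_metadata is a matter of registering one more 
reader; nothing about the other stores has to change.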

This also makes sense from an architecture perspective, where 
file_metadata is tied to the file asset and SMW properties are tied to 
the asset's wiki description page.  This way you know you don't have to 
think about that subset of metadata properties on page updates, since 
they are tied to the file asset and not to the wiki page properties 
driven by structured user input. Likewise, uploading a new version of 
the file would not touch the page data tables.

--michael

Markus Krötzsch wrote:
> Hi Bawolff,
>
> interesting project! I am currently preparing a "light" version of SMW that 
> does something very similar, but using wiki-defined properties for adding 
> metadata to normal pages (in essence, SMW is an extension to store and 
> retrieve page metadata for properties defined in the wiki -- like XMP for MW 
> pages; though our data model is not quite as sophisticated ;-).
>
> The use cases for this light version are just what you describe: simple 
> retrieval (select) and basic inverse searches. The idea is to thus have a 
> solid foundation for editing and viewing data, so that more complex functions 
> like category intersections or arbitrary metadata conjunctive queries would 
> be done on external servers based on some data dump.
>
> It would be great if the table you design could be used for such metadata as 
> well. As you say, XMP already requires extensibility by design, so it might 
> not be too much work to achieve this. SMW properties are usually identified 
> by pages in the wiki (like categories), so page titles can be used to refer to 
> them. This just requires that the meta_name field is long enough to hold MW 
> page title names. Your meta_schema could be used to separate wiki properties 
> from other XMP properties. SMW Light does not require nested structures, but 
> they could be interesting for possible extensions (the full SMW does support 
> one level of nesting for making compound values).
>
> Two things about your design I did not completely understand (maybe just 
> because I don't know much about XMP):
>
> (1) You use mediumblob for values. This excludes range searches for numerical 
> image properties ("Show all images of height 1000px or more") which do not 
> seem to be overly costly if a suitable schema were used. If XMP has a typing 
> scheme for property values anyway, then I guess one could find the numbers 
> and simply put them in a table where the value field is a number. Is this 
> use case out of scope for you, or do you think the cost of reading from two 
> tables is too high? One could also have an optional helper field 
> "meta_numvalue" used for sorting/range-SELECT when it is known from the 
> input that the values that are searched for are numbers.
>
> (2) Each row in your table specifies property (name and schema), type, and 
> the additional meta_qualifies. Does this mean that one XMP property can have 
> values of many different types and with different flags for meta_qualifies? 
> Otherwise it seems like a lot of redundant data. Also, one could put stuff 
> like type and qualifies into the mediumblob value field if they are closely 
> tied together (I guess, when searching for some value, you implicitly specify 
> what type the data you search for has, so it is not problematic to search for 
> the value + type data at once). Maybe such considerations could simplify the 
> table layout, and also make it less specific to XMP.
>
> But overall, I am quite excited to see this project progressing. Maybe we 
> could have some more alignment between the projects later on (How about 
> combining image metadata and custom wiki metadata about image pages in 
> queries? :-) but for GSoC you should definitely focus on your core goals and 
> solve this task as well as possible.
>
> Best regards,
>
> Markus
>
>
> On Friday, 28 May 2010, bawolff wrote:
>   
>> Hi all,
>>
>> For those who don't know me, I'm one of the GSOC students this year.
>> My mentor is ^demon, and my project is to enhance support for metadata
>> in uploaded files. Similar to the recent thread on interwiki
>> transclusions, I thought I'd ask for comments about what I propose
>> to do.
>>
>> Currently metadata is stored in the img_metadata field of the image table
>> as a serialized PHP array. While this works fine for the primary use
>> case - listing the metadata in a little box on the image description
>> page - it's not very flexible. It's impossible to do queries like get a
>> list of images with some specific metadata property equal to some
>> specific value, or get a list of images ordered by what software
>> edited them.
>>
>> So as part of my project I would like to move the metadata to its own
>> table. However I think the structure of the table will need to be a
>> little more complicated than just <page id>, <name>, <value> triples,
>> since ideally it would be able to store XMP metadata, which can
>> contain nested structures. XMP metadata is pretty much the most
>> complex metadata format currently popular (for metadata stored inside
>> images anyways), and can store pretty much all other types of
>> metadata. It's also the only format that can store multi-lingual
>> content, which is a definite plus as those commons folks love their
>> languages. Thus I think it would be wise to make the table store
>> information in a manner that is rather close to the XMP data model.
>>
>> So basically my proposed metadata table looks like:
>>
>> *meta_id - primary key, auto-incrementing integer
>> *meta_page - foreign key for page_id - what image is this for
>> *meta_type - type of entry - simple value or some sort of compound
>> structure. XMP supports ordered/unordered lists, associative array
>> type structures, alternative arrays (things like arrays listing the
>> value of the property in different languages).
>> *meta_schema - XMP uses different namespaces to prevent name
>> collisions. Exif properties have their own namespace, IPTC properties
>> have their own namespace, etc.
>> *meta_name - The name of the property
>> *meta_value - the value of the property (or null for some compound
>> things, see below)
>> *meta_ref - a reference to a meta_id of a different row for nested
>> structures, or null if not applicable (or 0 perhaps)
>> *meta_qualifies - boolean to denote if this property is a qualifier
>> (in XMP there are normal properties and qualifiers)
>>
>> (see http://www.mediawiki.org/wiki/User:Bawolff/metadata_table for a
>> longer explanation of the table structure)
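
For readers who prefer DDL to bullet points, the proposed columns can be 
sketched as follows, using Python's sqlite3 module as a stand-in for MySQL. 
The column names come from the list above; the column types, the 
"simple"/"seq" type labels, the add_row helper, and the convention that a 
nested row's meta_ref points at its parent row are assumptions for 
illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metadata (
        meta_id        INTEGER PRIMARY KEY AUTOINCREMENT,
        meta_page      INTEGER NOT NULL,   -- page_id of the image
        meta_type      TEXT NOT NULL,      -- simple value or compound kind
        meta_schema    TEXT NOT NULL,      -- namespace, e.g. exif
        meta_name      TEXT NOT NULL,      -- property name
        meta_value     BLOB,               -- NULL for some compound rows
        meta_ref       INTEGER REFERENCES metadata (meta_id),
        meta_qualifies INTEGER NOT NULL DEFAULT 0   -- boolean flag
    )
""")


def add_row(page, mtype, schema, name, value=None, ref=None, qualifies=False):
    """Insert one metadata row and return its meta_id."""
    cur = conn.execute(
        "INSERT INTO metadata (meta_page, meta_type, meta_schema, meta_name,"
        " meta_value, meta_ref, meta_qualifies) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (page, mtype, schema, name, value, ref, int(qualifies)),
    )
    return cur.lastrowid


# A simple root-level property.
add_row(42, "simple", "exif", "DateTimeOriginal", "2010:05:28 11:13:55")

# An ordered list: one parent row without a value, plus one row per item
# whose meta_ref points back at the parent (assumed convention).
creators = add_row(42, "seq", "dc", "creator")
add_row(42, "simple", "dc", "creator", "Alice", ref=creators)
add_row(42, "simple", "dc", "creator", "Bob", ref=creators)

row_count = conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
```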
>>
>> Now, before everyone says eww nested structures in a db are
>> inefficient and what not, I don't think it's that bad (however I'm new
>> to the whole scalability thing, so hopefully someone more
>> knowledgeable than me will confirm or deny that).
>>
>> The XMP specification specifically says that there is no artificial
>> limit on nesting depth; however, in general practice it's not nested
>> very deeply. Furthermore, in most cases the tree structure can be
>> safely ignored. Consider:
>> *Use case 1 (primary use case): displaying a metadata info box on an
>> image page. Most of the time that'd be translating specific names and
>> values into HTML table cells. The tree structure is totally
>> unnecessary. For example, the Exif property DateTimeOriginal can only
>> appear once per image (also it can only appear at the root of the tree
>> structure, but that's beside the point). There is no need to reconstruct
>> the tree, just look through all the props for the one you need. If the
>> tree structure is important it can be reconstructed on the PHP side,
>> and would typically be only the part of the tree that is relevant, not
>> the entire nested structure.
>> *Use case 2 (secondary use case): get a list of images ordered by some
>> property starting at foo, or get a list of images where property bar =
>> baz. In this case it's a simple select. It does not matter where in the
>> tree structure the property is.
>>
>> Thus, all the nestedness of XMP is preserved (so we could re-output it
>> in XMP form if we so desired), and there is no evil joining of the
>> metadata table with itself over and over again (or at all). From
>> what I understand, self-joining to reconstruct nested structures is
>> what makes them inefficient in databases.
>>
>> I also think this schema would be future proof because it can store
>> pretty much all metadata we can think of. We can also extend it with
>> custom properties we make up that are guaranteed to not conflict with
>> anything (the X in XMP is for extensible).
>>
>> As a side-note, based on my rather informal survey of commons (aka the
>> couple people who happened to be on #wikimedia-commons at that moment)
>> another use-case people think would be cool and useful is metadata
>> intersections, and metadata-category intersections. I'm not planning
>> to do this as part of my project, as I believe that would have
>> performance issues. However doing a metadata table like this does
>> leave the possibility open for people to do such intersection things
>> on the toolserver or in a DPL-like extension.
>>
>> I'd love to get some feedback on this. Is this a reasonable approach
>> for me to take?
>>
>> Thanks for reading.
>>
>> --
>> -bawolff
>>
>> _______________________________________________
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>