Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-06-01 Thread Ilmari Karonen
bawolff wrote:
 
 However, if I do implement a new table as part of this, it will
 probably use page_ids to identify the image - I don't see any reason
 to artificially restrict it to just the file namespace.

Do remember that we'll also want to store metadata for old versions of 
images.  And yes, I'm aware that our database schema isn't likely to 
lend itself to easy and elegant solutions to that.  (My advice: try to 
go for simplicity at the expense of elegance, as long as it doesn't end 
up _too_ ugly.)

-- 
Ilmari Karonen



Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-31 Thread bawolff
I'm going to use this message to respond to several people with this
email, hopefully it doesn't become confusing.

Markus wrote:
[snip]
(1) You use mediumblob for values.

I'll be honest: I chose a type somewhat arbitrarily for that field. It
needed to be long, since some metadata formats don't put length limits
on strings, so it has to be able to store rather long values. (That was
in the earlier version of the new table plan, anyway; based on the
feedback, I think I'll make the table layout much simpler.)

Each row in your table specifies...meta_qualifies

In XMP you can have a special type of property that, instead of being a
property of the image, modifies the meaning of another property. The
example given in the spec is that if you have a creator property, you
could attach a qualifier named role to it that denotes whether that
creator is the singer, the writer, or whatever. Its most common use
seems to be when multiple thumbnails of the image are stored in the XMP
at different resolutions; qualifiers then specify the resolutions of
the different choices (which is a rather moot example for us, as I
don't think we want to be storing embedded thumbnails of the image in
the db). The column was meant to be a boolean flag saying whether this
property is a sub-property of its parent, or whether it modifies the
meaning of the parent.
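
(For illustration, a qualified XMP property could be held in PHP roughly
like this. This is only a sketch; the array shape, the namespace choice
and the role value are made up for the example, not part of the
proposal.)

<?php
// Hypothetical in-memory shape for an XMP property with a qualifier.
// dc:creator is an ordered list; each item may carry qualifiers, such as
// a "role" that modifies the meaning of that particular value.
$creator = array(
    'schema' => 'http://purl.org/dc/elements/1.1/',
    'name'   => 'creator',
    'type'   => 'seq',            // ordered list
    'values' => array(
        array(
            'value'      => 'Jane Doe',
            'qualifiers' => array(
                // Not a property of the image itself: it qualifies the
                // parent value ("Jane Doe").
                array( 'name' => 'role', 'value' => 'photographer' ),
            ),
        ),
    ),
);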

But overall, I am quite excited to see this project progressing. Maybe we
could have some more alignment between the projects later on (How about
combining image metadata and custom wiki metadata about image pages in
queries? :-) but for GSoC you should definitely focus on your core goals and
solve this task as well as possible.

Based on the comments I received, I might be moving towards a simpler
table layout, which will probably be less aligned with SMW_light's
goals, but I'd love to see more alignment where it fits into the goals
of my project. Personally, I've always thought that a lot of the SMW
stuff was rather cool.


On Fri, May 28, 2010 at 3:28 PM, Neil Kandalgaonkar ne...@wikimedia.org wrote:

[snip]


 Okay, I just wrote a little novel here, but please take it as just
 opening a discussion. I think you should try for a simpler design, but
 I'm open to discussion.

After reading the comments so far I tend to agree that perhaps my
original design was a bit more complicated than it needed to be.
Scalability is pretty much the number one concern, so the simpler the
better.


 BLOBS OF SERIALIZED PHP ARE GOOD

 You should not be afraid of storing (some) data as serialized PHP,
 *especially* if it's a complex data structure. If the database doesn't
 need to query or index on a particular field, then it's a huge win NOT
 to parse it out into columns and reassemble it into PHP data structures
 on every access.

 GO FOR MEANINGFUL DATA, NOT DATA PRESERVATION

 Okay onto the next topic -- how you want to parse XMP out into a flat
 structure, with links between them. I think you were clever in how you
 tried to make the cost of storing the tree relatively minimal, but I
 just question whether it's necessary to store it at all, and whether
 this meets our needs.

[snip]
 So we shouldn't attempt to make a meta-metadata-format that has all the
 features of all possible metadata formats. Instead we should just
 standardize on one, hardcoded, metadata format that's useful for our
 purposes, and then translate other formats to that format. The simplest
 thing is just a flat series of columns. In other words, something like this:

[snip]


 And of course metadata formats differ, and not all metadata fields need
 to be queryable or indexable. It would be perfectly acceptable to parse
 out some common interesting metadata into columns, and leave all the
 other random stuff in a serialized PHP blob, much as we have today. That
 structure could be recursive or whatever floats your boat.

Hmm, I like the idea of using the serialized blobs generally, and then
exposing a few particularly interesting properties in another table. I
was actually thinking that perhaps page_props could be used for this.
Currently all it contains is the hidden-category listings (well,
theoretically any extension can house stuff there using
$wgPagePropLinkInvalidations, but I have yet to see an extension use
that, which is a little surprising as it seems like a wonderful way to
make really cool extensions really easily). Although it seems as if
that table is meant more for properties that change the behaviour of
the page they belong to in some way (like __HIDDENCAT__), any metadata
stored there would still be a property, so I don't think that abuses
its purpose too much. Really, there seems to be no reason to create a
new table if that one will do fine.


Thanks a lot for presenting your design here in detail. If you want to
take it to a wiki I can reiterate some of this debate on your design's
talk page.

Thank you for responding; your post has given me a lot to think about.
I still have a lot to learn about databases, and 

Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-31 Thread Alex Brollo
2010/5/31 bawolff bawolff...@gmail.com

 I'm going to use this message to respond to several people with this
 email, hopefully it doesn't become confusing.


Since you're managing EXIF data, consider also the possibility of
*adding* to it some relevant Commons metadata (name of the File: page,
copyright, categories). It would be great if, where possible, the
downloaded image contained them.

I'm far from deep into the matter; it's only an old layman's idea I got
when first approaching the magic of EXIF.

Alex


Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-31 Thread bawolff
Hi Alex. That's actually on my list of things to do if I have time.
Building a metadata editor for files on the wiki (probably in the form
of an extension) would be phase 2 of my project. (In my project
proposal it was on the list of things to do if I have extra time.)

Cheers,
bawolff

On Mon, May 31, 2010 at 5:31 AM, Alex Brollo alex.bro...@gmail.com wrote:


 2010/5/31 bawolff bawolff...@gmail.com

 I'm going to use this message to respond to several people with this
 email, hopefully it doesn't become confusing.

 Since you're managing EXIF data, consider also the possibility of
 adding to it some relevant Commons metadata (name of the File: page,
 copyright, categories). It would be great if, where possible, the
 downloaded image contained them.

 I'm far from deep into the matter; it's only an old layman's idea I got
 when first approaching the magic of EXIF.

 Alex





Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-31 Thread Niklas Laxström
On 31 May 2010 14:14, bawolff bawolff...@gmail.com wrote:
 I'm going to use this message to respond to several people with this
 email, hopefully it doesn't become confusing.

 Hmm, I like the idea of using the serialized blobs generally, and then
 exposing a few particularly interesting properties in another table. I
 was actually thinking that perhaps page_props could be used for this.
 Currently all it contains is the hidden-category listings (well,
 theoretically any extension can house stuff there using
 $wgPagePropLinkInvalidations, but I have yet to see an extension use
 that, which is a little surprising as it seems like a wonderful way to
 make really cool extensions really easily). Although it seems as if
 that table is meant more for properties that change the behaviour of
 the page they belong to in some way (like __HIDDENCAT__), any metadata
 stored there would still be a property, so I don't think that abuses
 its purpose too much. Really, there seems to be no reason to create a
 new table if that one will do fine.

[...]

 I think the page_props table would be the best way to implement bug
 8298. Actually, I was reading up on the page_props table the other day,
 and I believe that in the commit implementing that table, bug 8298 was
 given as an example of something cool the table could be used to
 implement.

I tried to use page_props once. I did end up using my own table, since
the parser thinks it owns the page_props table, and when a page is
parsed it happily deletes all values stored in page_props that it
doesn't know about.

 -Niklas

-- 
Niklas Laxström



Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-31 Thread Roan Kattouw
2010/5/31 Niklas Laxström niklas.laxst...@gmail.com:
 I tried to use page_props once. I did end up using my own table, since
 the parser thinks it owns the page_props table, and when a page is
 parsed it happily deletes all values stored in page_props that it
 doesn't know about.

The parser does indeed own the page_props table. It's intended for
storing properties that can be derived at parse time and set by the
parser itself or a parser hook through
$parserOutput->setProperty( $name, $value ).
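
(As a rough PHP illustration of that interface: the hook handler,
function names and property name below are invented for the example;
only ParserOutput::setProperty() itself is the real API being
described.)

<?php
// Minimal sketch of a parser function that records a value via the
// parser output, so it ends up in page_props on the next parse.
$wgHooks['ParserFirstCallInit'][] = 'wfExampleSetupParserFunction';

function wfExampleSetupParserFunction( $parser ) {
    $parser->setFunctionHook( 'examplecamera', 'wfExampleCameraRender' );
    return true;
}

function wfExampleCameraRender( $parser, $value = '' ) {
    // The parser owns page_props: properties set here are rewritten on
    // every parse, and anything it does not know about gets cleared.
    $parser->getOutput()->setProperty( 'example_camera', $value );
    return '';
}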

Roan Kattouw (Catrope)



Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-31 Thread bawolff
On Mon, May 31, 2010 at 7:04 AM, Roan Kattouw roan.katt...@gmail.com wrote:
 2010/5/31 Niklas Laxström niklas.laxst...@gmail.com:
 I tried to use page_props once. I did end up using my own table, since
 the parser thinks it owns the page_props table, and when a page is
 parsed it happily deletes all values stored in page_props that it
 doesn't know about.

 The parser does indeed own the page_props table. It's intended for
 storing properties that can be derived at parse time and set by the
 parser itself or a parser hook through
 $parserOutput->setProperty( $name, $value ).

 Roan Kattouw (Catrope)


Ah, I was thinking that looked like something too perfectly fitted to
the situation to be true. However, I still think it may be a good
approach to generally keep the metadata as a serialized PHP blob, and
have another table, similar in shape to page_props, to store specific
metadata values of interest.
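
(A hedged sketch of what querying such a table could look like through
MediaWiki's database abstraction. The file_props table and its fp_*
fields are invented for this example, mirroring the page_props layout;
they are not an existing schema.)

<?php
// Find pages whose (hypothetical) file_props rows record a given camera
// model, joining back to page to get the titles.
$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
    array( 'file_props', 'page' ),
    array( 'page_namespace', 'page_title' ),
    array(
        'fp_name'  => 'exif-model',
        'fp_value' => 'Canon EOS 350D',
        'fp_page = page_id',
    ),
    __METHOD__,
    array( 'LIMIT' => 50 )
);
foreach ( $res as $row ) {
    $title = Title::makeTitle( $row->page_namespace, $row->page_title );
    // ... use $title
}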

cheers,
bawolff



Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-29 Thread Platonides
Since you are storing the images' metadata in the db, try to make the
schema able to store metadata coming from the page as well, so it can
be used to implement bug 8298 or extensions like ImageFilter.




[Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-28 Thread bawolff
Hi all,

For those who don't know me, I'm one of the GSOC students this year.
My mentor is ^demon, and my project is to enhance support for metadata
in uploaded files. Similar to the recent thread on interwiki
transclusions, I'd thought I'd ask for comments about what I propose
to do.

Currently metadata is stored in the img_metadata field of the image
table as a serialized PHP array. While this works fine for the primary
use case - listing the metadata in a little box on the image
description page - it's not very flexible. It's impossible to do
queries like "get a list of images with some specific metadata property
equal to some specific value" or "get a list of images ordered by what
software edited them".
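
(For context, a minimal sketch of how that blob is read today. The
image table and its img_* fields are the existing schema; the file name
and the 'Model' key are just example values.)

<?php
// Fetch the current img_metadata blob for one file and unserialize it.
// Everything comes back as one opaque PHP array, which is why SQL-level
// filtering or sorting on individual properties is not possible.
$dbr = wfGetDB( DB_SLAVE );
$blob = $dbr->selectField(
    'image',
    'img_metadata',
    array( 'img_name' => 'Example.jpg' ),
    __METHOD__
);
$metadata = $blob ? unserialize( $blob ) : array();
if ( isset( $metadata['Model'] ) ) {
    // e.g. the camera model, if the handler extracted it at upload time
    print $metadata['Model'] . "\n";
}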

So as part of my project I would like to move the metadata to its own
table. However, I think the structure of the table will need to be a
little more complicated than just page id, name, value triples,
since ideally it would be able to store XMP metadata, which can
contain nested structures. XMP is pretty much the most complex metadata
format currently popular (for metadata stored inside images, anyway),
and can represent pretty much all other types of metadata. It's also
the only format that can store multi-lingual content, which is a
definite plus, as those Commons folks love their languages. Thus I
think it would be wise to make the table store information in a manner
that is rather close to the XMP data model.

So basically my proposed metadata table looks like:

*meta_id - primary key, auto-incrementing integer
*meta_page - foreign key for page_id - which image this row is for
*meta_type - type of entry: a simple value or some sort of compound
structure. XMP supports ordered/unordered lists, associative-array-like
structures, and alternative arrays (things like arrays listing the
value of the property in different languages).
*meta_schema - XMP uses different namespaces to prevent name
collisions: EXIF properties have their own namespace, IPTC properties
have their own namespace, etc.
*meta_name - the name of the property
*meta_value - the value of the property (or null for some compound
things, see below)
*meta_ref - a reference to the meta_id of a different row for nested
structures, or null if not applicable (or 0, perhaps)
*meta_qualifies - a boolean denoting whether this property is a
qualifier (in XMP there are normal properties and qualifiers)

(see http://www.mediawiki.org/wiki/User:Bawolff/metadata_table for a
longer explanation of the table structure)
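
(To make that concrete, here is a hedged sketch of how one nested XMP
property - a creator list with a role qualifier, as in the spec's
example - might map onto rows of such a table. The values, the
placeholder qualifier namespace and the use of 0 for "no parent" are
illustrative only.)

<?php
// Illustrative rows for the proposed table, written as PHP arrays.
// Row 1 is the compound dc:creator list; rows 2-3 hang off it via meta_ref.
$rows = array(
    array(
        'meta_id' => 1, 'meta_page' => 1234,
        'meta_type' => 'seq',                      // ordered list container
        'meta_schema' => 'http://purl.org/dc/elements/1.1/',
        'meta_name' => 'creator', 'meta_value' => null,
        'meta_ref' => 0, 'meta_qualifies' => 0,
    ),
    array(
        'meta_id' => 2, 'meta_page' => 1234,
        'meta_type' => 'simple',
        'meta_schema' => 'http://purl.org/dc/elements/1.1/',
        'meta_name' => 'creator', 'meta_value' => 'Jane Doe',
        'meta_ref' => 1, 'meta_qualifies' => 0,    // member of row 1
    ),
    array(
        'meta_id' => 3, 'meta_page' => 1234,
        'meta_type' => 'simple',
        'meta_schema' => 'urn:example:qualifiers', // placeholder namespace
        'meta_name' => 'role', 'meta_value' => 'photographer',
        'meta_ref' => 2, 'meta_qualifies' => 1,    // qualifies row 2
    ),
);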

Now, before everyone says "eww, nested structures in a db are
inefficient" and whatnot, I don't think it's that bad (however, I'm new
to the whole scalability thing, so hopefully someone more
knowledgeable than me will confirm or deny that).

The XMP specification specifically says that there is no artificial
limit on nesting depth; however, in general practice it's not nested
very deeply. Furthermore, in most cases the tree structure can be
safely ignored. Consider:
*Use-case 1 (primary use case): displaying a metadata info box on an
image page. Most of the time that means translating specific names and
values into HTML table cells. The tree structure is totally
unnecessary. For example, the EXIF property DateTimeOriginal can only
appear once per image (it can also only appear at the root of the tree
structure, but that's beside the point). There is no need to
reconstruct the tree; just look through all the props for the one you
need. If the tree structure is important, it can be reconstructed on
the PHP side, and it would typically be only the part of the tree that
is relevant, not the entire nested structure.
*Use-case 2 (secondary use case): get a list of images ordered by some
property, starting at foo; or get a list of images where property bar =
baz. In this case it's a simple select. It does not matter where in the
tree structure the property is.

Thus all the nestedness of XMP is preserved (so we could re-output it
into XMP form if we so desired), and there is no evil joining of the
metadata table with itself over and over again (or at all) - from what
I understand, self-joining to reconstruct nested structures is what
makes them inefficient in databases.
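
(A hedged sketch of the PHP-side reconstruction mentioned under
use-case 1: given flat rows that point at their parent via meta_ref,
rebuild the relevant subtree without any SQL self-joins. The row shape
follows the proposal above; the function name is invented.)

<?php
/**
 * Rebuild a nested structure from flat rows that reference their parent
 * through meta_ref. $rows is an array of associative arrays (one per
 * table row); returns the children of $parentId, each with its own
 * 'children' list attached.
 */
function wfExampleBuildTree( array $rows, $parentId = 0 ) {
    $children = array();
    foreach ( $rows as $row ) {
        if ( (int)$row['meta_ref'] === (int)$parentId ) {
            $node = $row;
            $node['children'] = wfExampleBuildTree( $rows, $row['meta_id'] );
            $children[] = $node;
        }
    }
    return $children;
}

// Usage: fetch all rows for one image in a single SELECT, then
// $tree = wfExampleBuildTree( $rows );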

I also think this schema would be future-proof, because it can store
pretty much all metadata we can think of. We can also extend it with
custom properties we make up that are guaranteed not to conflict with
anything (the X in XMP is for extensible).

As a side-note, based on my rather informal survey of commons (aka the
couple people who happened to be on #wikimedia-commons at that moment)
another use-case people think would be cool and useful is metadata
intersections, and metadata-category intersections. I'm not planning
to do this as part of my project, as I believe that would have
performance issues. However doing a metadata table like this does
leave the possibility open for people to do such intersection things
on the toolserver or in a DPL-like extension.

I'd love to get some feedback on this. Is this a reasonable 

Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-28 Thread church.of.emacs.ml
Hi bawolff,

thanks for your work.
I'm not very happy about the name "metadata" for the table. As far as I
understand it, this is about file metadata; "metadata" alone suggests it
contains information on pages (e.g. statistics).
Please consider using a name that contains 'file', e.g. file_metadata.

Thanks,
Church of emacs




Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-28 Thread bawolff
On Fri, May 28, 2010 at 10:12 AM, church.of.emacs.ml
church.of.emacs...@googlemail.com wrote:
 Hi bawolff,

 thanks for your work.
 I'm not very happy about the name "metadata" for the table. As far as I
 understand it, this is about file metadata; "metadata" alone suggests it
 contains information on pages (e.g. statistics).
 Please consider using a name that contains 'file', e.g. file_metadata.

 Thanks,
 Church of emacs



Hi.

Thanks for your response. You make a very good point; now that you
mention it, I can very easily see that being confusing. I definitely
agree that either file_metadata or image_metadata would be better.
(file_metadata would be good because the table contains metadata about
files that aren't images, and it is consistent with the renaming of the
image namespace to file; but image_metadata is more consistent with the
db table naming scheme, as the other table is the image table. I guess
in the end it doesn't really matter either way, as long as it's clear
the table is about uploaded media.)

cheers,
-bawolff



Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-28 Thread Markus Krötzsch
Hi Bawolff,

interesting project! I am currently preparing a light version of SMW that 
does something very similar, but using wiki-defined properties for adding 
metadata to normal pages (in essence, SMW is an extension to store and 
retrieve page metadata for properties defined in the wiki -- like XMP for MW 
pages; though our data model is not quite as sophisticated ;-).

The use cases for this light version are just what you describe: simple 
retrieval (select) and basic inverse searches. The idea is to thus have a 
solid foundation for editing and viewing data, so that more complex functions 
like category intersections or arbitrary metadata conjunctive queries would be 
done on external servers based on some data dump.

It would be great if the table you design could be used for such metadata as 
well. As you say, XMP already requires extensibility by design, so it might 
not be too much work to achieve this. SMW properties are usually identified by 
pages in the wiki (like categories), so page titles can be used to refer to 
them. This just requires that the meta_name field is long enough to hold MW 
page title names. Your meta_schema could be used to separate wiki properties 
from other XMP properties. SMW Light does not require nested structures, but 
they could be interesting for possible extensions (the full SMW does support 
one level of nesting for making compound values).

Two things about your design I did not completely understand (maybe just 
because I don't know much about XMP):

(1) You use mediumblob for values. This excludes range searches for
numerical image properties ("Show all images of height 1000px or
more"), which do not seem to be overly costly if a suitable schema were
used. If XMP has a typing scheme for property values anyway, then I
guess one could find the numbers and simply put them in a table where
the value field is a number. Is this use case out of scope for you, or
do you think the cost of reading from two tables is too high? One could
also have an optional helper field meta_numvalue used for
sorting/range-SELECT when it is known from the input that the values
being searched for are numbers.
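
(A hedged sketch of the kind of range query such a helper field would
allow, using MediaWiki's database abstraction. The metadata table, the
meta_* fields and meta_numvalue are hypothetical names following the
proposal and the suggestion above.)

<?php
// "Show all images whose pixel height is 1000 or more", comparing
// against a numeric helper column rather than the blob value.
$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
    'metadata',
    array( 'meta_page' ),
    array(
        'meta_schema' => 'http://ns.adobe.com/exif/1.0/',
        'meta_name'   => 'PixelYDimension',
        'meta_numvalue >= 1000',       // range condition on the helper field
    ),
    __METHOD__,
    array( 'ORDER BY' => 'meta_numvalue DESC', 'LIMIT' => 100 )
);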

(2) Each row in your table specifies a property (name and schema), a
type, and the additional meta_qualifies. Does this mean that one XMP
property can have values of many different types and with different
flags for meta_qualifies? Otherwise it seems like a lot of redundant
data. Also, one could put things like type and qualifies into the
mediumblob value field if they are closely tied together (I guess, when
searching for some value, you implicitly specify what type the data you
search for has, so it is not problematic to search for the value + type
data at once). Maybe such considerations could simplify the table
layout, and also make it less specific to XMP.

But overall, I am quite excited to see this project progressing. Maybe we 
could have some more alignment between the projects later on (How about 
combining image metadata and custom wiki metadata about image pages in 
queries? :-) but for GSoC you should definitely focus on your core goals and 
solve this task as well as possible.

Best regards,

Markus


On Friday, 28 May 2010, bawolff wrote:
 [snip]

Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-28 Thread Michael Dale
More important than file_metadata and page-asset metadata sharing the
same db table backend is that you can query and export all the
properties in the same way.

Within SMW you already have some special properties like pagelinks,
langlinks, category properties etc. that are not stored the same way as
the other SMW page properties ... The SMW system should namespace all
these file_metadata properties along with all the other structured data
available and enable universal querying / RDF export of all the
structured wiki data. This way file_metadata would just be one more
special data type with its own independent tables. ...

SMW should abstract the data store so it works with the existing
structured tables. I know this was already done for categories,
correct? Was enabling this for all the other links and usage tables
explored?

This also makes sense from an architecture perspective, where
file_metadata is tied to the file asset and SMW properties are tied to
the asset's wiki description page. This way you know you don't have to
think about that subset of metadata properties on page updates, since
they are tied to the file asset, not to wiki page properties driven by
structured user input. Likewise, uploading a new version of the file
would not touch the page data tables.

--michael

Markus Krötzsch wrote:
 [snip]

Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-28 Thread Markus Krötzsch
(This gets a little bit off the topic, but it should still be helpful for the 
current discussion. But if we want to discuss a more general data management 
architecture for MW, then it might be sensible to make a new thread ;-)

On Freitag, 28. Mai 2010, Michael Dale wrote:
 More important than file_metadata and page-asset metadata sharing the
 same db table backend is that you can query and export all the
 properties in the same way.

 Within SMW you already have some special properties like pagelinks,
 langlinks, category properties etc. that are not stored the same way as
 the other SMW page properties ... The SMW system should namespace all
 these file_metadata properties along with all the other structured data
 available and enable universal querying / RDF export of all the
 structured wiki data. This way file_metadata would just be one more
 special data type with its own independent tables. ...

 SMW should abstract the data store so it works with the existing
 structured tables. I know this was already done for categories,
 correct?

More recent versions of SMW actually no longer use MW's category table
for this, mostly to improve query performance.
[In a nutshell: SMW properties can refer to non-existing pages, and the
full version of SMW therefore has its own independent page id
management (because we want to use numerical IDs for all pages that are
used as property values, whether or not they exist). Using IDs
everywhere improves our query performance and reduces SQL query size,
but it creates a barrier to including MW table data, since more joins
would be needed to translate between IDs. This is one reason SMW Light
will not support queries: it uses a much simpler DB layout and less
code, but the resulting DB is not as suitable for querying.]

 Was enabling this for all the other links and usage tables explored?

Having a unified view on the variety of MediaWiki data (page metadata, user-
edited content data, file metadata, ...) would of course be great. But 
accomplishing this would require a more extensive effort than our little SMW 
extension. What SMW tries to provide for now is just a way of storing user-
edited data in a wiki (and also for displaying/exporting it).

Of course SMW already has a PHP abstraction for handling the property-value 
pairs that were added to some page, and this abstraction layer completely 
hides the underlying DB tables. This allows us to make more data accessible 
even if it is in other tables, and even to change the DB layout of our custom 
tables if required. You are right that such an abstraction could be extended 
to cover more of the native data of MediaWiki, so that data dumps can include 
it as well.

I think this idea is realistic, and I hope that SMW helps to accomplish
it at some point in the future. Yet this is not a small endeavour,
given that not even the most basic data management features are
deployed on the big Wikimedia projects today. To get there, we first
need a code review regarding security and performance, and so for the
moment we are pressed to reduce features and to shrink our code base.
This is why we are currently building the Light version that only
covers data input (without link syntax extensions), storage, look-up,
and basic export/dump. For this step, I really think that sharing a
data table with the EXIF extension would make sense, since the data
looks very similar and a more complex DB layout is not necessary for
the initial goals. We can always consider using more tables if the need
arises.

But I would be very happy if there were more people who want to make concrete 
progress toward the goal you describe. Meanwhile, we are planning in smaller 
steps ;-)

 
 This also makes sense from an architecture perspective, where
 file_metadata is tied to the file asset and SMW properties are tied to
 the asset's wiki description page. This way you know you don't have to
 think about that subset of metadata properties on page updates, since
 they are tied to the file asset, not to wiki page properties driven by
 structured user input. Likewise, uploading a new version of the file
 would not touch the page data tables.

Right, it might be useful to distinguish the internal handles (and
external URIs) of the Image page and of the image file. But having a
dedicated meta_schema value for user-edited properties of the page
might suffice to accomplish this on the DB level. I am fairly agnostic
about the details, but I am inclined to wait on a more sophisticated DB
layout until we have some usage statistics from the initial deployment
to guide us.

-- Markus


 
 Markus Krötzsch wrote:
  [snip]

Re: [Wikitech-l] [gsoc] splitting the img_metadata field into a new table

2010-05-28 Thread Neil Kandalgaonkar
On 05/28/2010 08:03 AM, bawolff wrote:
 Hi all,
 
 For those who don't know me, I'm one of the GSOC students this year.
 My mentor is ^demon, and my project is to enhance support for metadata
 in uploaded files. Similar to the recent thread on interwiki
 transclusions, I'd thought I'd ask for comments about what I propose
 to do.

Excellent! We're glad to have you on board. (FWIW I'm working on
Multimedia Usability, so I'll be watching what you come up with closely).

 So as part of my project I would like to move the metadata to its own
 table.

Great, although perhaps we could consider other things too (see below).

 ideally it would be able to store XMP metadata, which can
 contain nested structures.

 Now, before everyone says "eww, nested structures in a db are
 inefficient" and whatnot, I don't think it's that bad (however, I'm new
 to the whole scalability thing, so hopefully someone more
 knowledgeable than me will confirm or deny that).

Okay, I just wrote a little novel here, but please take it as just
opening a discussion. I think you should try for a simpler design, but
I'm open to discussion.

I'm familiar with how MySQL scales (particularly for large image
collections). Commons has a respectable collection of 6.6 million
media files, but just to put that into perspective, Facebook gets that
in just a few hours. If we're successful in improving Commons' usability
we'll probably get some multiple of our current intake rate. So we have
to plan for hundreds of millions of media files, at least.

POTENTIAL ISSUES WITH TREE STRUCTURES

Tree structures in MySQL can be deceptively fast, especially in
single-machine tests, but they tend to be a nightmare in production
environments. Turning what could be one query into eight or nine isn't
so bad on a single machine, but consider when the database and web
server are relatively far apart or loaded and thus have high latency.

Also, due to its crappy locking, MySQL sucks at keeping tree structures
consistent. If we were doing this on Oracle it would be a different
story -- they have some fancy features that make trees easy -- but we're
on MySQL.

The most scalable architectures use MySQL's strengths. MySQL is weak at
storing trees. It's good at querying really simple, flat schemas.

BLOBS OF SERIALIZED PHP ARE GOOD

You should not be afraid of storing (some) data as serialized PHP,
*especially* if it's a complex data structure. If the database doesn't
need to query or index on a particular field, then it's a huge win NOT
to parse it out into columns and reassemble it into PHP data structures
on every access.
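
(A hedged sketch of that pattern: keep the complex structure serialized,
and promote only the fields that actually need filtering or sorting to
real columns. img_width, img_height, img_metadata and img_name are
existing image table columns, used here purely to illustrate the split;
the $metadata contents are made up.)

<?php
// Hypothetical write path: a couple of queryable fields go into real
// columns, everything else stays in one serialized PHP blob that the
// database never needs to look inside.
$metadata = array(
    'camera' => 'Canon Digital Rebel XT',
    'width'  => 1600,
    'height' => 1200,
    'xmp'    => array( /* arbitrarily nested structures, kept verbatim */ ),
);

$dbw = wfGetDB( DB_MASTER );
$dbw->update(
    'image',
    array(
        'img_width'    => $metadata['width'],     // queryable column
        'img_height'   => $metadata['height'],    // queryable column
        'img_metadata' => serialize( $metadata ), // opaque blob, never queried
    ),
    array( 'img_name' => 'Example.jpg' ),
    __METHOD__
);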

GO FOR MEANINGFUL DATA, NOT DATA PRESERVATION

Okay onto the next topic -- how you want to parse XMP out into a flat
structure, with links between them. I think you were clever in how you
tried to make the cost of storing the tree relatively minimal, but I
just question whether it's necessary to store it at all, and whether
this meets our needs.

It seems to me (correct me if I'm wrong) that your structure is two
steps beyond id-key-val in abstractness: it's id-schema-key-type-val. So
the meaning of keys depends on the schema? So we might have a set of
rows like this, for two images, id 1234 and 5678:

id: 1234
schema: AcmeMetadata
key: cameraType
type: string
val: canon-digital-rebel-xt

id: 1234
schema: AcmeMetadata
key: resolution
type: string
val: 1600x1200

id: 5678
schema: SomeOtherSchema
key: recordingDevice
type: string
val: Canon Digital Rebel XT

id: 5678
schema: SomeOtherSchema
key: width
type: int
val: 1600

id:5678
schema: SomeOtherSchema
key: height
type: int
val: 1200

The point is that between schemas, we'd use different keys and values to
represent the same thing. While you've done a good job of preserving the
exact data we received, this makes for an impossibly complicated query
if we want to learn anything.

When you find yourself defining a 'type' column you should be wary,
because you're engaging in the Inner-Platform Effect. MySQL has types
already.

It seems to me that we have no requirement to preserve exact key and
value names in our database. What we need is *meaning*, not data.

So we shouldn't attempt to make a meta-metadata-format that has all the
features of all possible metadata formats. Instead we should just
standardize on one, hardcoded, metadata format that's useful for our
purposes, and then translate other formats to that format. The simplest
thing is just a flat series of columns. In other words, something like this:

id: 1234
cameraType: canon-digital-rebel-xt
width: 1600
height: 1200

id: 5678
cameraType: canon-digital-rebel-xt
width: 1600
height: 1200
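
(With flat, meaningful columns the earlier queries collapse into single,
index-friendly SELECTs. A hedged sketch: the file_metadata table and its
fm_* columns are hypothetical names for this flat layout.)

<?php
// "All images taken with a given camera, widest first" against a flat,
// hardcoded schema; one indexable condition, no tree walking or self-joins.
$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
    'file_metadata',
    array( 'fm_page', 'fm_width', 'fm_height' ),
    array( 'fm_camera_type' => 'canon-digital-rebel-xt' ),
    __METHOD__,
    array( 'ORDER BY' => 'fm_width DESC', 'LIMIT' => 50 )
);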

WHY EVEN HAVE A SEPARATE TABLE?

If we're doing one row per media file (the ideal!) then there is no
reason why you can't simply append these new metadata columns onto the
existing image table. This would make querying REALLY easy, and it would
simplify database management.

And of course metadata formats differ, and not all metadata fields need
to be queryable or