RE: metadata thoughts

Gary Shea 12 Mar 2003 23:09:48 -0000

Hey, thanks for the detailed reply!

On Wed, 12 Mar 2003, at 13:46 [-0800], Dave Viner ([EMAIL PROTECTED]) wrote:


> I'm not sure I understand your arguments against using the MetaData stuff I
> wrote.  You listed two reasons, doubling the number of disk writes on
> update, and public methods for changing metadata values.  You also said that
> in your application, you'll need "requires per-document metadata [that is]
> likely to change with each update/save."  So, you'll be bitten by your
> first objection, doubling disk writes.

I don't really want this to be a "my metadata is better than your
metadata (nah nah nah :)" argument, hopefully the end result will be
that my stuff will either 1) go away, or 2) take over part of yours
where it can do so more efficiently.

The "disk write doubling" bites m1 (your and David's implementation)
because m1 metadata is stored in a different BTree record than the
data record.  My implementation (m2) stores metadata in the same BTree
record as the data, so anytime the record's data is stored, storing the
changed metadata is free.
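
To make the write-count difference concrete, here is a toy model (a
sketch only, not Xindice code).  A HashMap stands in for the BTree and
each put() counts as one record write; the class name, method names and
the "meta:" key prefix are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: a map stands in for the BTree, one put() == one record write.
final class WriteCountSketch {
    private final Map<String, byte[]> records = new HashMap<>();
    private int writes = 0;

    private void put(String key, byte[] value) {
        records.put(key, value);
        writes++;
    }

    // m1 style: metadata lives in its own record, so an update touches two records.
    // The "meta:" prefix is a hypothetical key scheme.
    void updateSeparate(String key, byte[] data, byte[] meta) {
        put(key, data);
        put("meta:" + key, meta);
    }

    // m2 style: metadata is a header on the data, so an update touches one record.
    void updateInline(String key, byte[] data, byte[] meta) {
        byte[] record = new byte[meta.length + data.length];
        System.arraycopy(meta, 0, record, 0, meta.length);
        System.arraycopy(data, 0, record, meta.length, data.length);
        put(key, record);
    }

    int writes() {
        return writes;
    }
}
```

In this model one update costs two writes with separate metadata and one
write with inline metadata, which is the whole argument in miniature.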

> As for public methods for changing metadata, it is true that there are such
> methods.  The concept of the MetaData design that David Ku and I implemented
> is that there are 3 types of metadata stored.  First, there are "system"
> elements, like last modified time, last access time.  These are handled by
> Xindice.  Second, there are "attributes", which is a big Hashtable.  This is
> for the user to specify whatever key-value type metadata (s)he wants that
> might be app-specific.  Third, there is a custom XML document space.  This
> is for the user to specify whatever hierarchical metadata (s)he wants.
> Therefore, the methods available allow the user to easily add and remove
> key-value pairs from the hashtable, and the xml document section.  There is
> also code to let the "power" user change system attributes, but you'd have
> to write the code to call those methods.  The XMLRPC methods provided don't
> provide a way to change system elements.

The system attributes are the ones I am interested in.  What I want is a
maximally efficient, moderately configurable system metadata facility.
Consider data type (binary/xml) or data digests (MD5, etc), possibly
even Lamport type change-counter/timestamps for replication.  All
computed automatically by plugins registered and triggered at the
Collection level.

From your description it sounds like I could easily move my
metadata-generation code into your system metadata.  We pay the
performance penalty but gain reduced complexity.

Another alternative is to move your system metadata into m2, so it stays
with the data it refers to, thereby eliminating the disk-access doubling
penalty, but increasing complexity.

I don't know that it would be desirable to add all of the features
offered by m1 into m2, although it could be done.  I really see m2 as at
most useful for "system" metadata.

> Your application might need some metadata capabilities that are provided by
> the metadata design and implementation that's in Xindice now.  Without
> knowing more about your requirements, it's hard to say.  But I think the
> current design is pretty flexible, and should provide you with the
> functionality you need.  If it doesn't, then we should identify if the
> missing piece is related to the design or the implementation of metadata
> storage in Xindice.  If it's the implementation, then we should add it.  If
> it's the design, then we should definitely examine what is wrong with the
> design.
> 
> dave


To give you the flavor of what I'm doing, here's the simplest possible
useful example (this is a working example):

<collection name="test">
  <filer class="org.apache.xindice.core.filer.BTreeFiler" />
  <writer class="org.apache.xindice.core.inlinemeta.ResourceTypeWriter"/>
</collection>

In this example the data in each record in the "test" collection will be
prefixed by three additional bytes:

    byte 1: length of the header
    byte 2: id of the metadata reader (ResourceType => 1)
    byte 3: resource type byte (1 => xml, 2 => binary)

Yeah it's a bit wasteful, but three bytes, so what.
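
For anyone who wants to see the layout in code, here is a minimal
sketch of writing and reading that three-byte header.  It is not the
actual ResourceTypeWriter; the class and method names are hypothetical,
and I'm assuming here that the length byte counts all three header
bytes, itself included.

```java
import java.util.Arrays;

// Hypothetical helper illustrating the three-byte header layout described above.
final class ResourceTypeHeader {
    static final byte READER_ID = 1;     // ResourceType => 1
    static final byte TYPE_XML = 1;      // 1 => xml
    static final byte TYPE_BINARY = 2;   // 2 => binary
    static final int HEADER_LENGTH = 3;  // assumption: the length byte counts itself

    // Prefix the data with [length, reader id, resource type].
    static byte[] wrap(byte type, byte[] data) {
        byte[] record = new byte[HEADER_LENGTH + data.length];
        record[0] = (byte) HEADER_LENGTH;
        record[1] = READER_ID;
        record[2] = type;
        System.arraycopy(data, 0, record, HEADER_LENGTH, data.length);
        return record;
    }

    // Read the resource type byte back out of a stored record.
    static byte resourceType(byte[] record) {
        if (record.length < HEADER_LENGTH || record[1] != READER_ID) {
            throw new IllegalArgumentException("not a ResourceType header");
        }
        return record[2];
    }

    // Skip the header, using its length byte, and return the original data.
    static byte[] payload(byte[] record) {
        if (record.length == 0) {
            throw new IllegalArgumentException("empty record");
        }
        int headerLength = record[0] & 0xFF;
        if (headerLength > record.length) {
            throw new IllegalArgumentException("header longer than record");
        }
        return Arrays.copyOfRange(record, headerLength, record.length);
    }
}
```

Because the reader skips the header by its length byte rather than a
hard-coded 3, the same payload() logic would keep working if the header
grew.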


Here's a somewhat more complex (and also currently working) example in
which the metadata will include an MD5 digest and a resource type flag.

<collection name="test" compressed="true">
  <filer class="org.apache.xindice.core.filer.BTreeFiler" />
  <writer class="org.apache.xindice.core.inlinemeta.AggregatingWriter">
    <writer
      class="org.apache.xindice.core.inlinemeta.DigestWriter"
      algorithm="MD5" />
    <writer class="org.apache.xindice.core.inlinemeta.ResourceTypeWriter"/>
  </writer>
</collection>

In this example, there's an overall header consisting of the first two
bytes described above (header length and reader id), and then each of
the aggregated metadata items has its own two-byte header, which the
AggregatingReader uses to figure out what to do.
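
Roughly, the nested layout could be built and walked like this.  This
is a sketch under assumptions, not the real AggregatingWriter or
AggregatingReader: I'm assuming every two-byte header is (length,
reader id), that each length counts its own two header bytes, and the
ids 0 (aggregate) and 2 (digest) are made up; only ResourceType => 1
comes from the example above.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an aggregated inline-metadata header.
final class AggregatedHeaderSketch {
    static final int AGGREGATE_ID = 0;      // hypothetical id
    static final int RESOURCE_TYPE_ID = 1;  // ResourceType => 1
    static final int DIGEST_ID = 2;         // hypothetical id

    // One entry: [length, reader id, value...]; length includes the two header bytes.
    static byte[] entry(int readerId, byte[] value) {
        if (value.length + 2 > 255) {
            throw new IllegalArgumentException("entry does not fit a one-byte length");
        }
        byte[] e = new byte[2 + value.length];
        e[0] = (byte) e.length;
        e[1] = (byte) readerId;
        System.arraycopy(value, 0, e, 2, value.length);
        return e;
    }

    // Overall header wrapping an MD5 digest entry and a resource type entry.
    static byte[] buildHeader(byte[] data, byte resourceType) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("MD5 not available", ex);
        }
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        byte[] d = entry(DIGEST_ID, digest);
        byte[] t = entry(RESOURCE_TYPE_ID, new byte[] {resourceType});
        body.write(d, 0, d.length);
        body.write(t, 0, t.length);
        return entry(AGGREGATE_ID, body.toByteArray());
    }

    // Walk the entries inside the overall header, keyed by reader id.
    static Map<Integer, byte[]> parse(byte[] header) {
        if (header.length < 2) {
            throw new IllegalArgumentException("header too short");
        }
        int end = header[0] & 0xFF;
        if (end < 2 || end > header.length) {
            throw new IllegalArgumentException("bad overall length");
        }
        Map<Integer, byte[]> values = new LinkedHashMap<>();
        int pos = 2;
        while (pos < end) {
            if (pos + 2 > end) {
                throw new IllegalArgumentException("truncated entry header");
            }
            int len = header[pos] & 0xFF;
            int id = header[pos + 1] & 0xFF;
            if (len < 2 || pos + len > end) {
                throw new IllegalArgumentException("bad entry length");
            }
            values.put(id, Arrays.copyOfRange(header, pos + 2, pos + len));
            pos += len;
        }
        return values;
    }
}
```

Under these assumptions the header for the configuration above is 23
bytes: 2 for the overall header, 18 for the MD5 entry, 3 for the
resource type entry.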


This could easily be expanded to include the other stuff you guys are
maintaining separately from the data record:

<collection name="test" compressed="true" enable-inline-meta="true">
  <filer class="org.apache.xindice.core.filer.BTreeFiler" />
  <writer class="org.apache.xindice.core.inlinemeta.AggregatingWriter">
    <writer
      class="org.apache.xindice.core.inlinemeta.LastModifiedWriter" />
    <writer
      class="org.apache.xindice.core.inlinemeta.LastAccessWriter" />
    <writer
      class="org.apache.xindice.core.inlinemeta.DigestWriter"
      algorithm="MD5" />
    <writer class="org.apache.xindice.core.inlinemeta.ResourceTypeWriter"/>
  </writer>
</collection>

Incidentally, the writer configuration can be changed at any time
without breaking anything.  It's only turning on the inline metadata
initially which is a bit tricky.


My only concern is whether the performance gain is worth the complexity
pain.  The code is pretty simple and totally modular and non-intrusive
(there's probably only 5 or 10 lines of inline-metadata-specific code in
Collection), but every new line of code is a new place for stuff to
break, and a challenge to the poor fool who has to figure out what's
going on....

        Gary
> 
> 
> -----Original Message-----
> From: Gary Shea [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 12, 2003 12:07 PM
> To: [EMAIL PROTECTED]
> Subject: metadata thoughts
> 
> 
> I just finished adding binary resource support, and in the process ended
> up writing an 'inline' metadata facility, where the metadata is stored
> as a header on the data.  The metadata facility is enabled and
> configured on a per-collection basis.
> 
> I'm now re-considering metadata, mostly because I don't think I gave the
> existing metadata facility a fair chance, and want to get some group
> feedback.  There were three reasons why I didn't use Dave Viner's
> metadata facility:
> 
> 1) it doubles the number of disk writes needed when a resource
>     is inserted/updated
> 2) I _think_ the current implementation is not safe for internal use,
>     as I believe there is public code for changing arbitrary metadata
>     values (please correct me if I'm wrong...)
> 3) sheer laziness
> 
> A while back there was a metadata discussion on this list, and I've read
> that discussion.  I didn't detect any consensus about what sort of
> metadata should be supported.
> 
> It seems clear that collection-level metadata is best off in a 'system
> table', which Dave Viner's metadata system models nicely.  Per-document
> data is less clear.  Some of it will change with every save/update, some
> won't.  The resource type stuff I just did isn't likely to change all
> that often and might be a candidate for the non-inline metadata, if it
> is safe from user tampering.  On the other hand, I am currently working
> on Xindice enhancements that require per-document metadata likely to
> change with each update/save.
> 
> I'm interested in hearing arguments pro and con.
> 
> Regards,
> 
>       Gary
> 
> 
> 
> 

Regards,

        Gary Shea
        GTS Design Consulting
        shea AT gtsdesign DOT com
