RE: metadata thoughts

Gary Shea 13 Mar 2003 20:49:32 -0000

Hi Dave --

On Thu, 13 Mar 2003, at 10:14 [-0800], Dave Viner ([EMAIL PROTECTED]) wrote:


> Hi Gary,
>       I too don't want this discussion to be a "mine is better than yours".
> Bickering rarely leads to a good solution.  Sorry if my initial comments
> came off that way.  I really didn't intend them to have that effect.

No problem, I was just making it clear that I don't want to go there
either...

To clarify a little more... right now I think we are still working
toward a shared understanding of 1) what we can do with the different
implementations, 2) what the gains and penalties are for the different
implementations, and 3) whether the gains justify having two.
Until we seem to share the same idea of all three, at least in terms of
the facts, I'm going to keep bringing up relevant points until I can't
think of any more :)  So, in the immortal words of Frank Zappa, "here's
some more".

> 
>       You're correct in that your implementation avoids the double disk write
> issue.  I'm not sure what craziness was coursing thru me when I missed that,
> rather obvious, point.
> 
>       This topic was discussed (at length) before:
> http://marc.theaimsgroup.com/?l=xindice-dev&m=104066672331546&w=2
> http://marc.theaimsgroup.com/?l=xindice-dev&m=103946437104874&w=2
> http://marc.theaimsgroup.com/?l=xindice-dev&m=103828918030140&w=2
> http://marc.theaimsgroup.com/?t=102873960400001&r=1&w=2

I'd read the more recent thread, will read the 8/2002 thread later today.
Thanks for the link!

> 
> The "inline" metadata approach certainly provides a lot of functionality.
> However, as noted in some of the archived messages, for some applications
> (mine included), altering the document itself is simply not an option.
> True, one could provide an API to fetch the document without the inlined
> metadata, but that requires more work than I'd want to do simply for the
> possibility of accessing metadata.

Let me see if I understand what you are implying with "altering the
document itself is simply not an option".  There are essentially four
cases of interest:

1) The original save of the document.  No problem, inline metadata goes
        into the record with the document.

2) Modification of the document.  Doesn't happen!

3) Modification of the system metadata without changing the document.
Last access time seems like the canonical example where this is an
issue, and I hadn't thought of it before.

4) Introducing metadata into a collection which doesn't have it.  Again,
the document must be read and re-written.

I am guessing that (3) is what you're talking about and it's a good
point.  Is "most recent access time" typically available in SQL db's for
instance?  Is it something your applications need/use?  It _sounds_
useful, if only for reports, and the inline model cannot support it
efficiently.  I suspect that the BTree could be modified to manage
certain kinds of metadata efficiently, but that's yet another set of
changes.  It does make me wonder if that's the way I should be
approaching this issue, though... first record block is metadata,
following blocks are data?  Interesting...
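
A minimal sketch of the "first record block is metadata, following
blocks are data" idea, in Java. The 8-byte layout and the names
BlockRecord, pack, touch, lastAccess and document are assumptions made
up for illustration; this is not Xindice's actual record format. The
point it tries to show is that a last-access update can overwrite only
the leading metadata block and leave the document bytes alone.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical layout, not Xindice's real record format:
// [ 8-byte last-access time ][ document bytes ... ]
final class BlockRecord {
    static final int META_SIZE = 8; // assumed fixed-size metadata block

    // Build a record with the metadata block in front of the document.
    static byte[] pack(long lastAccess, byte[] doc) {
        return ByteBuffer.allocate(META_SIZE + doc.length)
                .putLong(lastAccess)
                .put(doc)
                .array();
    }

    // Update the access time by overwriting only the metadata block.
    static void touch(byte[] record, long now) {
        ByteBuffer.wrap(record).putLong(0, now);
    }

    // Read the access time back out of the metadata block.
    static long lastAccess(byte[] record) {
        return ByteBuffer.wrap(record).getLong(0);
    }

    // What a reader would hand back: the document without the metadata.
    static byte[] document(byte[] record) {
        return Arrays.copyOfRange(record, META_SIZE, record.length);
    }
}
```

With a fixed-size block in front, touch() never moves or rewrites the
document; a variable-length metadata block would need a length prefix
and could force a rewrite when it grows.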


As far as an API to fetch the document without the metadata, no separate
API is required.  Fetching a document works the same way it always has.
The metadata is automatically stripped off by the reader plugins, so
there is essentially no performance penalty.  I haven't yet dealt with
accessing system metadata from outside of the internals, but I
understand the desire to do so.  I suspect the API you're using now
would work fine for the inline system metadata as well, and the inline
performance advantage would still apply, assuming caching.

> 
>       On a separate note, have you performance tested the existing MetaData
> implementation and found it to be below your requirements?  If so, are you
> at liberty to disclose your requirements and tests?  I'd love to see Xindice
> improve the performance of Metadata if it's subpar.

That's an excellent question.  I saw the doubling issue and thought
"this is crazy".  That's as far as I went.  I take your point: if ya
can't tell, does it matter?

On the other hand, for a general-purpose tool like a database, I have
the impression that the goal is to go as fast as possible, within
reason.  Xindice could have been built as file-per-document,
directory-per-collection, and it would have only been a little slower.
Lots of much easier things could have been done, but in fact the current
solution is fairly close to state of the art.  What drove that?  Not a
particular application, I'd wager (and maybe lose :), but probably
someone's urge to "get it right".  I am all for metadata.  For optional
metadata that is only updated by the user, independent of the document,
performance is not that critical, and your solution is optimal anyway!
But doubling access time for every document read or write truly concerns
me.  Can it really be argued that doubling disk accesses is irrelevant?
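
To make the "doubling" concern concrete, here is a toy model in Java.
It is only a sketch under stated assumptions: each map lookup stands in
for one disk access, and the class AccessCount and its methods are
hypothetical names, not part of Xindice. It measures nothing real; it
only counts the accesses each layout implies per document read.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: every map lookup stands in for one disk access.
final class AccessCount {
    int accesses = 0;
    final Map<String, String> docs = new HashMap<>();   // document store
    final Map<String, Long> meta = new HashMap<>();     // separate metadata store
    final Map<String, String> inline = new HashMap<>(); // metadata kept with the document

    // Separate stores: one access for the document, one for its metadata.
    String readSeparate(String key) {
        accesses++;
        String doc = docs.get(key);
        accesses++;
        meta.get(key);
        return doc;
    }

    // Inline: a single access returns the record holding both.
    String readInline(String key) {
        accesses++;
        return inline.get(key);
    }
}
```

Whether the second access matters in practice depends on caching and
on how the stores are laid out on disk, which this model ignores.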

I hear your argument and would like to pursue this question further.

> 
>       Have you contacted Murray about the XNode implementation?  It was put
> into the scratchpad, and looked promising but we hit some odd licensing
> issues that were never resolved.  (Or at least that's the last I remember
> of it.)

No I haven't.  I will try to read about it again, I've forgotten
what I read the first time!

Regards,

        Gary


> dave
