On Mon, 2004-01-19 at 08:31, Stefano Mazzocchi wrote:

> On 19 Jan 2004, at 15:12, Michael Oliver wrote:
> 
> > On Mon, 2004-01-19 at 06:32, Stefano Mazzocchi wrote:
> >
> >> I personally wouldn't know how to make use of a query against full
> >> text *and* properties. This is because such a query looks weird to me:
> >> full-text is the least structured search possible (get me everything,
> >> but I don't know where) while properties tend to be very much
> >> structured (last modified time, author, and so on).
> >>
> >> There is a decades-long discussion on what is data and what is
> >> metadata and I don't want to touch that with a stick, but I think
> >> that if you need to do full-text search on your metadata there is
> >> something wrong.
> >
> > Stefano, with all due respect, there is nothing wrong with a full-text
> > search on metadata, because metadata in this case can be any property
> > of any resource in the repository, and that metadata can be free-form
> > text.
> 
> Well, this is because I try to avoid having metadata that can be
> free-form text, but as I said, this is my way and I don't want to
> impose it on others.
> 

Well, as long as we CAN have properties that are free-form text, we can't
avoid them.

> > consider a search query like
> >
> > doctype="memo" and description contains "Fire Stefano" and contents
> > contains "January"
> 
> I would think that this schema is not appropriate. A description is 
> part of content, not metadata. But it's like arguing about whether 
> something should be an element or an attribute... sometimes it's just 
> subjective.
> 

No, I don't think so.  Metadata IS data about data, eh?  And a
"description" can't be anything else.  Surely you don't think that the
"description" of a binary file stored in Slide (the content), which is
text, is part of that content?  Slide/WebDAV properties that can be
created and saved by the user are all about categorizing and describing
the content, almost for the express purpose of being able to find the
right content, and therefore should be very much part of the search
mechanism.
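
To make the example query quoted just below concrete, here is a minimal
Python sketch of what it asks for. This is not Slide code: the Resource
class, the contains() helper and the sample repository are all
hypothetical, and "contains" is approximated by a case-insensitive
substring test rather than a real full-text match.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical stand-in for a stored resource: properties plus extracted text."""
    uri: str
    properties: dict = field(default_factory=dict)
    contents: str = ""


def contains(text, phrase):
    """Case-insensitive substring test, a crude stand-in for full-text 'contains'."""
    return phrase.lower() in text.lower()


def matches(res):
    """doctype="memo" and description contains "Fire Stefano"
    and contents contains "January"."""
    props = res.properties
    return (
        props.get("doctype") == "memo"                              # structured property
        and contains(props.get("description", ""), "Fire Stefano")  # free-form property
        and contains(res.contents, "January")                       # full-text content
    )


# Made-up sample data, only to exercise the query.
repository = [
    Resource("/memos/a.txt",
             {"doctype": "memo", "description": "Fire Stefano immediately"},
             "Effective January 2004, the following applies."),
    Resource("/memos/b.txt",
             {"doctype": "memo", "description": "Lunch menu"},
             "January specials."),
    Resource("/reports/c.txt",
             {"doctype": "report", "description": "Fire Stefano, draft"},
             "January figures."),
]

hits = [r.uri for r in repository if matches(r)]
```

Only the first resource satisfies all three clauses; the other two each
fail on a property, even though their contents mention "January".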

> > doctype and description are properties with string values that would be
> > indexed and matched with the same index as the contents.
> 
> So, are you suggesting that we index everything? [not critical, just 
> curious]


Absolutely.  If someone wants to save some piece of information, they will
want to retrieve it and search for it.

> > Not everybody uses the Database Stores; some actually prefer the XML
> > Stores, so an index of the XML should be full text, yes?
> 
> This is actually a good question and I don't have a definitive answer. 
> Indexing all text() nodes might be good, but what about namespaces? 
> What about attributes? Should we care?
> 
> A while ago, thinking about this, I proposed the addition of a
> numerical namespace to the lucene mailing list, but the suggestion
> didn't catch on [I also have the impression they didn't get my point,
> but it was low priority so I dropped the subject].
> 
> I think that indexing an XHTML file is relatively easy. Indexing an FO
> file with inlined SVG images might not be so straightforward, or lead
> to the same quality of results, without a specific indexer. There
> might be a general way to index XML content, but it's not as easy as it
> seems, and lucene isn't designed for multi-dimensional content, only
> for mono-dimensional content.

I think you just made the point that indexing just content is hit or
miss, but an index that spans the content (where feasible) AND the
metadata is more likely to achieve the desired result set for any given
query.  Erik should be jumping in again any moment...;-)
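
As a rough illustration of one index spanning both, here is a small
Python sketch of an inverted index keyed by (field, term), where property
values and content text share the same postings table. It is not Lucene
or Slide code: CombinedIndex, tokenize() and the sample data are
hypothetical. It also assumes no property is itself named "contents",
and it ANDs individual terms together rather than matching phrases.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class CombinedIndex:
    """Hypothetical inverted index covering property values and content text."""

    def __init__(self):
        # (field name, term) -> set of resource URIs containing that term
        self.postings = defaultdict(set)

    def add(self, uri, properties, contents):
        # Every property value is tokenized and indexed under its own name.
        for name, value in properties.items():
            for term in tokenize(str(value)):
                self.postings[(name, term)].add(uri)
        # Content text goes into the same table under the field "contents".
        for term in tokenize(contents):
            self.postings[("contents", term)].add(uri)

    def search(self, **field_terms):
        """Return URIs matching every term in every given field (AND semantics)."""
        result = None
        for field_name, text in field_terms.items():
            for term in tokenize(text):
                hits = self.postings.get((field_name, term), set())
                result = set(hits) if result is None else result & hits
        return result or set()


# Made-up sample data.
index = CombinedIndex()
index.add("/memos/a.txt",
          {"doctype": "memo", "description": "Fire Stefano immediately"},
          "Effective January 2004, the following applies.")
index.add("/memos/b.txt",
          {"doctype": "memo", "description": "Lunch menu"},
          "January specials.")
index.add("/img/c.png",
          {"doctype": "image", "description": "Fire Stefano poster"},
          "")  # binary file: no extractable text

result = index.search(doctype="memo", description="Fire Stefano", contents="January")
binary_hits = index.search(description="poster")
```

The binary file has no indexable content at all, yet it is still found
through its description property, which is the case a content-only index
would miss.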

> 
> But I'm wide open to ideas in this area.

I know you are and it is appreciated.

> 
> --
> Stefano.