On Nov 18, 2007, at 11:09 PM, Yonik Seeley wrote:
I'm also wondering how others have accomplished this. Grant Ingersoll
noted that one of the original use cases was XPath queries so I'm
particularly interested in finding out if anyone has implemented that,
and how.
Me too. Any clarifications on that Grant???
From what I understand from Michael Busch, you can store the path at
each token, but this doesn't seem efficient to me. I would think you
may want to come up with some more efficient encoding. I am cc'ing
Michael on this thread to see if he is able to shed any light on the
subject (he may not be able to b/c of employer reasons). If he
can't, then we can brainstorm a bit more on how to do it most
efficiently.
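
To make the "more efficient encoding" idea concrete, here is a rough sketch of one option: assign each distinct element path a small integer id and store only that id, as variable-length bytes, in the token's payload. This is just a brainstorming starting point, not what Michael does and not an existing Lucene API. PathDictionary and all of its methods are hypothetical names, and the sketch assumes the dictionary itself is kept somewhere alongside the index.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps element paths to small ids so a payload holds an id, not the path. */
final class PathDictionary {
    private final Map<String, Integer> idsByPath = new HashMap<>();
    private final List<String> pathsById = new ArrayList<>();

    /** Returns the id for a path, assigning the next free id the first time the path is seen. */
    int idFor(String path) {
        Integer id = idsByPath.get(path);
        if (id == null) {
            id = pathsById.size();
            idsByPath.put(path, id);
            pathsById.add(path);
        }
        return id;
    }

    /** Looks a path up again from its id. */
    String pathFor(int id) {
        return pathsById.get(id);
    }

    /** Encodes the path's id as 7-bit groups, low group first; high bit set means "more bytes follow". */
    byte[] encode(String path) {
        int value = idFor(path);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    /** Reverses encode(): rebuilds the id from the payload bytes and returns the stored path. */
    String decode(byte[] payload) {
        int value = 0;
        int shift = 0;
        for (byte b : payload) {
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return pathFor(value);
    }
}
```

With this, the first 128 distinct paths cost one byte per token instead of the full path string. The obvious downside is that the payload alone says nothing about ancestors or descendants, so anything beyond an exact path match would still need the dictionary at query time.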
An interesting thing here to think about is how we can come up with
more general support for XML documents and other structured docs. For
instance, a common syntax used in NLP for tokens is something like:

    The|DET quick|JJ red|JJ fox|NN jumped|VB over|??? the|DET lazy|JJ
    brown|JJ dogs|NN

or other variations that also apply phrase identification, semantic
relationships, etc. These things, to me, all logically fit as payloads,
so it may be wise to think about coming up with one or two generic
mechanisms for these kinds of things. One could be the default
XML/XPath marked-up document, but another might be this pipe notation
that is common in NLP.
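
As a rough illustration of the pipe notation case, the sketch below splits whitespace-separated "term|TAG" text into terms, keeping each tag as the bytes that would travel as that token's payload. It is plain Java with no Lucene dependency. PipeTaggedTokens and TaggedToken are hypothetical names, and the choice of UTF-8 tag bytes as the payload format is an assumption, not an agreed design.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits "term|TAG" text into terms with the tag kept as payload bytes. */
final class PipeTaggedTokens {

    /** One token: the indexable term plus the tag bytes that would ride along as its payload. */
    static final class TaggedToken {
        final String term;
        final byte[] payload;

        TaggedToken(String term, byte[] payload) {
            this.term = term;
            this.payload = payload;
        }

        /** Decodes the payload back into the tag string. */
        String tag() {
            return new String(payload, StandardCharsets.UTF_8);
        }
    }

    /** Parses whitespace-separated chunks; a chunk without a '|' gets an empty payload. */
    static List<TaggedToken> parse(String text) {
        List<TaggedToken> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // blank input produces a single empty chunk
            }
            // Split on the last '|' so a term may itself contain a pipe character.
            int bar = chunk.lastIndexOf('|');
            String term = bar < 0 ? chunk : chunk.substring(0, bar);
            String tag = bar < 0 ? "" : chunk.substring(bar + 1);
            tokens.add(new TaggedToken(term, tag.getBytes(StandardCharsets.UTF_8)));
        }
        return tokens;
    }
}
```

For example, PipeTaggedTokens.parse("The|DET quick|JJ fox|NN") yields three tokens whose terms are The, quick and fox and whose tags decode to DET, JJ and NN. A real analyzer would emit these through a token stream and attach the bytes as the payload at each position.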
See http://wiki.apache.org/lucene-java/Payload_Planning and the
related threads
-Grant