Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-21 Thread Michael McCandless
On Thu, Dec 20, 2012 at 3:54 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote: If you stuff the end of the span into the payload you'd have to create a custom variant of PhraseQuery to properly match based on the end span. How different is this from the functionality already available

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-20 Thread Wu, Stephen T., Ph.D.
If you stuff the end of the span into the payload you'd have to create a custom variant of PhraseQuery to properly match based on the end span. How different is this from the functionality already available through SpanQuery? stephen
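
Editor's sketch: the idea of "stuffing the end of the span into the payload" can be illustrated outside Lucene. The following is a minimal Python sketch, not Lucene code; the postings layout, the `NP` annotation term, and the function names are all assumptions made up for illustration. It stores each annotation at its start position with the (exclusive) end position packed into a payload, then matches on the decoded end.

```python
import struct

def encode_end(end_pos: int) -> bytes:
    """Pack a span's exclusive end position into a 4-byte big-endian payload."""
    return struct.pack(">i", end_pos)

def decode_end(payload: bytes) -> int:
    """Recover the end position from a payload written by encode_end."""
    return struct.unpack(">i", payload)[0]

# Hypothetical postings for one document: term -> list of (position, payload).
# Annotation terms are indexed at their start position; the end lives in the payload.
postings = {
    "NP": [(0, encode_end(3)), (7, encode_end(9))],  # two annotated spans
    "wolf": [(2, None)],                             # an ordinary token
}

def spans_containing(postings, annotation, term):
    """Return (start, end) spans of `annotation` that cover an occurrence of `term`."""
    term_positions = [pos for pos, _ in postings.get(term, [])]
    hits = []
    for start, payload in postings.get(annotation, []):
        end = decode_end(payload)
        if any(start <= pos < end for pos in term_positions):
            hits.append((start, end))
    return hits
```

The matcher has to decode the payload for every candidate, which is the extra query-time work the message above refers to when it says a custom query variant would be needed.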

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-18 Thread Michael McCandless
On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober schno...@ids-mannheim.de wrote: Am 13.12.2012 12:27, schrieb Michael McCandless: For example: - part of speech of a token. - syntactic parse subtree (over a span). - semantically normalized phrase (to canonical text or ontological code).

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-18 Thread Michael McCandless
On Thu, Dec 13, 2012 at 10:09 AM, Glen Newton glen.new...@gmail.com wrote: Unfortunately, Lucene doesn't properly index spans (it records the start position but not the end position), so that limits what kind of matching you can do at search time. If this could be fixed (i.e. indexing the

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-18 Thread Carsten Schnober
On 18.12.2012 12:36, Michael McCandless wrote: On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober schno...@ids-mannheim.de wrote: This is a relatively easy example, but how would you deal with e.g. annotations that include multiple tokens (as in spans), such as chunks, or relations between

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Michael McCandless
On Wed, Dec 12, 2012 at 9:08 PM, lukai lukai1...@gmail.com wrote: Do we have any plan to decouple the index process? Lucene was designed for search, but judging by the questions people ask in this thread, the needs sometimes go beyond search functionality. For example, we might want to customize our scoring

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Carsten Schnober
On 13.12.2012 12:27, Michael McCandless wrote: For example: - part of speech of a token. - syntactic parse subtree (over a span). - semantically normalized phrase (to canonical text or ontological code). - semantic group (of a span). - coreference link. So for example

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
Unfortunately, Lucene doesn't properly index spans (it records the start position but not the end position), so that limits what kind of matching you can do at search time. If this could be fixed (i.e. indexing the _end_ of a span) I think all the things that I want to do, and the things that can

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Wu, Stephen T., Ph.D.
That would be really nice. Full standoff annotations open a lot of doors. If we had them, though, I'm not sure exactly which of Mike's methods you'd use? I thought payloads were completely token-based and could not be attached to spans regardless. And the SynonymFilter is really to mimic the

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Lance Norskog
Parts-of-speech is available now, in the indexer. LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an Apache project for natural-language processing. Some parts are in Solr that could be in Lucene.

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
It is not clear this is exactly what is needed/being discussed. From the issue: We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position. This adds it to a token, not a span. 'same position' does not

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Lance Norskog
I should not have added that note. The OpenNLP patch gives a concrete example of adding an annotation to text. On 12/13/2012 01:54 PM, Glen Newton wrote: It is not clear this is exactly what is needed/being discussed. From the issue: We are also planning a Tokenizer/TokenFilter that can put

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
Cool! Sounds great! :-) Any pointers to a (Lucene) example that attaches a payload to a start..end span that is more than one token? thanks, -Glen On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog goks...@gmail.com wrote: I should not have added that note. The Opennlp patch gives a concrete

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread SUJIT PAL
Hi Glen, I don't believe you can attach a single payload to multiple tokens. What I did for a similar requirement was to combine the tokens into a single underscore-delimited token and attach the payload to it. For example: The Big Bad Wolf huffed and puffed and blew the house of the Three
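
Editor's sketch: the workaround described above (merge a multi-token span into one underscore-delimited token and attach the payload to that token) could look roughly like this in plain Python. This is not Lucene code and not the poster's implementation; the function name, the token tuples, and the `CHARACTER` payload label are hypothetical.

```python
def merge_span(tokens, start, end, payload):
    """Replace tokens[start:end] with one underscore-joined token carrying `payload`.

    Returns a list of (text, payload) pairs; untouched tokens get a None payload.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be a non-empty range inside the token list")
    merged = ("_".join(tokens[start:end]), payload)
    before = [(t, None) for t in tokens[:start]]
    after = [(t, None) for t in tokens[end:]]
    return before + [merged] + after

tokens = ["the", "big", "bad", "wolf", "huffed"]
result = merge_span(tokens, 1, 4, b"CHARACTER")  # hypothetical annotation label
```

One consequence of this design is that the individual words inside the merged token are no longer searchable on their own unless they are also indexed separately.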

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread Wu, Stephen T., Ph.D.
Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed? Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time? For

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread Glen Newton
+10 These are the kind of things you can do in GATE[1] using annotations[2]. A VERY useful feature. -Glen [1]http://gate.ac.uk [2]http://gate.ac.uk/wiki/jape-repository/annotations.html On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote: Is there any

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread lukai
Do we have any plan to decouple the index process? Lucene was designed for search, but judging by the questions people ask in this thread, the needs sometimes go beyond search functionality. For example, we might want to customize our scoring function based on payload. Sometimes I don't need to store TF/IDF information.

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Johannes.Lichtenberger
On 11/28/2012 01:11 AM, Michael McCandless wrote: Flexible indexing is the ability to make your own codec, which controls the reading and writing of all index parts (postings, stored fields, term vectors, deleted docs, etc.). So for example if you want to store some postings as a bit set

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Jack Krupansky
"I will probably have to implement my own data structure and parser/tokenizer/stemmer" Why? I mean, I think the point of the Lucene architecture is that the codec level is completely independent of the analysis level. The end result of analysis is a value to be stored from the application

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Wu, Stephen T., Ph.D.
Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed? If I understand you correctly, it's a little different from what's happening in your blog posts:

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Michael McCandless
On Fri, Nov 30, 2012 at 12:25 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote: Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed? If I understand you correctly, it's a little

What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-27 Thread Wu, Stephen T., Ph.D.
Following up on a previous question... What is flexible indexing in Lucene 4.0? We assumed it was the ability to easily make new postings formats/codecs -- but a response below says that would be tricky? stephen On 11/27/12 11:48 AM, David Causse dcau...@spotter.com wrote: Hi, We use

Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-27 Thread Michael McCandless
Flexible indexing is the ability to make your own codec, which controls the reading and writing of all index parts (postings, stored fields, term vectors, deleted docs, etc.). So for example if you want to store some postings as a bit set instead of the block format that's the default coming up
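
Editor's sketch: to make the "postings as a bit set" example concrete, here is a minimal Python illustration of the encoding idea only. It is not the Lucene codec API; the function names and the `max_doc` parameter are assumptions for illustration. A postings list of document IDs is written as one bit per document and read back.

```python
def to_bitset(doc_ids, max_doc):
    """Encode doc IDs in [0, max_doc) as a bit set, one bit per document."""
    bits = bytearray((max_doc + 7) // 8)
    for doc_id in doc_ids:
        if not 0 <= doc_id < max_doc:
            raise ValueError(f"doc id {doc_id} outside [0, {max_doc})")
        bits[doc_id >> 3] |= 1 << (doc_id & 7)
    return bytes(bits)

def from_bitset(bits):
    """Decode a bit set back into a sorted list of doc IDs."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bits)
        for bit in range(8)
        if (byte >> bit) & 1
    ]

doc_ids = [0, 3, 8, 15]
bits = to_bitset(doc_ids, max_doc=16)
```

A bit set costs a fixed `max_doc / 8` bytes per term regardless of how many documents match, so it only pays off for terms that occur in a large fraction of documents.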