Re: Using term offsets for hit highlighting

2012-05-23 Thread Simon Willnauer
alan,

I merged the branch manually and created a new branch from it. It's
here: https://svn.apache.org/repos/asf/lucene/dev/branches/LUCENE-2878
The branch compiles, but there are lots of nocommits / TODOs.

If you have questions, please ask; I will help as much as I can.

simon

On Tue, May 22, 2012 at 8:38 PM, Alan Woodward
alan.woodw...@romseysoftware.co.uk wrote:
 Hey, I reckon I can have a decent go at getting the branch updated.  Is it 
 best to work this out as a patch applying to trunk?  Any patch that merges in 
 all the trunk changes to the branch is going to be absolutely massive…

 On 17 May 2012, at 13:15, Simon Willnauer wrote:

 ok man. I will try to merge up the branch. I tell you this is going to
 be messy and it might not compile but I will make it reasonable so you
 can start.

 simon

 On Thu, May 17, 2012 at 8:03 AM, Alan Woodward
 alan.woodw...@romseysoftware.co.uk wrote:
 Sorry for vanishing for so long, life unexpectedly caught up with me...  
 I'm going to have some time to look at this again next week though, if 
 you're interested in picking it up again.

 On 21 Mar 2012, at 09:02, Alan Woodward wrote:

 That would be great, thanks!  I had a go at merging it last night, but 
 there are a *lot* of changes that I haven't got my head round yet, so it 
 was getting pretty messy.

 On 21 Mar 2012, at 08:49, Simon Willnauer wrote:

 Alan, if you want I can just merge the branch up next week and we
 iterate from there?

 simon

 On Tue, Mar 20, 2012 at 12:34 PM, Erick Erickson
 erickerick...@gmail.com wrote:
 Yep, the first challenge is always getting the old patch(es) to 
 apply.

 On Tue, Mar 20, 2012 at 4:09 AM, Alan Woodward
 alan.woodw...@romseysoftware.co.uk wrote:
 Thanks for all the offers of help!  It looks as though most of the hard 
 work has already been done, which is exactly where I like to pick up 
 projects.  :-)

 Maybe the best place to start would be for me to rebase the branch 
 against trunk, and see what still fits?  I think there have been some 
 fairly major changes in the internals since July last year.

 On 19 Mar 2012, at 17:07, Mike Sokolov wrote:

 I posted a patch with a Collector somewhat similar to what you 
 described, Alan - it's attached to one of the sub-issues 
 https://issues.apache.org/jira/browse/LUCENE-3318.   It is in a fairly 
 complete alpha state, but has seen no production use of course, 
 since it relies on the remainder of the unfinished work in that 
 branch.  It works by creating a TokenStream based on match positions 
 returned from the query and passing that to the existing Highlighter.  
 Please feel free to get in touch if you decide to look into that and 
 have questions.


 -Mike

 On 03/19/2012 11:51 AM, Simon Willnauer wrote:
 On Mon, Mar 19, 2012 at 4:50 PM, Uwe Schindler u...@thetaphi.de wrote:

 Have you marked that for GSOC? Would be a good idea!

 yes I did

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de



 -Original Message-
 From: Simon Willnauer [mailto:simon.willna...@googlemail.com]
 Sent: Monday, March 19, 2012 4:43 PM
 To: dev@lucene.apache.org
 Subject: Re: Using term offsets for hit highlighting

 Alan, you made my day!

 The branch is kind of outdated but I looked at it lately and I can
 certainly help to get it up to speed. The feature in that branch is
 quite a big one and it's in a very early stage. Still I want to
 encourage you to take a look and work on it. I promise all my help
 with the issues!

 let me know if you have questions!

 simon

 On Mon, Mar 19, 2012 at 3:52 PM, Alan Woodward
 alan.woodw...@romseysoftware.co.uk  wrote:

 Cool, thanks Robert.  I'll take a look at the JIRA ticket.

 On 19 Mar 2012, at 14:44, Robert Muir wrote:


 On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward
 alan.woodw...@romseysoftware.co.uk  wrote:

 Hello,

 The project I'm currently working on requires the reporting of exact
 hit positions from some pretty hairy queries, not all of which are
 covered by the existing highlighter modules.  I'm working round this
 by translating everything into SpanQueries, and using the getSpans()
 method to locate hits (I've extended the Spans interface to make
 term offsets available - see
 https://issues.apache.org/jira/browse/LUCENE-3826).  This works for
 our use-case, but isn't terribly efficient, and obviously isn't
 applicable to non-Span queries.

 I've seen a bit of chatter on the list about using term offsets to
 provide accurate highlighting in Lucene.  I'm going to have a couple
 of weeks free in April, and I thought I might have a go at
 implementing this.  Mainly I'm wondering if there's already been
 thoughts about how to do it.  My current thoughts are to somehow
 extend the Weight and Scorer interface to make term offsets
 available; to get highlights for a given set of documents, you'd
 essentially run the query again, with a filter on just the documents
 you want highlighted, and have

Re: Using term offsets for hit highlighting

2012-05-23 Thread Alan Woodward
Sweet, thanks Simon.  I'll have a go at getting some failing tests passing to 
begin with.

Re: Using term offsets for hit highlighting

2012-05-23 Thread Simon Willnauer
hey alan,

I added position iterator support to ConjunctionTermScorer and
committed it to the branch. All tests in core that don't rely on
payloads are passing. Previously we had to decide up front whether we
need positions; the current code can pull them lazily, which causes
fewer changes to the Scorer API. I think we should keep it that way.
The only problem is that we currently have no way to tell the
iterators whether we need payloads or not. The same is true for
offsets, since they are now in the index. I think it would be good if
you could tackle the payloads first and pass some info to the
Scorer#positions() method so we can pull the right thing.
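
To make the "pull them lazily" idea concrete, here is a minimal sketch. It is not the code on the branch: `SketchTermScorer`, `positionReads` and `LazyPositionsDemo` are hypothetical names used for illustration only, and plain arrays stand in for the index. It only shows that position data is touched when a caller asks for it, and not for every matching document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PrimitiveIterator;

// Hypothetical stand-in for a scorer; NOT the branch's actual Scorer class.
class SketchTermScorer {
    private final int[] docIds;       // matching documents, in increasing order
    private final int[][] positions;  // positions[i] = term positions in docIds[i]
    private int current = -1;         // index of the current document
    int positionReads = 0;            // how often position data was actually pulled

    SketchTermScorer(int[] docIds, int[][] positions) {
        this.docIds = docIds;
        this.positions = positions;
    }

    // Advances to the next matching document, or returns -1 when exhausted.
    int nextDoc() {
        current++;
        return current < docIds.length ? docIds[current] : -1;
    }

    // Positions of the current document; only read when a caller asks for them.
    PrimitiveIterator.OfInt positions() {
        positionReads++;
        return Arrays.stream(positions[current]).iterator();
    }
}

class LazyPositionsDemo {
    // Walks all matches but pulls positions for one target document only.
    static List<Integer> positionsFor(SketchTermScorer scorer, int targetDoc) {
        List<Integer> result = new ArrayList<>();
        int doc;
        while ((doc = scorer.nextDoc()) != -1) {
            if (doc == targetDoc) {
                PrimitiveIterator.OfInt it = scorer.positions();
                while (it.hasNext()) {
                    result.add(it.nextInt());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SketchTermScorer scorer = new SketchTermScorer(
            new int[] {1, 4, 7},
            new int[][] {{0, 5}, {2}, {3, 8, 9}});
        System.out.println(positionsFor(scorer, 7)); // prints [3, 8, 9]
        System.out.println(scorer.positionReads);    // prints 1
    }
}
```

Note that `positions()` here takes no arguments, so the caller cannot say whether it also wants payloads or offsets, which is the gap described above.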

happy coding.

simon

Re: Using term offsets for hit highlighting

2012-05-23 Thread Alan Woodward
OK, so the most straightforward way to do that would be to change the signature 
to positions(boolean needsPayloads, boolean needsOffsets), I guess.  This is a 
new API so it's not breaking anything.  
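
A rough sketch of what that signature could look like, assuming a toy scorer backed by arrays. Only the `positions(boolean needsPayloads, boolean needsOffsets)` signature comes from the proposal; `SketchFlagScorer`, `SketchPosition`, the list return type and the -1 / null "not requested" markers are assumptions made for illustration, and the branch would presumably return an iterator instead.

```java
import java.util.ArrayList;
import java.util.List;

// One term occurrence. Fields that were not requested keep their "absent" values.
class SketchPosition {
    final int position;
    final int startOffset;  // -1 when offsets were not requested
    final int endOffset;    // -1 when offsets were not requested
    final byte[] payload;   // null when payloads were not requested

    SketchPosition(int position, int startOffset, int endOffset, byte[] payload) {
        this.position = position;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.payload = payload;
    }
}

// Hypothetical scorer over a single document; arrays stand in for postings.
class SketchFlagScorer {
    private final int[] positions;
    private final int[][] offsets;   // offsets[i] = {start, end} for positions[i]
    private final byte[][] payloads; // payloads[i] belongs to positions[i]

    SketchFlagScorer(int[] positions, int[][] offsets, byte[][] payloads) {
        this.positions = positions;
        this.offsets = offsets;
        this.payloads = payloads;
    }

    // The flags tell the scorer which extra data the caller wants pulled.
    List<SketchPosition> positions(boolean needsPayloads, boolean needsOffsets) {
        List<SketchPosition> out = new ArrayList<>();
        for (int i = 0; i < positions.length; i++) {
            int start = needsOffsets ? offsets[i][0] : -1;
            int end = needsOffsets ? offsets[i][1] : -1;
            byte[] payload = needsPayloads ? payloads[i] : null;
            out.add(new SketchPosition(positions[i], start, end, payload));
        }
        return out;
    }
}

class FlagScorerDemo {
    public static void main(String[] args) {
        SketchFlagScorer scorer = new SketchFlagScorer(
            new int[] {2, 6},
            new int[][] {{10, 15}, {31, 36}},
            new byte[][] {{1}, {2}});
        // A highlighter would ask for offsets but not payloads.
        for (SketchPosition p : scorer.positions(false, true)) {
            System.out.println(p.position + " [" + p.startOffset + "," + p.endOffset + ")");
        }
    }
}
```

Two booleans keep the call site simple; a flags or enum-set parameter would be an alternative if more kinds of per-position data turn up later.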

It'll be tomorrow morning before I have a proper go at this now (Cambridge Beer 
Festival tonight…).  Is the mailing list the best place to discuss this, or is 
JIRA/IRC better?

Re: Using term offsets for hit highlighting

2012-05-23 Thread Simon Willnauer
 in 
 that branch.  It works by creating a TokenStream based on match 
 positions returned from the query and passing that to the existing 
 Highlighter.  Please feel free to get in touch if you decide to 
 look into that and have questions.


 -Mike

 On 03/19/2012 11:51 AM, Simon Willnauer wrote:
 On Mon, Mar 19, 2012 at 4:50 PM, Uwe Schindler u...@thetaphi.de wrote:

 Have you marked that for GSOC? Would be a good idea!

 yes I did

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de



 -Original Message-
 From: Simon Willnauer [mailto:simon.willna...@googlemail.com]
 Sent: Monday, March 19, 2012 4:43 PM
 To: dev@lucene.apache.org
 Subject: Re: Using term offsets for hit highlighting

 Alan, you made my day!

 The branch is kind of outdated but I looked at it lately and I can
 certainly help to get it up to speed. The feature in that branch is
 quite a big one and it's in a very early stage. Still I want to
 encourage you to take a look and work on it. I promise all my help
 with the issues!

 let me know if you have questions!

 simon

 On Mon, Mar 19, 2012 at 3:52 PM, Alan Woodward
 alan.woodw...@romseysoftware.co.uk  wrote:

 Cool, thanks Robert.  I'll take a look at the JIRA ticket.

 On 19 Mar 2012, at 14:44, Robert Muir wrote:


 On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward
 alan.woodw...@romseysoftware.co.uk  wrote:

 Hello,

 The project I'm currently working on requires the reporting of exact
 hit positions from some pretty hairy queries, not all of which are
 covered by the existing highlighter modules.  I'm working round this
 by translating everything into SpanQueries, and using the getSpans()
 method to locate hits (I've extended the Spans interface to make
 term offsets available - see
 https://issues.apache.org/jira/browse/LUCENE-3826).  This works for
 our use-case, but isn't terribly efficient, and obviously isn't
 applicable to non-Span queries.

 I've seen a bit of chatter on the list about using term offsets to
 provide accurate highlighting in Lucene.  I'm going to have a couple
 of weeks free in April, and I thought I might have a go at
 implementing this.  Mainly I'm wondering if there's already been
 thoughts about how to do it.  My current thoughts are to somehow
 extend the Weight and Scorer interface to make term offsets
 available; to get highlights for a given set of documents, you'd
 essentially run the query again, with a filter on just the documents
 you want highlighted, and have a custom collector that gets the term
 offsets in place of the scores.
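
The idea in that last paragraph (re-run the query over just the documents to be highlighted, with a collector that receives offsets in place of scores) can be sketched roughly as follows. This is a toy, not Lucene code: `OffsetCollectorSketch`, `OffsetCollector`, `search` and `highlights` are hypothetical names, the "index" is a map from document id to text, and a substring scan stands in for reading offsets from the postings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class OffsetCollectorSketch {
    // Hypothetical callback: receives character offsets for each match, not a score.
    interface OffsetCollector {
        void collect(int docId, int startOffset, int endOffset);
    }

    // "Runs the query" for a single term over the filtered documents only.
    static void search(Map<Integer, String> docs, String term,
                       Set<Integer> filter, OffsetCollector collector) {
        if (term.isEmpty()) {
            return; // an empty term would match everywhere and never advance
        }
        for (Map.Entry<Integer, String> entry : docs.entrySet()) {
            if (!filter.contains(entry.getKey())) {
                continue; // not one of the documents we want highlighted
            }
            String text = entry.getValue();
            int from = 0;
            int idx;
            while ((idx = text.indexOf(term, from)) >= 0) {
                collector.collect(entry.getKey(), idx, idx + term.length());
                from = idx + term.length();
            }
        }
    }

    // Gathers {docId, startOffset, endOffset} triples for the filtered documents.
    static List<int[]> highlights(Map<Integer, String> docs, String term,
                                  Set<Integer> filter) {
        List<int[]> out = new ArrayList<>();
        search(docs, term, filter, (doc, start, end) -> out.add(new int[] {doc, start, end}));
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "the quick fox");
        docs.put(2, "fox and fox");
        for (int[] hit : highlights(docs, "fox", Collections.singleton(2))) {
            System.out.println("doc " + hit[0] + ": [" + hit[1] + "," + hit[2] + ")");
        }
    }
}
```

Only document 2 is scanned in the example, which mirrors the "filter on just the documents you want highlighted" step.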


 Hi Alan, Simon started some initial work on
 https://issues.apache.org/jira/browse/LUCENE-2878

 Some work and prototypes were done in a branch, but it might be
 lagging behind trunk a bit.

 Additionally at the time it was first done, I think we didn't yet
 support offsets in the postings lists.
 We've since added this and several codecs support it.

 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For
 additional commands, e-mail: dev-h...@lucene.apache.org




Re: Using term offsets for hit highlighting

2012-05-22 Thread Alan Woodward
Hey, I reckon I can have a decent go at getting the branch updated.  Is it best 
to work this out as a patch applying to trunk?  Any patch that merges in all 
the trunk changes to the branch is going to be absolutely massive…

Re: Using term offsets for hit highlighting

2012-05-17 Thread Alan Woodward
Sorry for vanishing for so long, life unexpectedly caught up with me...  I'm 
going to have some time to look at this again next week though, if you're 
interested in picking it up again.

On 21 Mar 2012, at 09:02, Alan Woodward wrote:

 That would be great, thanks!  I had a go at merging it last night, but there 
 are a *lot* of changes that I haven't got my head round yet, so it was 
 getting pretty messy.
 

Re: Using term offsets for hit highlighting

2012-05-17 Thread Simon Willnauer
OK man, I will try to merge up the branch. I'll tell you, this is going to
be messy and it might not compile, but I will make it reasonable so you
can start.

simon

On Thu, May 17, 2012 at 8:03 AM, Alan Woodward
alan.woodw...@romseysoftware.co.uk wrote:
 Sorry for vanishing for so long, life unexpectedly caught up with me...  I'm 
 going to have some time to look at this again next week though, if you're 
 interested in picking it up again.


Re: Using term offsets for hit highlighting

2012-03-21 Thread Simon Willnauer
Alan, if you want I can just merge the branch up next week and we
iterate from there?

simon


Re: Using term offsets for hit highlighting

2012-03-21 Thread Alan Woodward
That would be great, thanks!  I had a go at merging it last night, but there 
are a *lot* of changes that I haven't got my head round yet, so it was getting 
pretty messy.


Re: Using term offsets for hit highlighting

2012-03-20 Thread Alan Woodward
Thanks for all the offers of help!  It looks as though most of the hard work 
has already been done, which is exactly where I like to pick up projects.  :-)

Maybe the best place to start would be for me to rebase the branch against 
trunk, and see what still fits?  I think there have been some fairly major 
changes in the internals since July last year.

On 19 Mar 2012, at 17:07, Mike Sokolov wrote:

 I posted a patch with a Collector somewhat similar to what you described, 
 Alan - it's attached to one of the sub-issues 
 https://issues.apache.org/jira/browse/LUCENE-3318.   It is in a fairly 
 complete alpha state, but has seen no production use of course, since it 
 relies on the remainder of the unfinished work in that branch.  It works by 
 creating a TokenStream based on match positions returned from the query and 
 passing that to the existing Highlighter.  Please feel free to get in touch 
 if you decide to look into that and have questions.
 
 
 -Mike
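
For illustration, the last step of that approach (turning match offsets into highlighted text) might look like the sketch below. It is only a toy stand-in for the TokenStream-plus-Highlighter arrangement described above, not the LUCENE-3318 patch; the `highlight` function and the `<b>` tags are assumptions made for the example.

```python
from typing import List, Tuple


def highlight(text: str, matches: List[Tuple[int, int]],
              pre: str = "<b>", post: str = "</b>") -> str:
    """Wrap each (start, end) character range of `text` in pre/post tags.

    Overlapping or adjacent ranges are merged first so tags never nest.
    """
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(matches):
        if merged and start <= merged[-1][1]:
            # Extend the previous range instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    out, cursor = [], 0
    for start, end in merged:
        out.append(text[cursor:start])            # untouched text before the match
        out.append(pre + text[start:end] + post)  # the highlighted match
        cursor = end
    out.append(text[cursor:])                     # trailing text after the last match
    return "".join(out)
```

Because the function works purely on character offsets, it does not need to re-analyze the text, which is the property the offset-based approach is after.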
 
 On 03/19/2012 11:51 AM, Simon Willnauer wrote:
 On Mon, Mar 19, 2012 at 4:50 PM, Uwe Schindler u...@thetaphi.de wrote:
   
 Have you marked that for GSOC? Would be a good idea!
 
  yes I did
   

Re: Using term offsets for hit highlighting

2012-03-20 Thread Erick Erickson
Yep, the first challenge is always getting the old patch(es) to apply.


Re: Using term offsets for hit highlighting

2012-03-19 Thread Robert Muir
On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward
alan.woodw...@romseysoftware.co.uk wrote:
 Hello,

 The project I'm currently working on requires the reporting of exact hit
 positions from some pretty hairy queries, not all of which are covered by
 the existing highlighter modules.  I'm working round this by translating
 everything into SpanQueries, and using the getSpans() method to locate hits
 (I've extended the Spans interface to make term offsets available -
 see https://issues.apache.org/jira/browse/LUCENE-3826).  This works for our
 use-case, but isn't terribly efficient, and obviously isn't applicable to
 non-Span queries.
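
A toy illustration of the workaround described above, assuming a simplified model rather than Lucene's actual SpanQuery/Spans API: tokens carry a position and character offsets, a hypothetical `near_spans` helper finds ordered two-term matches within a given slop, and each match is reported as a pair of character offsets.

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    term: str
    position: int  # token index within the document
    start: int     # character offset of the first character
    end: int       # character offset one past the last character


def tokenize(text: str) -> List[Token]:
    """Lower-casing word tokenizer that records positions and offsets."""
    return [Token(m.group().lower(), i, m.start(), m.end())
            for i, m in enumerate(re.finditer(r"\w+", text))]


def near_spans(tokens: List[Token], first: str, second: str,
               slop: int) -> List[Tuple[int, int]]:
    """Ordered 'first ... second' matches with at most `slop` tokens between.

    Returns (start_offset, end_offset) pairs covering each whole match.
    """
    spans = []
    for a in tokens:
        if a.term != first:
            continue
        for b in tokens:
            gap = b.position - a.position - 1  # tokens strictly between a and b
            if b.term == second and 0 <= gap <= slop:
                spans.append((a.start, b.end))
    return spans
```

The nested loop is quadratic in the number of tokens, which mirrors the efficiency concern raised in the message: every match requires walking token positions again at highlight time.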

 I've seen a bit of chatter on the list about using term offsets to provide
 accurate highlighting in Lucene.  I'm going to have a couple of weeks free
 in April, and I thought I might have a go at implementing this.  Mainly I'm
 wondering if there have already been thoughts about how to do it.  My current
 thoughts are to somehow extend the Weight and Scorer interface to make term
 offsets available; to get highlights for a given set of documents, you'd
 essentially run the query again, with a filter on just the documents you
 want highlighted, and have a custom collector that gets the term offsets in
 place of the scores.
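
The collector idea could be sketched roughly as follows. This is a self-contained toy model, not Lucene code: `POSTINGS`, `OffsetCollector` and `search` are hypothetical names, and the postings are written by hand so that each occurrence of a term already carries its character offsets.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Offsets = Tuple[int, int]  # (start, end) character offsets

# Hypothetical postings: term -> {doc_id: [offsets of each occurrence]}
POSTINGS: Dict[str, Dict[int, List[Offsets]]] = {
    "lucene": {0: [(0, 6)], 2: [(4, 10), (20, 26)]},
    "offsets": {0: [(12, 19)], 1: [(0, 7)]},
}


class OffsetCollector:
    """Collects match offsets per document in place of scores."""

    def __init__(self) -> None:
        self.hits: Dict[int, List[Offsets]] = defaultdict(list)

    def collect(self, doc_id: int, offsets: List[Offsets]) -> None:
        self.hits[doc_id].extend(offsets)


def search(terms: List[str], doc_filter: Set[int],
           collector: OffsetCollector) -> None:
    """Re-run a simple OR query, visiting only the filtered documents."""
    for term in terms:
        for doc_id, offsets in POSTINGS.get(term, {}).items():
            if doc_id in doc_filter:
                collector.collect(doc_id, offsets)
    for doc_offsets in collector.hits.values():
        doc_offsets.sort()  # report matches in document order
```

Here the filter plays the role of "just the documents you want highlighted": documents outside it are never handed to the collector, so the second pass costs little for a page of results.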


Hi Alan, Simon started some initial work on
https://issues.apache.org/jira/browse/LUCENE-2878

Some work and prototypes were done in a branch, but it might be
lagging behind trunk a bit.

Additionally at the time it was first done, I think we didn't yet
support offsets in the postings lists.
We've since added this and several codecs support it.
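
To make the point about offsets in the postings lists concrete, here is a toy inverted index (plain Python, unrelated to Lucene's codec implementation; `build_index` and `Posting` are hypothetical names) that stores a position and start/end character offsets with each occurrence at index time, so a term can be located later without re-analyzing the text.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Posting(NamedTuple):
    position: int  # token index in the document
    start: int     # character offset where the token begins
    end: int       # character offset just past the token


Index = Dict[str, Dict[int, List[Posting]]]


def build_index(docs: List[str]) -> Index:
    """Index each document, keeping offsets with every occurrence."""
    index: Index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, m in enumerate(re.finditer(r"\w+", text)):
            index[m.group().lower()][doc_id].append(
                Posting(pos, m.start(), m.end()))
    return index
```

The trade-off in this model is the obvious one: each posting grows by two integers, in exchange for not having to re-tokenize stored text when highlighting.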

-- 
lucidimagination.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Using term offsets for hit highlighting

2012-03-19 Thread Alan Woodward
Cool, thanks Robert.  I'll take a look at the JIRA ticket.




Re: Using term offsets for hit highlighting

2012-03-19 Thread Simon Willnauer
Alan, you made my day!

The branch is kind of outdated but I looked at it lately and I can
certainly help to get it up to speed. The feature in that branch is
quite a big one and it's in a very early stage. Still I want to
encourage you to take a look and work on it. I promise all my help
with the issues!

let me know if you have questions!

simon

On Mon, Mar 19, 2012 at 3:52 PM, Alan Woodward
alan.woodw...@romseysoftware.co.uk wrote:
 Cool, thanks Robert.  I'll take a look at the JIRA ticket.

 On 19 Mar 2012, at 14:44, Robert Muir wrote:

 On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward
 alan.woodw...@romseysoftware.co.uk wrote:
 Hello,

 The project I'm currently working on requires the reporting of exact hit
 positions from some pretty hairy queries, not all of which are covered by
 the existing highlighter modules.  I'm working round this by translating
 everything into SpanQueries, and using the getSpans() method to locate hits
 (I've extended the Spans interface to make term offsets available -
 see https://issues.apache.org/jira/browse/LUCENE-3826).  This works for our
 use-case, but isn't terribly efficient, and obviously isn't applicable to
 non-Span queries.

 I've seen a bit of chatter on the list about using term offsets to provide
 accurate highlighting in Lucene.  I'm going to have a couple of weeks free
 in April, and I thought I might have a go at implementing this.  Mainly I'm
 wondering if there's already been thoughts about how to do it.  My current
 thoughts are to somehow extend the Weight and Scorer interface to make term
 offsets available; to get highlights for a given set of documents, you'd
 essentially run the query again, with a filter on just the documents you
 want highlighted, and have a custom collector that gets the term offsets in
 place of the scores.
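
 The two-pass idea described above could look roughly like the following.
 This is a minimal sketch in plain Java, not Lucene code: OffsetIndex,
 OffsetCollector and collectOffsets are hypothetical names invented for
 illustration, a whitespace tokenizer stands in for real analysis, and a
 single-term lookup stands in for an arbitrary query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy stand-in for an index whose postings carry character offsets. Not a Lucene API. */
class OffsetIndex {
    // term -> (docId -> list of {startOffset, endOffset}), docs kept in ascending order
    private final Map<String, Map<Integer, List<int[]>>> postings = new HashMap<>();

    /** Whitespace-tokenise the text and record each token's character offsets. */
    void add(int docId, String text) {
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) {
                String term = text.substring(start, i).toLowerCase();
                postings.computeIfAbsent(term, t -> new TreeMap<>())
                        .computeIfAbsent(docId, d -> new ArrayList<>())
                        .add(new int[] {start, i});
            }
        }
    }

    /** Second pass: visit only the wanted documents and hand offsets to the collector. */
    void collectOffsets(String term, Set<Integer> wantedDocs, OffsetCollector collector) {
        Map<Integer, List<int[]>> byDoc = postings.get(term.toLowerCase());
        if (byDoc == null) return; // term not indexed, nothing to report
        for (Map.Entry<Integer, List<int[]>> e : byDoc.entrySet()) {
            if (!wantedDocs.contains(e.getKey())) continue; // the "filter" step
            for (int[] span : e.getValue()) {
                collector.collect(e.getKey(), span[0], span[1]);
            }
        }
    }
}

/** Hypothetical callback that receives offsets in place of scores. */
interface OffsetCollector {
    void collect(int docId, int startOffset, int endOffset);
}
```

 A caller would index the documents once, run the normal search to pick
 the documents to display, and then call collectOffsets with just those
 document ids to get the character spans to highlight.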


 Hi Alan, Simon started some initial work on
 https://issues.apache.org/jira/browse/LUCENE-2878

 Some work and prototypes were done in a branch, but it might be
 lagging behind trunk a bit.

 Additionally at the time it was first done, I think we didn't yet
 support offsets in the postings lists.
 We've since added this and several codecs support it.

 --
 lucidimagination.com




RE: Using term offsets for hit highlighting

2012-03-19 Thread Uwe Schindler
Have you marked that for GSOC? Would be a good idea!

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Simon Willnauer [mailto:simon.willna...@googlemail.com]
 Sent: Monday, March 19, 2012 4:43 PM
 To: dev@lucene.apache.org
 Subject: Re: Using term offsets for hit highlighting
 
 Alan, you made my day!
 



Re: Using term offsets for hit highlighting

2012-03-19 Thread Simon Willnauer
On Mon, Mar 19, 2012 at 4:50 PM, Uwe Schindler u...@thetaphi.de wrote:
 Have you marked that for GSOC? Would be a good idea!

Yes, I did.




Re: Using term offsets for hit highlighting

2012-03-19 Thread Mike Sokolov
I posted a patch with a Collector somewhat similar to what you
described, Alan. It's attached to one of the sub-issues,
https://issues.apache.org/jira/browse/LUCENE-3318. It is in a fairly
complete alpha state, but of course it has seen no production use, since
it relies on the remainder of the unfinished work in that branch. It
works by creating a TokenStream based on match positions returned from
the query and passing that to the existing Highlighter. Please feel
free to get in touch if you decide to look into that and have questions.
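
The last step, turning match positions into marked-up text, can be
sketched in isolation. The code below is plain Java and only an
illustration: OffsetHighlighter is a hypothetical class, it takes
character offsets directly instead of building a TokenStream, and it is
not the Lucene Highlighter API or the code in the patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: marks up matches from character offsets. */
class OffsetHighlighter {
    /** Wraps each {start, end} span of text in <b>...</b>; spans are assumed not to overlap. */
    static String highlight(String text, List<int[]> spans) {
        List<int[]> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((int[] s) -> s[0])); // process hits left to right
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] span : sorted) {
            out.append(text, cursor, span[0]);   // unmatched text before the hit
            out.append("<b>").append(text, span[0], span[1]).append("</b>");
            cursor = span[1];
        }
        out.append(text, cursor, text.length()); // trailing unmatched text
        return out.toString();
    }
}
```

Because the spans come from the query match itself, only the terms that
actually contributed to the hit get marked up, which is the accuracy
gain this thread is after.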



-Mike
