Re: Using term offsets for hit highlighting
Alan, I merged the branch manually and created a new branch from it. It's here: https://svn.apache.org/repos/asf/lucene/dev/branches/LUCENE-2878 The branch compiles, but there are lots of nocommits / todos. If you have questions, please ask; I will help as much as I can. simon

On Tue, May 22, 2012 at 8:38 PM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Hey, I reckon I can have a decent go at getting the branch updated. Is it best to work this out as a patch applying to trunk? Any patch that merges in all the trunk changes to the branch is going to be absolutely massive…

On 17 May 2012, at 13:15, Simon Willnauer wrote: OK man. I will try to merge up the branch. I tell you, this is going to be messy and it might not compile, but I will make it reasonable so you can start. simon

On Thu, May 17, 2012 at 8:03 AM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Sorry for vanishing for so long, life unexpectedly caught up with me... I'm going to have some time to look at this again next week though, if you're interested in picking it up again.

On 21 Mar 2012, at 09:02, Alan Woodward wrote: That would be great, thanks! I had a go at merging it last night, but there are a *lot* of changes that I haven't got my head round yet, so it was getting pretty messy.

On 21 Mar 2012, at 08:49, Simon Willnauer wrote: Alan, if you want I can just merge the branch up next week and we iterate from there? simon

On Tue, Mar 20, 2012 at 12:34 PM, Erick Erickson erickerick...@gmail.com wrote: Yep, the first challenge is always getting the old patch(es) to apply.

On Tue, Mar 20, 2012 at 4:09 AM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Thanks for all the offers of help! It looks as though most of the hard work has already been done, which is exactly where I like to pick up projects. :-) Maybe the best place to start would be for me to rebase the branch against trunk, and see what still fits? I think there have been some fairly major changes in the internals since July last year.

On 19 Mar 2012, at 17:07, Mike Sokolov wrote: I posted a patch with a Collector somewhat similar to what you described, Alan - it's attached to one of the sub-issues https://issues.apache.org/jira/browse/LUCENE-3318. It is in a fairly complete alpha state, but has seen no production use of course, since it relies on the remainder of the unfinished work in that branch. It works by creating a TokenStream based on match positions returned from the query and passing that to the existing Highlighter. Please feel free to get in touch if you decide to look into that and have questions. -Mike

On 03/19/2012 11:51 AM, Simon Willnauer wrote: On Mon, Mar 19, 2012 at 4:50 PM, Uwe Schindler u...@thetaphi.de wrote: Have you marked that for GSOC? Would be a good idea! yes I did

- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de

-Original Message-
From: Simon Willnauer [mailto:simon.willna...@googlemail.com]
Sent: Monday, March 19, 2012 4:43 PM
To: dev@lucene.apache.org
Subject: Re: Using term offsets for hit highlighting

Alan, you made my day! The branch is kind of outdated, but I looked at it lately and I can certainly help to get it up to speed. The feature in that branch is quite a big one and it's in a very early stage. Still, I want to encourage you to take a look and work on it. I promise all my help with the issues! Let me know if you have questions! simon

On Mon, Mar 19, 2012 at 3:52 PM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Cool, thanks Robert. I'll take a look at the JIRA ticket.

On 19 Mar 2012, at 14:44, Robert Muir wrote: On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Hello, The project I'm currently working on requires the reporting of exact hit positions from some pretty hairy queries, not all of which are covered by the existing highlighter modules. I'm working round this by translating everything into SpanQueries, and using the getSpans() method to locate hits (I've extended the Spans interface to make term offsets available - see https://issues.apache.org/jira/browse/LUCENE-3826). This works for our use-case, but isn't terribly efficient, and obviously isn't applicable to non-Span queries. I've seen a bit of chatter on the list about using term offsets to provide accurate highlighting in Lucene. I'm going to have a couple of weeks free in April, and I thought I might have a go at implementing this. Mainly I'm wondering if there's already been thoughts about how to do it. My current thoughts are to somehow extend the Weight and Scorer interface to make term offsets available; to get highlights for a given set of documents, you'd essentially run the query again, with a filter on just the documents you want highlighted, and have
Re: Using term offsets for hit highlighting
Sweet, thanks Simon. I'll have a go at getting some failing tests passing to begin with.

On 23 May 2012, at 11:59, Simon Willnauer wrote: Alan, I merged the branch manually and created a new branch from it. It's here: https://svn.apache.org/repos/asf/lucene/dev/branches/LUCENE-2878 The branch compiles, but there are lots of nocommits / todos. If you have questions, please ask; I will help as much as I can. simon
Re: Using term offsets for hit highlighting
Hey Alan, I added position iterator support to ConjunctionTermScorer and committed it to the branch. All tests that don't rely on payloads are passing in core. Previously we had to decide up front whether we needed positions; the current code can pull them lazily, which causes fewer changes to the Scorer API. I think we should keep it that way. The only problem is that we currently have no way to tell the iterators whether we need payloads or not. The same is true for offsets, since they are now in the index. I think it would be good if you could tackle the payloads first and pass some info to the Scorer#positions() method so we can pull the right thing. Happy coding. simon

On Wed, May 23, 2012 at 1:23 PM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Sweet, thanks Simon. I'll have a go at getting some failing tests passing to begin with.
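A rough illustration of the "pull positions lazily" idea from Simon's message above, written as a sketch rather than as code from the LUCENE-2878 branch. SketchScorer, positionLoader and loadCount() are hypothetical names invented for this example; they are not Lucene's Scorer API. The assumption is simply that scoring should not pay for decoding positions unless a caller asks for them.

```java
import java.util.function.Supplier;

// Hypothetical sketch only: these classes are NOT Lucene's Scorer API.
final class LazyPositionsSketch {

    /** Stand-in for a scorer positioned on one matching document. */
    static final class SketchScorer {
        private final float score;
        private final Supplier<int[]> positionLoader; // assumed: decodes positions from postings
        private int[] positions;                      // stays null until first requested
        private int loadCount = 0;

        SketchScorer(float score, Supplier<int[]> positionLoader) {
            this.score = score;
            this.positionLoader = positionLoader;
        }

        /** Scoring never touches the position data. */
        float score() {
            return score;
        }

        /** Positions are decoded on the first call and cached afterwards. */
        int[] positions() {
            if (positions == null) {
                positions = positionLoader.get();
                loadCount++;
            }
            return positions;
        }

        /** Number of times the loader actually ran (for demonstration only). */
        int loadCount() {
            return loadCount;
        }
    }

    public static void main(String[] args) {
        SketchScorer scorer = new SketchScorer(1.5f, () -> new int[] {3, 17});
        System.out.println("score=" + scorer.score() + " loads=" + scorer.loadCount());
        System.out.println("positions=" + scorer.positions().length + " loads=" + scorer.loadCount());
        scorer.positions(); // second call reuses the cached array
        System.out.println("loads=" + scorer.loadCount());
    }
}
```

A caller that only needs scores never triggers the loader, which is the property that lets the decision about positions move out of the up-front setup.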
Re: Using term offsets for hit highlighting
OK, so the most straightforward way to do that would be to change the signature to positions(boolean needsPayloads, boolean needsOffsets), I guess. This is a new API so it's not breaking anything. It'll be tomorrow morning before I have a proper go at this now (Cambridge Beer Festival tonight…). Is the mailing list the best place to discuss this, or is JIRA/IRC better?

On 23 May 2012, at 13:43, Simon Willnauer wrote: Hey Alan, I added position iterator support to ConjunctionTermScorer and committed it to the branch. All tests that don't rely on payloads are passing in core. Previously we had to decide up front whether we needed positions; the current code can pull them lazily, which causes fewer changes to the Scorer API. I think we should keep it that way. The only problem is that we currently have no way to tell the iterators whether we need payloads or not. The same is true for offsets, since they are now in the index. I think it would be good if you could tackle the payloads first and pass some info to the Scorer#positions() method so we can pull the right thing. Happy coding. simon
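To make the proposed positions(boolean needsPayloads, boolean needsOffsets) signature concrete, here is a minimal sketch of how such flags might gate what a caller gets back. Only the method name and its two parameters come from the thread; Occurrence, SketchTermScorer and the -1 / null conventions are assumptions made up for illustration and do not reflect the code in the branch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: names and types are illustrative, not the branch's code.
final class PositionsFlagsSketch {

    /** One term occurrence; data that was not requested stays at -1 / null. */
    static final class Occurrence {
        final int position;
        final int startOffset;
        final int endOffset;
        final byte[] payload;

        Occurrence(int position, int startOffset, int endOffset, byte[] payload) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.payload = payload;
        }
    }

    /** Stand-in for a term scorer holding the postings data of one document. */
    static final class SketchTermScorer {
        private final int[] positions;
        private final int[][] offsets;   // {start, end} per occurrence
        private final byte[][] payloads; // one payload per occurrence

        SketchTermScorer(int[] positions, int[][] offsets, byte[][] payloads) {
            this.positions = positions;
            this.offsets = offsets;
            this.payloads = payloads;
        }

        /** Returns every occurrence, filling in only the data the caller asked for. */
        List<Occurrence> positions(boolean needsPayloads, boolean needsOffsets) {
            List<Occurrence> out = new ArrayList<>();
            for (int i = 0; i < positions.length; i++) {
                int start = needsOffsets ? offsets[i][0] : -1;
                int end = needsOffsets ? offsets[i][1] : -1;
                byte[] payload = needsPayloads ? payloads[i] : null;
                out.add(new Occurrence(positions[i], start, end, payload));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        SketchTermScorer scorer = new SketchTermScorer(
                new int[] {1, 4},
                new int[][] {{4, 9}, {16, 19}},
                new byte[][] {{7}, {9}});
        // A highlighter would ask for offsets but not payloads.
        for (Occurrence o : scorer.positions(false, true)) {
            System.out.println(o.position + " [" + o.startOffset + "," + o.endOffset
                    + ") payload=" + (o.payload != null));
        }
    }
}
```

Two booleans are the simplest option; a flags value or enum set would be another way to express the same request if more kinds of per-position data turn up later.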
Re: Using term offsets for hit highlighting
in that branch. It works by creating a TokenStream based on match positions returned from the query and passing that to the existing Highlighter. Please feel free to get in touch if you decide to look into that and have questions. -Mike On 03/19/2012 11:51 AM, Simon Willnauer wrote: On Mon, Mar 19, 2012 at 4:50 PM, Uwe Schindleru...@thetaphi.de wrote: Have you marked that for GSOC? Would be a good idea! yes I did - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, March 19, 2012 4:43 PM To: dev@lucene.apache.org Subject: Re: Using term offsets for hit highlighting Alan, you made my day! The branch is kind of outdated but I looked at it lately and I can certainly help to get it up to speed. The feature in that branch is quite a big one and its in a very early stage. Still I want to encourage you to take a look and work on it. I promise all my help with the issues! let me know if you have questions! simon On Mon, Mar 19, 2012 at 3:52 PM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Cool, thanks Robert. I'll take a look at the JIRA ticket. On 19 Mar 2012, at 14:44, Robert Muir wrote: On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Hello, The project I'm currently working on requires the reporting of exact hit positions from some pretty hairy queries, not all of which are covered by the existing highlighter modules. I'm working round this by translating everything into SpanQueries, and using the getSpans() method to locate hits (I've extended the Spans interface to make term offsets available - see https://issues.apache.org/jira/browse/LUCENE-3826). This works for our use-case, but isn't terribly efficient, and obviously isn't applicable to non-Span queries. I've seen a bit of chatter on the list about using term offsets to provide accurate highlighting in Lucene. 
I'm going to have a couple of weeks free in April, and I thought I might have a go at implementing this. Mainly I'm wondering if there's already been thoughts about how to do it. My current thoughts are to somehow extend the Weight and Scorer interface to make term offsets available; to get highlights for a given set of documents, you'd essentially run the query again, with a filter on just the documents you want highlighted, and have a custom collector that gets the term offsets in place of the scores. Hi Alan, Simon started some initial work on https://issues.apache.org/jira/browse/LUCENE-2878 Some work and prototypes were done in a branch, but it might be lagging behind trunk a bit. Additionally at the time it was first done, I think we didn't yet support offsets in the postings lists. We've since added this and several codecs support it. -- lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
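Alan's idea of "a custom collector that gets the term offsets in place of the scores" could look roughly like the sketch below. It is self-contained plain Java and deliberately avoids Lucene's real Collector class: OffsetCollector, collect(docId, startOffset, endOffset) and highlight() are hypothetical names chosen for this example, and the <b> markup is just one possible way of rendering a hit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only: this does not use or mirror Lucene's Collector API.
final class OffsetHighlightSketch {

    /** Gathers character offsets of matches per document instead of scores. */
    static final class OffsetCollector {
        private final Map<Integer, List<int[]>> offsetsByDoc = new TreeMap<>();

        void collect(int docId, int startOffset, int endOffset) {
            offsetsByDoc.computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(new int[] {startOffset, endOffset});
        }

        List<int[]> offsetsFor(int docId) {
            return offsetsByDoc.getOrDefault(docId, Collections.emptyList());
        }
    }

    /** Wraps each [start, end) range of the text in <b> tags, skipping overlaps. */
    static String highlight(String text, List<int[]> offsets) {
        List<int[]> sorted = new ArrayList<>(offsets);
        sorted.sort(Comparator.comparingInt((int[] o) -> o[0]));
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] o : sorted) {
            if (o[0] < cursor) {
                continue; // overlaps a range that was already emitted
            }
            out.append(text, cursor, o[0]);
            out.append("<b>").append(text, o[0], o[1]).append("</b>");
            cursor = o[1];
        }
        return out.append(text.substring(cursor)).toString();
    }

    public static void main(String[] args) {
        OffsetCollector collector = new OffsetCollector();
        // Pretend the query matched "fox" and "quick" in document 7.
        collector.collect(7, 16, 19);
        collector.collect(7, 4, 9);
        System.out.println(highlight("the quick brown fox", collector.offsetsFor(7)));
    }
}
```

The point of the sketch is the division of labour: the query run supplies offsets, and the highlighting step needs nothing but the stored text and those offsets, with no re-analysis of the document.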
Re: Using term offsets for hit highlighting
Hey, I reckon I can have a decent go at getting the branch updated. Is it best to work this out as a patch applying to trunk? Any patch that merges in all the trunk changes to the branch is going to be absolutely massive…

On 17 May 2012, at 13:15, Simon Willnauer wrote: ok man. I will try to merge up the branch. I tell you this is going to be messy and it might not compile but I will make it reasonable so you can start. simon
Re: Using term offsets for hit highlighting
Sorry for vanishing for so long, life unexpectedly caught up with me... I'm going to have some time to look at this again next week though, if you're interested in picking it up again.

On 21 Mar 2012, at 09:02, Alan Woodward wrote: That would be great, thanks! I had a go at merging it last night, but there are a *lot* of changes that I haven't got my head round yet, so it was getting pretty messy.
Re: Using term offsets for hit highlighting
ok man. I will try to merge up the branch. I tell you this is going to be messy and it might not compile but I will make it reasonable so you can start. simon On Thu, May 17, 2012 at 8:03 AM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Sorry for vanishing for so long, life unexpectedly caught up with me... I'm going to have some time to look at this again next week though, if you're interested in picking it up again.
Re: Using term offsets for hit highlighting
Alan, if you want I can just merge the branch up next week and we iterate from there? simon On Tue, Mar 20, 2012 at 12:34 PM, Erick Erickson erickerick...@gmail.com wrote: Yep, the first challenge is always getting the old patch(es) to apply.
Re: Using term offsets for hit highlighting
That would be great, thanks! I had a go at merging it last night, but there are a *lot* of changes that I haven't got my head round yet, so it was getting pretty messy. On 21 Mar 2012, at 08:49, Simon Willnauer wrote: Alan, if you want I can just merge the branch up next week and we iterate from there? simon
Re: Using term offsets for hit highlighting
Thanks for all the offers of help! It looks as though most of the hard work has already been done, which is exactly where I like to pick up projects. :-) Maybe the best place to start would be for me to rebase the branch against trunk, and see what still fits? I think there have been some fairly major changes in the internals since July last year. On 19 Mar 2012, at 17:07, Mike Sokolov wrote: I posted a patch with a Collector somewhat similar to what you described, Alan - it's attached to one of the sub-issues https://issues.apache.org/jira/browse/LUCENE-3318. It is in a fairly complete alpha state, but has seen no production use of course, since it relies on the remainder of the unfinished work in that branch. It works by creating a TokenStream based on match positions returned from the query and passing that to the existing Highlighter. Please feel free to get in touch if you decide to look into that and have questions. -Mike
Re: Using term offsets for hit highlighting
Yep, the first challenge is always getting the old patch(es) to apply. On Tue, Mar 20, 2012 at 4:09 AM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Thanks for all the offers of help! It looks as though most of the hard work has already been done, which is exactly where I like to pick up projects. :-) Maybe the best place to start would be for me to rebase the branch against trunk, and see what still fits? I think there have been some fairly major changes in the internals since July last year.
Re: Using term offsets for hit highlighting
On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Hello, The project I'm currently working on requires the reporting of exact hit positions from some pretty hairy queries, not all of which are covered by the existing highlighter modules. I'm working round this by translating everything into SpanQueries, and using the getSpans() method to locate hits (I've extended the Spans interface to make term offsets available - see https://issues.apache.org/jira/browse/LUCENE-3826). This works for our use-case, but isn't terribly efficient, and obviously isn't applicable to non-Span queries. I've seen a bit of chatter on the list about using term offsets to provide accurate highlighting in Lucene. I'm going to have a couple of weeks free in April, and I thought I might have a go at implementing this. Mainly I'm wondering if there's already been thoughts about how to do it. My current thoughts are to somehow extend the Weight and Scorer interface to make term offsets available; to get highlights for a given set of documents, you'd essentially run the query again, with a filter on just the documents you want highlighted, and have a custom collector that gets the term offsets in place of the scores. Hi Alan, Simon started some initial work on https://issues.apache.org/jira/browse/LUCENE-2878 Some work and prototypes were done in a branch, but it might be lagging behind trunk a bit. Additionally at the time it was first done, I think we didn't yet support offsets in the postings lists. We've since added this and several codecs support it. -- lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
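To make the end goal of this thread concrete: once a query can report the start and end character offsets of each matching term, producing the highlighted text is a simple string operation. The following is only a minimal sketch under that assumption. It does not use any Lucene API, and the names OffsetHighlighter, Offset and highlight are hypothetical, invented purely for illustration; how the offsets are actually obtained (from the postings lists, via a collector, and so on) is exactly the open question discussed above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: wraps character spans of a text in highlight tags. Not a Lucene class. */
public class OffsetHighlighter {

    /** Half-open character span [start, end) of one matching term (hypothetical type). */
    public static final class Offset {
        final int start;
        final int end;

        public Offset(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns a copy of text with every span wrapped in pre/post markers.
     * Spans are processed in order of start offset; spans that are empty,
     * out of range, or overlap an earlier span are skipped.
     */
    public static String highlight(String text, List<Offset> offsets, String pre, String post) {
        List<Offset> sorted = new ArrayList<>(offsets);
        sorted.sort(Comparator.comparingInt((Offset o) -> o.start));

        StringBuilder out = new StringBuilder();
        int cursor = 0; // index of the first character not yet copied to the output
        for (Offset o : sorted) {
            if (o.start < cursor || o.end > text.length() || o.end <= o.start) {
                continue; // invalid span, or one that overlaps a span already emitted
            }
            out.append(text, cursor, o.start);                          // unhighlighted gap
            out.append(pre).append(text, o.start, o.end).append(post);  // highlighted term
            cursor = o.end;
        }
        out.append(text, cursor, text.length()); // remainder after the last span
        return out.toString();
    }
}
```

Calling highlight with the text "the quick brown fox", the spans (4, 9) and (16, 19), and the markers "<b>" and "</b>" wraps "quick" and "fox". Skipping overlapping spans is just one possible policy; merging them would be an equally reasonable choice.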
Re: Using term offsets for hit highlighting
Cool, thanks Robert. I'll take a look at the JIRA ticket. On 19 Mar 2012, at 14:44, Robert Muir wrote: Hi Alan, Simon started some initial work on https://issues.apache.org/jira/browse/LUCENE-2878 Some work and prototypes were done in a branch, but it might be lagging behind trunk a bit. Additionally at the time it was first done, I think we didn't yet support offsets in the postings lists. We've since added this and several codecs support it. -- lucidimagination.com
Re: Using term offsets for hit highlighting
Alan, you made my day! The branch is kind of outdated but I looked at it lately and I can certainly help to get it up to speed. The feature in that branch is quite a big one and it's in a very early stage. Still I want to encourage you to take a look and work on it. I promise all my help with the issues! let me know if you have questions! simon On Mon, Mar 19, 2012 at 3:52 PM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Cool, thanks Robert. I'll take a look at the JIRA ticket.
RE: Using term offsets for hit highlighting
Have you marked that for GSOC? Would be a good idea! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, March 19, 2012 4:43 PM To: dev@lucene.apache.org Subject: Re: Using term offsets for hit highlighting Alan, you made my day! The branch is kind of outdated but I looked at it lately and I can certainly help to get it up to speed. The feature in that branch is quite a big one and it's in a very early stage. Still I want to encourage you to take a look and work on it. I promise all my help with the issues! let me know if you have questions! simon
Re: Using term offsets for hit highlighting
On Mon, Mar 19, 2012 at 4:50 PM, Uwe Schindler u...@thetaphi.de wrote: Have you marked that for GSOC? Would be a good idea! yes I did - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, March 19, 2012 4:43 PM To: dev@lucene.apache.org Subject: Re: Using term offsets for hit highlighting Alan, you made my day! The branch is kind of outdated but I looked at it lately and I can certainly help to get it up to speed. The feature in that branch is quite a big one and its in a very early stage. Still I want to encourage you to take a look and work on it. I promise all my help with the issues! let me know if you have questions! simon On Mon, Mar 19, 2012 at 3:52 PM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Cool, thanks Robert. I'll take a look at the JIRA ticket. On 19 Mar 2012, at 14:44, Robert Muir wrote: On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote: Hello, The project I'm currently working on requires the reporting of exact hit positions from some pretty hairy queries, not all of which are covered by the existing highlighter modules. I'm working round this by translating everything into SpanQueries, and using the getSpans() method to locate hits (I've extended the Spans interface to make term offsets available - see https://issues.apache.org/jira/browse/LUCENE-3826). This works for our use-case, but isn't terribly efficient, and obviously isn't applicable to non-Span queries. I've seen a bit of chatter on the list about using term offsets to provide accurate highlighting in Lucene. I'm going to have a couple of weeks free in April, and I thought I might have a go at implementing this. Mainly I'm wondering if there's already been thoughts about how to do it. 
My current thoughts are to somehow extend the Weight and Scorer interface to make term offsets available; to get highlights for a given set of documents, you'd essentially run the query again, with a filter on just the documents you want highlighted, and have a custom collector that gets the term offsets in place of the scores.

Hi Alan, Simon started some initial work on https://issues.apache.org/jira/browse/LUCENE-2878

Some work and prototypes were done in a branch, but it might be lagging behind trunk a bit. Additionally at the time it was first done, I think we didn't yet support offsets in the postings lists. We've since added this and several codecs support it.

-- lucidimagination.com
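The proposal in the original message (re-run the query restricted to the documents of interest, and collect term offsets in place of scores) can be sketched roughly as follows. This is a plain-Java model of the idea under stated assumptions, not the Lucene Weight, Scorer or Collector API: `TermOffsetSource`, `OffsetCollector` and `OffsetSearch` are hypothetical names, and the "query" is just an OR over a list of terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical lookup of {start, end} character offsets for a term in a document.
interface TermOffsetSource {
    List<int[]> offsets(String term, int docId);
}

// Hypothetical collector that records match offsets per document instead of scores.
final class OffsetCollector {
    private final Map<Integer, List<int[]>> hits = new TreeMap<>();

    // Called once per (document, matching term); documents with no offsets are skipped.
    void collect(int docId, List<int[]> offsets) {
        if (!offsets.isEmpty()) {
            hits.computeIfAbsent(docId, d -> new ArrayList<>()).addAll(offsets);
        }
    }

    // Returns docId -> offsets, with each document's offsets sorted by start offset.
    Map<Integer, List<int[]>> hits() {
        for (List<int[]> spans : hits.values()) {
            spans.sort((a, b) -> Integer.compare(a[0], b[0]));
        }
        return hits;
    }
}

final class OffsetSearch {
    // Stands in for re-running the query with a filter: only docs in docFilter are visited.
    static Map<Integer, List<int[]>> run(TermOffsetSource source,
                                         List<String> terms,
                                         Set<Integer> docFilter) {
        OffsetCollector collector = new OffsetCollector();
        for (int docId : new TreeSet<>(docFilter)) {
            for (String term : terms) {
                collector.collect(docId, source.offsets(term, docId));
            }
        }
        return collector.hits();
    }
}
```

The point of the filter is that highlighting only needs offsets for the handful of documents actually shown, so the second pass touches far fewer postings than the original search.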
Re: Using term offsets for hit highlighting
I posted a patch with a Collector somewhat similar to what you described, Alan - it's attached to one of the sub-issues https://issues.apache.org/jira/browse/LUCENE-3318. It is in a fairly complete alpha state, but has seen no production use of course, since it relies on the remainder of the unfinished work in that branch. It works by creating a TokenStream based on match positions returned from the query and passing that to the existing Highlighter. Please feel free to get in touch if you decide to look into that and have questions.

-Mike

On 03/19/2012 11:51 AM, Simon Willnauer wrote:

On Mon, Mar 19, 2012 at 4:50 PM, Uwe Schindler u...@thetaphi.de wrote:

Have you marked that for GSOC? Would be a good idea!

yes I did

- Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-Original Message-
From: Simon Willnauer [mailto:simon.willna...@googlemail.com]
Sent: Monday, March 19, 2012 4:43 PM
To: dev@lucene.apache.org
Subject: Re: Using term offsets for hit highlighting

Alan, you made my day! The branch is kind of outdated but I looked at it lately and I can certainly help to get it up to speed. The feature in that branch is quite a big one and it's in a very early stage. Still I want to encourage you to take a look and work on it. I promise all my help with the issues! let me know if you have questions!

simon

On Mon, Mar 19, 2012 at 3:52 PM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote:

Cool, thanks Robert. I'll take a look at the JIRA ticket.

On 19 Mar 2012, at 14:44, Robert Muir wrote:

On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward alan.woodw...@romseysoftware.co.uk wrote:

Hello,

The project I'm currently working on requires the reporting of exact hit positions from some pretty hairy queries, not all of which are covered by the existing highlighter modules.
I'm working round this by translating everything into SpanQueries, and using the getSpans() method to locate hits (I've extended the Spans interface to make term offsets available - see https://issues.apache.org/jira/browse/LUCENE-3826). This works for our use-case, but isn't terribly efficient, and obviously isn't applicable to non-Span queries.

I've seen a bit of chatter on the list about using term offsets to provide accurate highlighting in Lucene. I'm going to have a couple of weeks free in April, and I thought I might have a go at implementing this. Mainly I'm wondering if there's already been thoughts about how to do it.

My current thoughts are to somehow extend the Weight and Scorer interface to make term offsets available; to get highlights for a given set of documents, you'd essentially run the query again, with a filter on just the documents you want highlighted, and have a custom collector that gets the term offsets in place of the scores.

Hi Alan, Simon started some initial work on https://issues.apache.org/jira/browse/LUCENE-2878

Some work and prototypes were done in a branch, but it might be lagging behind trunk a bit. Additionally at the time it was first done, I think we didn't yet support offsets in the postings lists. We've since added this and several codecs support it.

-- lucidimagination.com
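Once match offsets are available, turning them into highlighted text is the easy part. The patch described at the top of this message feeds a TokenStream built from match positions into the existing Highlighter; the sketch below deliberately swaps that for direct string wrapping, to show only the final step. `OffsetHighlighter` is a hypothetical name, the `<b>` tags are an arbitrary choice, and the code assumes the spans are sorted by start offset and do not overlap.

```java
import java.util.List;

// Hypothetical helper; not part of Lucene or of the LUCENE-3318 patch.
final class OffsetHighlighter {
    // Wraps each {start, end} span (end exclusive) in <b>...</b>.
    // Assumes spans are sorted by start offset and do not overlap.
    static String highlight(String text, List<int[]> spans) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] span : spans) {
            // Copy the unmatched text before this span, then the marked-up match.
            out.append(text, cursor, span[0]);
            out.append("<b>").append(text, span[0], span[1]).append("</b>");
            cursor = span[1];
        }
        // Copy whatever follows the last match.
        out.append(text, cursor, text.length());
        return out.toString();
    }
}
```

For example, `OffsetHighlighter.highlight("Lucene term offsets", List.of(new int[] {7, 11}, new int[] {12, 19}))` returns `Lucene <b>term</b> <b>offsets</b>`.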