[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset
[ https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746759#comment-16746759 ] Daniel Meehl edited comment on LUCENE-8651 at 1/19/19 12:58 AM: As a little more of an explanation, all I did here was to replace the KeywordTokenStream (from the 1st patch) with a KeywordTokenizer. This causes the test to fail with an IllegalStateException because the KeywordTokenizer has its close() and then reset() methods called, which swaps out the previously set reader for the Tokenizer.ILLEGAL_STATE_READER. was (Author: dmeehl): As a little more of an explanation, all I did here was to replace the KeywordTokenStream (from the 1st patch) to a KeywordTokenizer. This causes the test to fail with an IllegalStateException because the KeywordTokenizer has it's end and then reset methods called which swaps out the previously set reader for the Tokenizer.ILLEGAL_STATE_READER. > Tokenizer implementations can't be reset > > > Key: LUCENE-8651 > URL: https://issues.apache.org/jira/browse/LUCENE-8651 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Priority: Major > Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch > > > The fine print here is that they can't be reset without calling setReader() > every time before reset() is called. The reason for this is that Tokenizer > violates the contract put forth by TokenStream.reset() which is the following: > "Resets this stream to a clean state. Stateful implementations must implement > this method so that they can be reused, just as if they had been created > fresh." > Tokenizer implementation's reset function can't reset in that manner because > their Tokenizer.close() removes the reference to the underlying Reader > because of LUCENE-2387. The catch-22 here is that we don't want to > unnecessarily keep around a Reader (memory leak) but we would like to be able > to reset() if necessary. > The patches include an integration test that attempts to use a > ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer > TokenStream. This test fails with an IllegalStateException thrown by > Tokenizer.ILLEGAL_STATE_READER. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
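To make the failure sequence concrete, here is a minimal sketch of the calls that trip Tokenizer.ILLEGAL_STATE_READER, written against stock Lucene APIs rather than taken from the attached patches (the class name TokenizerReuseDemo is invented):

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.core.KeywordTokenizer;

public class TokenizerReuseDemo {
  public static void main(String[] args) throws IOException {
    KeywordTokenizer tokenizer = new KeywordTokenizer();
    tokenizer.setReader(new StringReader("first value"));
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // consume the single keyword token
    }
    tokenizer.end();
    tokenizer.close(); // close() drops the Reader reference (see LUCENE-2387)

    // Second pass without setReader(): reset() itself succeeds, but it
    // installs Tokenizer.ILLEGAL_STATE_READER as the input, so the next
    // read from the stream throws.
    tokenizer.reset();
    tokenizer.incrementToken(); // throws IllegalStateException
  }
}
{code}

Calling setReader() again before the second reset() makes the same sequence pass, which is the workaround the LUCENE-8651 description refers to.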
[jira] [Comment Edited] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746705#comment-16746705 ] Daniel Meehl edited comment on LUCENE-8650 at 1/19/19 1:00 AM: --- [~romseygeek] Yes I will. The core issue is that Tokenizer implementations end up clearing their Reader when they close() and thus can never reset() without setting a new Reader. was (Author: dmeehl): [~romseygeek] Yes I will. The core issue is that Tokenizer implementations end up clearing their Reader when they end() and thus can never reset() without setting a new Reader. > ConcatenatingTokenStream does not end() nor reset() properly > > > Key: LUCENE-8650 > URL: https://issues.apache.org/jira/browse/LUCENE-8650 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Assignee: Alan Woodward >Priority: Major > Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, > LUCENE-8650-3.patch > > > All (I think) TokenStream implementations set a "final offset" after calling > super.end() in their end() methods. ConcatenatingTokenStream fails to do > this. Because of this, it's final offset is not readable and > DefaultIndexingChain in turn fails to set the lastStartOffset properly. This > results in problems with indexing which can include unsearchable content or > IllegalStateExceptions. > ConcatenatingTokenStream also fails to reset() properly. Specifically, it > does not set its currentSource and offsetIncrement back to 0. Because of > this, copyField directives (in the schema) do not work and content becomes > unsearchable. > I've created a few patches that illustrate the problem and then provide a fix. > The first patch enhances the TestConcatenatingTokensStream to check for > finalOffset, which as you can see ends up being 0. > I created the next patch separately because it includes extra classes used > for the testing that Lucene may or may not want to merge in. This patch adds > an integration test that loads some content into the 'text' field. The schema > then copies it to 'content' using a copyField directive. The test searches in > the content field for the loaded text and fails to find it even though the > field does contain the content. Flip the debug flag to see a nicer printout > of the response and what's in the index. Notice that the added class I > alluded to is KeywordTokenStream .This class had to be added because of > another (ultimately unrelated) problem: ConcatenatingTokenStream cannot > concatenate Tokenziers. This is because Tokenizer violates the contract put > forth by TokenStream.reset(). This separate problem warrants its own ticket, > though. However, ultimately KeywordTokenStream may be useful to others and > could be considered for adding to the repo. > The third patch finally fixes ConcatenatingTokenStream by storing and setting > a finalOffset as the last task in the end() method, and resetting > currentSource, offsetIncrement and finalOffset when reset() is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
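For readers without the patches applied: the KeywordTokenStream mentioned in the quoted description is a plain TokenStream over a String, so there is no Reader to lose on close() and the reset() contract is easy to honor. A minimal sketch of such a class follows; its exact shape here is an assumption, and the authoritative version is in the attached patch:

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public final class KeywordTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final String value;
  private boolean done = false;

  public KeywordTokenStream(String value) {
    this.value = value;
  }

  @Override
  public boolean incrementToken() {
    if (done) {
      return false;
    }
    clearAttributes();
    termAtt.append(value);           // the whole input becomes one token
    offsetAtt.setOffset(0, value.length());
    done = true;
    return true;
  }

  @Override
  public void end() throws IOException {
    super.end();
    offsetAtt.setOffset(value.length(), value.length()); // final offset
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;                    // fully reusable: no Reader involved
  }
}
{code}

Since reset() only needs to clear a boolean, this class can be reused indefinitely, which is what ConcatenatingTokenStream needs from its sources.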
[jira] [Updated] (LUCENE-8651) Tokenizer implementations can't be reset
[ https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Meehl updated LUCENE-8651: - Description: The fine print here is that they can't be reset without calling setReader() every time before reset() is called. The reason for this is that Tokenizer violates the contract put forth by TokenStream.reset() which is the following: "Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh." Tokenizer implementations' reset functions can't reset in that manner because Tokenizer.close() removes the reference to the underlying Reader because of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep around a Reader (memory leak), but we would like to be able to reset() if necessary. The patches include an integration test that attempts to use a ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer TokenStream. This test fails with an IllegalStateException thrown by Tokenizer.ILLEGAL_STATE_READER. was: The fine print here is that they can't be reset without calling setReader() every time before reset() is called. The reason for this is that Tokenizer violates the contract put forth by TokenStream.reset() which is the following: "Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh." Tokenizer implementation's reset function can't reset in that manner because their Tokenizer.end() removes the reference to the underlying Reader because of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep around a Reader (memory leak) but we would like to be able to reset() if necessary. The patches include an integration test that attempts to use a ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer TokenStream. This test fails with an IllegalStateException thrown by Tokenizer.ILLEGAL_STATE_READER. > Tokenizer implementations can't be reset > > > Key: LUCENE-8651 > URL: https://issues.apache.org/jira/browse/LUCENE-8651 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Priority: Major > Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch > > > The fine print here is that they can't be reset without calling setReader() > every time before reset() is called. The reason for this is that Tokenizer > violates the contract put forth by TokenStream.reset() which is the following: > "Resets this stream to a clean state. Stateful implementations must implement > this method so that they can be reused, just as if they had been created > fresh." > Tokenizer implementation's reset function can't reset in that manner because > their Tokenizer.close() removes the reference to the underlying Reader > because of LUCENE-2387. The catch-22 here is that we don't want to > unnecessarily keep around a Reader (memory leak) but we would like to be able > to reset() if necessary. > The patches include an integration test that attempts to use a > ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer > TokenStream. This test fails with an IllegalStateException thrown by > Tokenizer.ILLEGAL_STATE_READER. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746828#comment-16746828 ] Daniel Meehl commented on LUCENE-8650: -- I do have one more question related to the fix. I noticed that most of the other TokenStream implementations end up setting their final offset like the code below. Why do they do this? Why set the start and end to the same value? In my patch, I set the start to 0 and the end to finalOffset, because this seems like the correct thing to do. {{offsetAtt.setOffset(finalOffset, finalOffset);}} > ConcatenatingTokenStream does not end() nor reset() properly > > > Key: LUCENE-8650 > URL: https://issues.apache.org/jira/browse/LUCENE-8650 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Assignee: Alan Woodward >Priority: Major > Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, > LUCENE-8650-3.patch > > > All (I think) TokenStream implementations set a "final offset" after calling > super.end() in their end() methods. ConcatenatingTokenStream fails to do > this. Because of this, it's final offset is not readable and > DefaultIndexingChain in turn fails to set the lastStartOffset properly. This > results in problems with indexing which can include unsearchable content or > IllegalStateExceptions. > ConcatenatingTokenStream also fails to reset() properly. Specifically, it > does not set its currentSource and offsetIncrement back to 0. Because of > this, copyField directives (in the schema) do not work and content becomes > unsearchable. > I've created a few patches that illustrate the problem and then provide a fix. > The first patch enhances the TestConcatenatingTokensStream to check for > finalOffset, which as you can see ends up being 0. > I created the next patch separately because it includes extra classes used > for the testing that Lucene may or may not want to merge in. This patch adds > an integration test that loads some content into the 'text' field. The schema > then copies it to 'content' using a copyField directive. The test searches in > the content field for the loaded text and fails to find it even though the > field does contain the content. Flip the debug flag to see a nicer printout > of the response and what's in the index. Notice that the added class I > alluded to is KeywordTokenStream .This class had to be added because of > another (ultimately unrelated) problem: ConcatenatingTokenStream cannot > concatenate Tokenziers. This is because Tokenizer violates the contract put > forth by TokenStream.reset(). This separate problem warrants its own ticket, > though. However, ultimately KeywordTokenStream may be useful to others and > could be considered for adding to the repo. > The third patch finally fixes ConcatenatingTokenStream by storing and setting > a finalOffset as the last task in the end() method, and resetting > currentSource, offsetIncrement and finalOffset when reset() is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
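For context on the pattern being asked about, here is a hedged sketch of how end() conventionally looks in a Tokenizer (the class and the charsRead bookkeeping are invented for illustration). The offset reported by end() is a position, the offset one past the last character consumed, rather than a token span; since there is no token left to delimit, implementations appear to pass the same value for both arguments to signal exactly that.

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Invented example: emits the whole input as one token, then shows the
// conventional end() pattern.
public final class SingleTokenTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private boolean done = false;
  private int charsRead = 0;

  @Override
  public boolean incrementToken() throws IOException {
    if (done) {
      return false;
    }
    clearAttributes();
    char[] buffer = new char[256];
    int len;
    while ((len = input.read(buffer, 0, buffer.length)) != -1) {
      termAtt.append(new String(buffer, 0, len));
      charsRead += len;
    }
    offsetAtt.setOffset(correctOffset(0), correctOffset(charsRead));
    done = true;
    return true;
  }

  @Override
  public void end() throws IOException {
    super.end();
    // By the time end() runs there is no token left, only a final
    // *position*: the offset one past the last character consumed.
    // With no span to report, start and end are both that position.
    int finalOffset = correctOffset(charsRead);
    offsetAtt.setOffset(finalOffset, finalOffset);
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;
    charsRead = 0;
  }
}
{code}

On that reading, setting the start to 0 in end() would make the end state look like a span over the whole input, which consumers that treat it as a position may not expect.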
[jira] [Updated] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Meehl updated LUCENE-8650: - Description: All (I think) TokenStream implementations set a "final offset" after calling super.end() in their end() methods. ConcatenatingTokenStream fails to do this. Because of this, its final offset is not readable and DefaultIndexingChain in turn fails to set the lastStartOffset properly. This results in problems with indexing which can include unsearchable content or IllegalStateExceptions. ConcatenatingTokenStream also fails to reset() properly. Specifically, it does not set its currentSource and offsetIncrement back to 0. Because of this, copyField directives (in the schema) do not work and content becomes unsearchable. I've created a few patches that illustrate the problem and then provide a fix. The first patch enhances the TestConcatenatingTokenStream to check for finalOffset, which as you can see ends up being 0. I created the next patch separately because it includes extra classes used for the testing that Lucene may or may not want to merge in. This patch adds an integration test that loads some content into the 'text' field. The schema then copies it to 'content' using a copyField directive. The test searches in the content field for the loaded text and fails to find it even though the field does contain the content. Flip the debug flag to see a nicer printout of the response and what's in the index. Notice that the added class I alluded to is KeywordTokenStream. This class had to be added because of another (ultimately unrelated) problem: ConcatenatingTokenStream cannot concatenate Tokenizers. This is because Tokenizer violates the contract put forth by TokenStream.reset(). This separate problem warrants its own ticket, though. However, ultimately KeywordTokenStream may be useful to others and could be considered for adding to the repo. The third patch finally fixes ConcatenatingTokenStream by storing and setting a finalOffset as the last task in the end() method, and resetting currentSource, offsetIncrement and finalOffset when reset() is called. was: All (I think) TokenStream implementations set a "final offset" after calling super.end() in their end() methods. ConcatenatingTokenStream fails to do this. Because of this, it's final offset is not readable and DefaultIndexingChain in turn fails to set the lastStartOffset properly. This results in problems with indexing which can include unsearchable content or IllegalStateExceptions. ConcatenatingTokenStream also fails to reset() properly. Specifically, it does not set its currentSource and offsetIncrement back to 0. Because of this, copyField directives (in the schema) do not work and content becomes unsearchable. I've created a few patches that illustrate the problem and then provide a fix. The first patch enhances the TestConcatenatingTokensStream to check for finalOffset, which as you can see ends up being 0. I created the next patch separately because it includes extra classes used for the testing that Lucene may or may not want to merge in. This patch adds an integration test that loads some content into the 'text' field. The schema then copies it to 'content' using a copyField directive. The test searches in the content field for the loaded text and fails to find it even though the field does contain the content. Flip the debug flag to see a nicer printout of the response and what's in the index. 
Notice that the added class I alluded to is KeywordTokenStream .This class had to be added because of another (ultimately unrelated) problem: ConcatenatingTokenStream cannot concatenate Tokenziers. This is because Tokenizer violates the contract put forth by TokenStream.reset(). This separate problem warrants its own ticket, though. However, ultimately KeywordTokenStream may be useful to others and could be considered for adding to the repo. The third patch finally fixes ConcatenatingTokenStream by storing and setting a finalOffset as the last task in the end() method, and resetting currentSource, offsetIncrement and finalOffset when reset() is called. > ConcatenatingTokenStream does not end() nor reset() properly > > > Key: LUCENE-8650 > URL: https://issues.apache.org/jira/browse/LUCENE-8650 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Assignee: Alan Woodward >Priority: Major > Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, > LUCENE-8650-3.patch > > > All (I think) TokenStream implementations set a "final offset" after calling > super.end() in their end() methods. ConcatenatingTokenStream fails to do > this. Because of this, it's final offset is not
[jira] [Updated] (LUCENE-8651) Tokenizer implementations can't be reset
[ https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Meehl updated LUCENE-8651: - Lucene Fields: New,Patch Available (was: New) > Tokenizer implementations can't be reset > > > Key: LUCENE-8651 > URL: https://issues.apache.org/jira/browse/LUCENE-8651 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Priority: Major > Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch > > > The fine print here is that they can't be reset without calling setReader() > every time before reset() is called. The reason for this is that Tokenizer > violates the contract put forth by TokenStream.reset() which is the following: > "Resets this stream to a clean state. Stateful implementations must implement > this method so that they can be reused, just as if they had been created > fresh." > Tokenizer implementation's reset function can't reset in that manner because > their Tokenizer.end() removes the reference to the underlying Reader because > of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep > around a Reader (memory leak) but we would like to be able to reset() if > necessary. > The patches include an integration test that attempts to use a > ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer > TokenStream. This test fails with an IllegalStateException thrown by > Tokenizer.ILLEGAL_STATE_READER. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8651) Tokenizer implementations can't be reset
[ https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746759#comment-16746759 ] Daniel Meehl commented on LUCENE-8651: -- As a little more of an explanation, all I did here was to replace the KeywordTokenStream (from the 1st patch) to a KeywordTokenizer. This causes the test to fail with an IllegalStateException because the KeywordTokenizer has it's end and then reset methods called which swaps out the previously set reader for the Tokenizer.ILLEGAL_STATE_READER. > Tokenizer implementations can't be reset > > > Key: LUCENE-8651 > URL: https://issues.apache.org/jira/browse/LUCENE-8651 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Priority: Major > Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch > > > The fine print here is that they can't be reset without calling setReader() > every time before reset() is called. The reason for this is that Tokenizer > violates the contract put forth by TokenStream.reset() which is the following: > "Resets this stream to a clean state. Stateful implementations must implement > this method so that they can be reused, just as if they had been created > fresh." > Tokenizer implementation's reset function can't reset in that manner because > their Tokenizer.end() removes the reference to the underlying Reader because > of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep > around a Reader (memory leak) but we would like to be able to reset() if > necessary. > The patches include an integration test that attempts to use a > ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer > TokenStream. This test fails with an IllegalStateException thrown by > Tokenizer.ILLEGAL_STATE_READER. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8651) Tokenizer implementations can't be reset
[ https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Meehl updated LUCENE-8651: - Attachment: LUCENE-8650-2.patch > Tokenizer implementations can't be reset > > > Key: LUCENE-8651 > URL: https://issues.apache.org/jira/browse/LUCENE-8651 > Project: Lucene - Core > Issue Type: Bug >Reporter: Daniel Meehl >Priority: Major > Attachments: LUCENE-8650-2.patch > > > The fine print here is that they can't be reset without calling setReader() > every time before reset() is called. The reason for this is that Tokenizer > violates the contract put forth by TokenStream.reset() which is the following: > "Resets this stream to a clean state. Stateful implementations must implement > this method so that they can be reused, just as if they had been created > fresh." > Tokenizer implementation's reset function can't reset in that manner because > their Tokenizer.end() removes the reference to the underlying Reader because > of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep > around a Reader (memory leak) but we would like to be able to reset() if > necessary. > The patches include an integration test that attempts to use a > ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer > TokenStream. This test fails with an IllegalStateException thrown by > Tokenizer.ILLEGAL_STATE_READER. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8651) Tokenizer implementations can't be reset
[ https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Meehl updated LUCENE-8651: - Component/s: modules/analysis > Tokenizer implementations can't be reset > > > Key: LUCENE-8651 > URL: https://issues.apache.org/jira/browse/LUCENE-8651 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Priority: Major > Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch > > > The fine print here is that they can't be reset without calling setReader() > every time before reset() is called. The reason for this is that Tokenizer > violates the contract put forth by TokenStream.reset() which is the following: > "Resets this stream to a clean state. Stateful implementations must implement > this method so that they can be reused, just as if they had been created > fresh." > Tokenizer implementation's reset function can't reset in that manner because > their Tokenizer.end() removes the reference to the underlying Reader because > of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep > around a Reader (memory leak) but we would like to be able to reset() if > necessary. > The patches include an integration test that attempts to use a > ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer > TokenStream. This test fails with an IllegalStateException thrown by > Tokenizer.ILLEGAL_STATE_READER. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset
[ https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746748#comment-16746748 ] Daniel Meehl edited comment on LUCENE-8651 at 1/18/19 10:57 PM: Since this was related to LUCENE-8650, I piggy-backed on the 2nd patch in that ticket to make things easier. I hope that's not a problem. This means that to run this test, you should apply both patches: 8650 first then 8651. was (Author: dmeehl): Since this was related to LUCENE-8650, I piggybacked on the 2nd patch in that ticket to make things easier. I hope that's not a problem. > Tokenizer implementations can't be reset > > > Key: LUCENE-8651 > URL: https://issues.apache.org/jira/browse/LUCENE-8651 > Project: Lucene - Core > Issue Type: Bug >Reporter: Daniel Meehl >Priority: Major > Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch > > > The fine print here is that they can't be reset without calling setReader() > every time before reset() is called. The reason for this is that Tokenizer > violates the contract put forth by TokenStream.reset() which is the following: > "Resets this stream to a clean state. Stateful implementations must implement > this method so that they can be reused, just as if they had been created > fresh." > Tokenizer implementation's reset function can't reset in that manner because > their Tokenizer.end() removes the reference to the underlying Reader because > of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep > around a Reader (memory leak) but we would like to be able to reset() if > necessary. > The patches include an integration test that attempts to use a > ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer > TokenStream. This test fails with an IllegalStateException thrown by > Tokenizer.ILLEGAL_STATE_READER. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746751#comment-16746751 ] Daniel Meehl commented on LUCENE-8650: -- [~romseygeek] Filed that ticket here: LUCENE-8651 > ConcatenatingTokenStream does not end() nor reset() properly > > > Key: LUCENE-8650 > URL: https://issues.apache.org/jira/browse/LUCENE-8650 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Assignee: Alan Woodward >Priority: Major > Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, > LUCENE-8650-3.patch > > > All (I think) TokenStream implementations set a "final offset" after calling > super.end() in their end() methods. ConcatenatingTokenStream fails to do > this. Because of this, it's final offset is not readable and > DefaultIndexingChain in turn fails to set the lastStartOffset properly. This > results in problems with indexing which can include unsearchable content or > IllegalStateExceptions. > > ConcatenatingTokenStream also fails to reset() properly. Specifically, it > does not set its currentSource and offsetIncrement back to 0. Because of > this, copyField directives (in the schema) do not work and content becomes > unsearchable. > I've created a few patches that illustrate the problem and then provide a fix. > The first patch enhances the TestConcatenatingTokensStream to check for > finalOffset, which as you can see ends up being 0. > I created the next patch separately because it includes extra classes used > for the testing that Lucene may or may not want to merge in. This patch adds > an integration test that loads some content into the 'text' field. The schema > then copies it to 'content' using a copyField directive. The test searches in > the content field for the loaded text and fails to find it even though the > field does contain the content. Flip the debug flag to see a nicer printout > of the response and what's in the index. Notice that the added class I > alluded to is KeywordTokenStream .This class had to be added because of > another (ultimately unrelated) problem: ConcatenatingTokenStream cannot > concatenate Tokenziers. This is because Tokenizer violates the contract put > forth by TokenStream.reset(). This separate problem warrants its own ticket, > though. However, ultimately KeywordTokenStream may be useful to others and > could be considered for adding to the repo. > The third patch finally fixes ConcatenatingTokenStream by storing and setting > a finalOffset as the last task in the end() method, and resetting > currentSource, offsetIncrement and finalOffset when reset() is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746751#comment-16746751 ] Daniel Meehl edited comment on LUCENE-8650 at 1/18/19 10:52 PM: [~romseygeek], I filed that ticket here: LUCENE-8651 was (Author: dmeehl): [~romseygeek] Filed that ticket here: LUCENE-8651 > ConcatenatingTokenStream does not end() nor reset() properly > > > Key: LUCENE-8650 > URL: https://issues.apache.org/jira/browse/LUCENE-8650 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Assignee: Alan Woodward >Priority: Major > Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, > LUCENE-8650-3.patch > > > All (I think) TokenStream implementations set a "final offset" after calling > super.end() in their end() methods. ConcatenatingTokenStream fails to do > this. Because of this, it's final offset is not readable and > DefaultIndexingChain in turn fails to set the lastStartOffset properly. This > results in problems with indexing which can include unsearchable content or > IllegalStateExceptions. > > ConcatenatingTokenStream also fails to reset() properly. Specifically, it > does not set its currentSource and offsetIncrement back to 0. Because of > this, copyField directives (in the schema) do not work and content becomes > unsearchable. > I've created a few patches that illustrate the problem and then provide a fix. > The first patch enhances the TestConcatenatingTokensStream to check for > finalOffset, which as you can see ends up being 0. > I created the next patch separately because it includes extra classes used > for the testing that Lucene may or may not want to merge in. This patch adds > an integration test that loads some content into the 'text' field. The schema > then copies it to 'content' using a copyField directive. The test searches in > the content field for the loaded text and fails to find it even though the > field does contain the content. Flip the debug flag to see a nicer printout > of the response and what's in the index. Notice that the added class I > alluded to is KeywordTokenStream .This class had to be added because of > another (ultimately unrelated) problem: ConcatenatingTokenStream cannot > concatenate Tokenziers. This is because Tokenizer violates the contract put > forth by TokenStream.reset(). This separate problem warrants its own ticket, > though. However, ultimately KeywordTokenStream may be useful to others and > could be considered for adding to the repo. > The third patch finally fixes ConcatenatingTokenStream by storing and setting > a finalOffset as the last task in the end() method, and resetting > currentSource, offsetIncrement and finalOffset when reset() is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8651) Tokenizer implementations can't be reset
[ https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746748#comment-16746748 ] Daniel Meehl commented on LUCENE-8651: -- Since this was related to LUCENE-8650, I piggybacked on the 2nd patch in that ticket to make things easier. I hope that's not a problem. > Tokenizer implementations can't be reset > > > Key: LUCENE-8651 > URL: https://issues.apache.org/jira/browse/LUCENE-8651 > Project: Lucene - Core > Issue Type: Bug >Reporter: Daniel Meehl >Priority: Major > Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch > > > The fine print here is that they can't be reset without calling setReader() > every time before reset() is called. The reason for this is that Tokenizer > violates the contract put forth by TokenStream.reset() which is the following: > "Resets this stream to a clean state. Stateful implementations must implement > this method so that they can be reused, just as if they had been created > fresh." > Tokenizer implementation's reset function can't reset in that manner because > their Tokenizer.end() removes the reference to the underlying Reader because > of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep > around a Reader (memory leak) but we would like to be able to reset() if > necessary. > The patches include an integration test that attempts to use a > ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer > TokenStream. This test fails with an IllegalStateException thrown by > Tokenizer.ILLEGAL_STATE_READER. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8651) Tokenizer implementations can't be reset
[ https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Meehl updated LUCENE-8651: - Attachment: LUCENE-8651.patch > Tokenizer implementations can't be reset > > > Key: LUCENE-8651 > URL: https://issues.apache.org/jira/browse/LUCENE-8651 > Project: Lucene - Core > Issue Type: Bug >Reporter: Daniel Meehl >Priority: Major > Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch > > > The fine print here is that they can't be reset without calling setReader() > every time before reset() is called. The reason for this is that Tokenizer > violates the contract put forth by TokenStream.reset() which is the following: > "Resets this stream to a clean state. Stateful implementations must implement > this method so that they can be reused, just as if they had been created > fresh." > Tokenizer implementation's reset function can't reset in that manner because > their Tokenizer.end() removes the reference to the underlying Reader because > of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep > around a Reader (memory leak) but we would like to be able to reset() if > necessary. > The patches include an integration test that attempts to use a > ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer > TokenStream. This test fails with an IllegalStateException thrown by > Tokenizer.ILLEGAL_STATE_READER. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8651) Tokenizer implementations can't be reset
Daniel Meehl created LUCENE-8651: Summary: Tokenizer implementations can't be reset Key: LUCENE-8651 URL: https://issues.apache.org/jira/browse/LUCENE-8651 Project: Lucene - Core Issue Type: Bug Reporter: Daniel Meehl The fine print here is that they can't be reset without calling setReader() every time before reset() is called. The reason for this is that Tokenizer violates the contract put forth by TokenStream.reset() which is the following: "Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh." Tokenizer implementation's reset function can't reset in that manner because their Tokenizer.end() removes the reference to the underlying Reader because of LUCENE-2387. The catch-22 here is that we don't want to unnecessarily keep around a Reader (memory leak) but we would like to be able to reset() if necessary. The patches include an integration test that attempts to use a ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer TokenStream. This test fails with an IllegalStateException thrown by Tokenizer.ILLEGAL_STATE_READER. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
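Concretely, the fine print above means every reuse cycle has to interleave setReader() with reset(). A sketch of the currently required calling pattern (the helper class and method names are illustrative only):

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;

public final class TokenizerCycles {
  /** Feeds each value through the tokenizer, installing a fresh Reader per cycle. */
  static void consumeAll(Tokenizer tokenizer, String... values) throws IOException {
    for (String value : values) {
      tokenizer.setReader(new StringReader(value)); // required before every reset()
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        // consume tokens
      }
      tokenizer.end();
      tokenizer.close(); // releases the Reader again, hence the catch-22
    }
  }
}
{code}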
[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746705#comment-16746705 ] Daniel Meehl commented on LUCENE-8650: -- [~romseygeek] Yes I will. The core issue is that Tokenizer implementations end up clearing their Reader when they end() and thus can never reset() without setting a new Reader. > ConcatenatingTokenStream does not end() nor reset() properly > > > Key: LUCENE-8650 > URL: https://issues.apache.org/jira/browse/LUCENE-8650 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Assignee: Alan Woodward >Priority: Major > Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, > LUCENE-8650-3.patch > > > All (I think) TokenStream implementations set a "final offset" after calling > super.end() in their end() methods. ConcatenatingTokenStream fails to do > this. Because of this, it's final offset is not readable and > DefaultIndexingChain in turn fails to set the lastStartOffset properly. This > results in problems with indexing which can include unsearchable content or > IllegalStateExceptions. > > ConcatenatingTokenStream also fails to reset() properly. Specifically, it > does not set its currentSource and offsetIncrement back to 0. Because of > this, copyField directives (in the schema) do not work and content becomes > unsearchable. > I've created a few patches that illustrate the problem and then provide a fix. > The first patch enhances the TestConcatenatingTokensStream to check for > finalOffset, which as you can see ends up being 0. > I created the next patch separately because it includes extra classes used > for the testing that Lucene may or may not want to merge in. This patch adds > an integration test that loads some content into the 'text' field. The schema > then copies it to 'content' using a copyField directive. The test searches in > the content field for the loaded text and fails to find it even though the > field does contain the content. Flip the debug flag to see a nicer printout > of the response and what's in the index. Notice that the added class I > alluded to is KeywordTokenStream .This class had to be added because of > another (ultimately unrelated) problem: ConcatenatingTokenStream cannot > concatenate Tokenziers. This is because Tokenizer violates the contract put > forth by TokenStream.reset(). This separate problem warrants its own ticket, > though. However, ultimately KeywordTokenStream may be useful to others and > could be considered for adding to the repo. > The third patch finally fixes ConcatenatingTokenStream by storing and setting > a finalOffset as the last task in the end() method, and resetting > currentSource, offsetIncrement and finalOffset when reset() is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746700#comment-16746700 ] Daniel Meehl commented on LUCENE-8650: -- Relates to LUCENE-2387 because that's the root cause of Tokenizer implementations not reset()'ing properly. > ConcatenatingTokenStream does not end() nor reset() properly > > > Key: LUCENE-8650 > URL: https://issues.apache.org/jira/browse/LUCENE-8650 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Assignee: Alan Woodward >Priority: Major > Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, > LUCENE-8650-3.patch > > > All (I think) TokenStream implementations set a "final offset" after calling > super.end() in their end() methods. ConcatenatingTokenStream fails to do > this. Because of this, it's final offset is not readable and > DefaultIndexingChain in turn fails to set the lastStartOffset properly. This > results in problems with indexing which can include unsearchable content or > IllegalStateExceptions. > > ConcatenatingTokenStream also fails to reset() properly. Specifically, it > does not set its currentSource and offsetIncrement back to 0. Because of > this, copyField directives (in the schema) do not work and content becomes > unsearchable. > I've created a few patches that illustrate the problem and then provide a fix. > The first patch enhances the TestConcatenatingTokensStream to check for > finalOffset, which as you can see ends up being 0. > I created the next patch separately because it includes extra classes used > for the testing that Lucene may or may not want to merge in. This patch adds > an integration test that loads some content into the 'text' field. The schema > then copies it to 'content' using a copyField directive. The test searches in > the content field for the loaded text and fails to find it even though the > field does contain the content. Flip the debug flag to see a nicer printout > of the response and what's in the index. Notice that the added class I > alluded to is KeywordTokenStream .This class had to be added because of > another (ultimately unrelated) problem: ConcatenatingTokenStream cannot > concatenate Tokenziers. This is because Tokenizer violates the contract put > forth by TokenStream.reset(). This separate problem warrants its own ticket, > though. However, ultimately KeywordTokenStream may be useful to others and > could be considered for adding to the repo. > The third patch finally fixes ConcatenatingTokenStream by storing and setting > a finalOffset as the last task in the end() method, and resetting > currentSource, offsetIncrement and finalOffset when reset() is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
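For reference, the LUCENE-2387 behavior in question lives in Tokenizer.close(); paraphrased from memory of the lucene-core source (check the actual class), it looks roughly like this, which is why a closed Tokenizer can only be revived by setReader():

{code:java}
@Override
public void close() throws IOException {
  input.close();
  // LUCENE-2387: don't hold onto the Reader after close() so it can be
  // garbage collected; both the current and pending inputs become the
  // ILLEGAL_STATE_READER, which throws IllegalStateException on any read.
  inputPending = input = ILLEGAL_STATE_READER;
}
{code}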
[jira] [Updated] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Meehl updated LUCENE-8650: - Description: All (I think) TokenStream implementations set a "final offset" after calling super.end() in their end() methods. ConcatenatingTokenStream fails to do this. Because of this, its final offset is not readable and DefaultIndexingChain in turn fails to set the lastStartOffset properly. This results in problems with indexing which can include unsearchable content or IllegalStateExceptions. ConcatenatingTokenStream also fails to reset() properly. Specifically, it does not set its currentSource and offsetIncrement back to 0. Because of this, copyField directives (in the schema) do not work and content becomes unsearchable. I've created a few patches that illustrate the problem and then provide a fix. The first patch enhances the TestConcatenatingTokenStream to check for finalOffset, which as you can see ends up being 0. I created the next patch separately because it includes extra classes used for the testing that Lucene may or may not want to merge in. This patch adds an integration test that loads some content into the 'text' field. The schema then copies it to 'content' using a copyField directive. The test searches in the content field for the loaded text and fails to find it even though the field does contain the content. Flip the debug flag to see a nicer printout of the response and what's in the index. Notice that the added class I alluded to is KeywordTokenStream. This class had to be added because of another (ultimately unrelated) problem: ConcatenatingTokenStream cannot concatenate Tokenizers. This is because Tokenizer violates the contract put forth by TokenStream.reset(). This separate problem warrants its own ticket, though. However, ultimately KeywordTokenStream may be useful to others and could be considered for adding to the repo. The third patch finally fixes ConcatenatingTokenStream by storing and setting a finalOffset as the last task in the end() method, and resetting currentSource, offsetIncrement and finalOffset when reset() is called. was: All (I think) TokenStream implementations set a "final offset" after calling super.end() in their end() methods. ConcatenatingTokenStream fails to do this. Because of this, it's final offset is not readable and DefaultIndexingChain in turn fails to set the lastStartOffset properly. This results in problems with indexing which can include unsearchable content or IllegalStateExceptions. ConcatenatingTokenStream also fails to reset() properly. Specifically, it does not set its currentSource and offsetIncrement back to 0. Because of this, copyField directives (in the schema) do not work and content becomes unsearchable. I've created a few patches that illustrate the problem and then provide a fix. The first patch enhances the TestConcatenatingTokensStream to check for finalOffset, which as you can see ends up being 0. I created the next patch separately because it includes extra classes used for the testing that Lucene may or may not want to merge in. This patch adds an integration test that loads some content into the 'text' field. The schema then copies it to 'content' using a copyField directive. The test searches in the content field for the loaded text and fails to find it even though the field does contain the content. Flip the debug flag to see a nicer printout of the response and what's in the index. 
Notice that the added class I alluded to is KeywordTokenStream .This class had to be added because of another (ultimately unrelated) problem: ConcatenatingTokenStream cannot concatenate Tokenziers. This is because Tokenizer violates the contract put forth by TokenStream.reset(). This separate problem warrants its own ticket, though. However, ultimately KeywordTokenStream may be useful to others and and could be considered for adding to the repo. The third patch finally fixes ConcatenatingTokenStream by storing and setting a finalOffset as the last task in the end() method, and resetting currentSource, offsetIncrement and finalOffset when reset() is called. > ConcatenatingTokenStream does not end() nor reset() properly > > > Key: LUCENE-8650 > URL: https://issues.apache.org/jira/browse/LUCENE-8650 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Priority: Major > Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, > LUCENE-8650-3.patch > > > All (I think) TokenStream implementations set a "final offset" after calling > super.end() in their end() methods. ConcatenatingTokenStream fails to do > this. Because of this, it's final offset is not readable and >
[jira] [Updated] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Meehl updated LUCENE-8650: - Attachment: LUCENE-8650-3.patch LUCENE-8650-2.patch LUCENE-8650-1.patch > ConcatenatingTokenStream does not end() nor reset() properly > > > Key: LUCENE-8650 > URL: https://issues.apache.org/jira/browse/LUCENE-8650 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Priority: Major > Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, > LUCENE-8650-3.patch > > > All (I think) TokenStream implementations set a "final offset" after calling > super.end() in their end() methods. ConcatenatingTokenStream fails to do > this. Because of this, it's final offset is not readable and > DefaultIndexingChain in turn fails to set the lastStartOffset properly. This > results in problems with indexing which can include unsearchable content or > IllegalStateExceptions. > > ConcatenatingTokenStream also fails to reset() properly. Specifically, it > does not set its currentSource and offsetIncrement back to 0. Because of > this, copyField directives (in the schema) do not work and content becomes > unsearchable. > I've created a few patches that illustrate the problem and then provide a fix. > The first patch enhances the TestConcatenatingTokensStream to check for > finalOffset, which as you can see ends up being 0. > I created the next patch separately because it includes extra classes used > for the testing that Lucene may or may not want to merge in. This patch adds > an integration test that loads some content into the 'text' field. The schema > then copies it to 'content' using a copyField directive. The test searches in > the content field for the loaded text and fails to find it even though the > field does contain the content. Flip the debug flag to see a nicer printout > of the response and what's in the index. Notice that the added class I > alluded to is KeywordTokenStream .This class had to be added because of > another (ultimately unrelated) problem: ConcatenatingTokenStream cannot > concatenate Tokenziers. This is because Tokenizer violates the contract put > forth by TokenStream.reset(). This separate problem warrants its own ticket, > though. However, ultimately KeywordTokenStream may be useful to others and > and could be considered for adding to the repo. > The third patch finally fixes ConcatenatingTokenStream by storing and setting > a finalOffset as the last task in the end() method, and resetting > currentSource, offsetIncrement and finalOffset when reset() is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Meehl updated LUCENE-8650: - Attachment: ConcatTokenFilterFactory.java > ConcatenatingTokenStream does not end() nor reset() properly > > > Key: LUCENE-8650 > URL: https://issues.apache.org/jira/browse/LUCENE-8650 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Priority: Major > > All (I think) TokenStream implementations set a "final offset" after calling > super.end() in their end() methods. ConcatenatingTokenStream fails to do > this. Because of this, it's final offset is not readable and > DefaultIndexingChain in turn fails to set the lastStartOffset properly. This > results in problems with indexing which can include unsearchable content or > IllegalStateExceptions. > > ConcatenatingTokenStream also fails to reset() properly. Specifically, it > does not set its currentSource and offsetIncrement back to 0. Because of > this, copyField directives (in the schema) do not work and content becomes > unsearchable. > I've created a few patches that illustrate the problem and then provide a fix. > The first patch enhances the TestConcatenatingTokensStream to check for > finalOffset, which as you can see ends up being 0. > I created the next patch separately because it includes extra classes used > for the testing that Lucene may or may not want to merge in. This patch adds > an integration test that loads some content into the 'text' field. The schema > then copies it to 'content' using a copyField directive. The test searches in > the content field for the loaded text and fails to find it even though the > field does contain the content. Flip the debug flag to see a nicer printout > of the response and what's in the index. Notice that the added class I > alluded to is KeywordTokenStream .This class had to be added because of > another (ultimately unrelated) problem: ConcatenatingTokenStream cannot > concatenate Tokenziers. This is because Tokenizer violates the contract put > forth by TokenStream.reset(). This separate problem warrants its own ticket, > though. However, ultimately KeywordTokenStream may be useful to others and > and could be considered for adding to the repo. > The third patch finally fixes ConcatenatingTokenStream by storing and setting > a finalOffset as the last task in the end() method, and resetting > currentSource, offsetIncrement and finalOffset when reset() is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
[ https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Meehl updated LUCENE-8650: - Attachment: (was: ConcatTokenFilterFactory.java) > ConcatenatingTokenStream does not end() nor reset() properly > > > Key: LUCENE-8650 > URL: https://issues.apache.org/jira/browse/LUCENE-8650 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Daniel Meehl >Priority: Major > > All (I think) TokenStream implementations set a "final offset" after calling > super.end() in their end() methods. ConcatenatingTokenStream fails to do > this. Because of this, it's final offset is not readable and > DefaultIndexingChain in turn fails to set the lastStartOffset properly. This > results in problems with indexing which can include unsearchable content or > IllegalStateExceptions. > > ConcatenatingTokenStream also fails to reset() properly. Specifically, it > does not set its currentSource and offsetIncrement back to 0. Because of > this, copyField directives (in the schema) do not work and content becomes > unsearchable. > I've created a few patches that illustrate the problem and then provide a fix. > The first patch enhances the TestConcatenatingTokensStream to check for > finalOffset, which as you can see ends up being 0. > I created the next patch separately because it includes extra classes used > for the testing that Lucene may or may not want to merge in. This patch adds > an integration test that loads some content into the 'text' field. The schema > then copies it to 'content' using a copyField directive. The test searches in > the content field for the loaded text and fails to find it even though the > field does contain the content. Flip the debug flag to see a nicer printout > of the response and what's in the index. Notice that the added class I > alluded to is KeywordTokenStream .This class had to be added because of > another (ultimately unrelated) problem: ConcatenatingTokenStream cannot > concatenate Tokenziers. This is because Tokenizer violates the contract put > forth by TokenStream.reset(). This separate problem warrants its own ticket, > though. However, ultimately KeywordTokenStream may be useful to others and > and could be considered for adding to the repo. > The third patch finally fixes ConcatenatingTokenStream by storing and setting > a finalOffset as the last task in the end() method, and resetting > currentSource, offsetIncrement and finalOffset when reset() is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly
Daniel Meehl created LUCENE-8650: Summary: ConcatenatingTokenStream does not end() nor reset() properly Key: LUCENE-8650 URL: https://issues.apache.org/jira/browse/LUCENE-8650 Project: Lucene - Core Issue Type: Bug Reporter: Daniel Meehl All (I think) TokenStream implementations set a "final offset" after calling super.end() in their end() methods. ConcatenatingTokenStream fails to do this. Because of this, it's final offset is not readable and DefaultIndexingChain in turn fails to set the lastStartOffset properly. This results in problems with indexing which can include unsearchable content or IllegalStateExceptions. ConcatenatingTokenStream also fails to reset() properly. Specifically, it does not set its currentSource and offsetIncrement back to 0. Because of this, copyField directives (in the schema) do not work and content becomes unsearchable. I've created a few patches that illustrate the problem and then provide a fix. The first patch enhances the TestConcatenatingTokensStream to check for finalOffset, which as you can see ends up being 0. I created the next patch separately because it includes extra classes used for the testing that Lucene may or may not want to merge in. This patch adds an integration test that loads some content into the 'text' field. The schema then copies it to 'content' using a copyField directive. The test searches in the content field for the loaded text and fails to find it even though the field does contain the content. Flip the debug flag to see a nicer printout of the response and what's in the index. Notice that the added class I alluded to is KeywordTokenStream .This class had to be added because of another (ultimately unrelated) problem: ConcatenatingTokenStream cannot concatenate Tokenziers. This is because Tokenizer violates the contract put forth by TokenStream.reset(). This separate problem warrants its own ticket, though. However, ultimately KeywordTokenStream may be useful to others and and could be considered for adding to the repo. The third patch finally fixes ConcatenatingTokenStream by storing and setting a finalOffset as the last task in the end() method, and resetting currentSource, offsetIncrement and finalOffset when reset() is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
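From that description, the shape of the third patch's fix is roughly the following (a sketch reconstructed from the prose above; the field and attribute names are assumptions, and LUCENE-8650-3.patch is authoritative):

{code:java}
// Inside ConcatenatingTokenStream:
private int currentSource;   // which source stream is currently being consumed
private int offsetIncrement; // offset shift applied to later sources
private int finalOffset;     // assumed to be tracked as tokens are consumed

@Override
public void end() throws IOException {
  super.end();
  // Storing/setting the final offset as the last task makes it readable
  // by consumers such as DefaultIndexingChain.
  offsetAtt.setOffset(finalOffset, finalOffset);
}

@Override
public void reset() throws IOException {
  super.reset();
  currentSource = 0;
  offsetIncrement = 0;
  finalOffset = 0;
}
{code}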
[jira] [Updated] (SOLR-12328) Adding graph json facet domain change
[ https://issues.apache.org/jira/browse/SOLR-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Meehl updated SOLR-12328: Attachment: SOLR-12328.patch > Adding graph json facet domain change > - > > Key: SOLR-12328 > URL: https://issues.apache.org/jira/browse/SOLR-12328 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Facet Module >Affects Versions: 7.3 >Reporter: Daniel Meehl >Priority: Major > Attachments: SOLR-12328.patch > > > Json facets now support join queries via domain change. I've made a > relatively small enhancement to add graph to the mix. I'll attach a patch for > your viewing. I'm hoping this can be merged into solr proper. Please let me > know if there are any problems/changes/requirements. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-12328) Adding graph json facet domain change
Daniel Meehl created SOLR-12328: --- Summary: Adding graph json facet domain change Key: SOLR-12328 URL: https://issues.apache.org/jira/browse/SOLR-12328 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: Facet Module Affects Versions: 7.3 Reporter: Daniel Meehl Json facets now support join queries via domain change. I've made a relatively small enhancement to add graph to the mix. I'll attach a patch for your viewing. I'm hoping this can be merged into solr proper. Please let me know if there are any problems/changes/requirements. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org