[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746759#comment-16746759
 ] 

Daniel Meehl edited comment on LUCENE-8651 at 1/19/19 12:58 AM:


As a little more of an explanation, all I did here was to replace the 
KeywordTokenStream (from the 1st patch) with a KeywordTokenizer. This causes 
the test to fail with an IllegalStateException, because the KeywordTokenizer 
has its close() and then reset() methods called, which swaps out the 
previously set reader for Tokenizer.ILLEGAL_STATE_READER.
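
A minimal sketch of that failure mode (illustrative only, not the attached test):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;

// Sketch: reusing a Tokenizer after close() without an intervening
// setReader() leaves it reading from Tokenizer.ILLEGAL_STATE_READER.
public class TokenizerReuseSketch {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new KeywordTokenizer();
    tok.setReader(new StringReader("first pass"));
    tok.reset();
    while (tok.incrementToken()) { /* consume */ }
    tok.end();
    tok.close(); // swaps the Reader for Tokenizer.ILLEGAL_STATE_READER

    tok.reset();          // no setReader() first, so the pending reader is illegal
    tok.incrementToken(); // reading now throws IllegalStateException
  }
}
{code}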


was (Author: dmeehl):
As a little more of an explanation, all I did here was to replace the 
KeywordTokenStream (from the 1st patch) with a KeywordTokenizer. This causes 
the test to fail with an IllegalStateException, because the KeywordTokenizer 
has its end() and then reset() methods called, which swaps out the previously 
set reader for Tokenizer.ILLEGAL_STATE_READER.

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementations' reset() methods can't reset in that manner because 
> Tokenizer.close() removes the reference to the underlying Reader (a 
> consequence of LUCENE-2387). The catch-22 here is that we don't want to 
> unnecessarily keep a Reader around (memory leak), but we would like to be 
> able to reset() if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  






[jira] [Comment Edited] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746705#comment-16746705
 ] 

Daniel Meehl edited comment on LUCENE-8650 at 1/19/19 1:00 AM:
---

[~romseygeek] Yes I will. The core issue is that Tokenizer implementations end 
up clearing their Reader when they close() and thus can never reset() without 
setting a new Reader.


was (Author: dmeehl):
[~romseygeek] Yes I will. The core issue is that Tokenizer implementations end 
up clearing their Reader when they end() and thus can never reset() without 
setting a new Reader.

> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). This separate problem warrants its own ticket, 
> though. However, ultimately KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Updated] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Meehl updated LUCENE-8651:
-
Description: 
The fine print here is that they can't be reset without calling setReader() 
every time before reset() is called. The reason for this is that Tokenizer 
violates the contract put forth by TokenStream.reset() which is the following:

"Resets this stream to a clean state. Stateful implementations must implement 
this method so that they can be reused, just as if they had been created fresh."

Tokenizer implementations' reset() methods can't reset in that manner because 
Tokenizer.close() removes the reference to the underlying Reader (a consequence 
of LUCENE-2387). The catch-22 here is that we don't want to unnecessarily keep 
a Reader around (memory leak), but we would like to be able to reset() if 
necessary.
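
For illustration, a minimal sketch of the reuse pattern this currently forces 
(KeywordTokenizer is used only for concreteness):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;

// Sketch: the only reuse pattern that works today is to hand the Tokenizer
// a fresh Reader before every reset().
public class TokenizerReusePattern {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new KeywordTokenizer();
    for (String value : new String[] {"one", "two"}) {
      tok.setReader(new StringReader(value)); // required on every pass
      tok.reset();
      while (tok.incrementToken()) { /* consume */ }
      tok.end();
      tok.close(); // drops the Reader reference (LUCENE-2387)
    }
  }
}
{code}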

The patches include an integration test that attempts to use a 
ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
TokenStream. This test fails with an IllegalStateException thrown by 
Tokenizer.ILLEGAL_STATE_READER.

 

  was:
The fine print here is that they can't be reset without calling setReader() 
every time before reset() is called. The reason for this is that Tokenizer 
violates the contract put forth by TokenStream.reset() which is the following:

"Resets this stream to a clean state. Stateful implementations must implement 
this method so that they can be reused, just as if they had been created fresh."

Tokenizer implementations' reset() methods can't reset in that manner because 
Tokenizer.end() removes the reference to the underlying Reader (a consequence 
of LUCENE-2387). The catch-22 here is that we don't want to unnecessarily keep 
a Reader around (memory leak), but we would like to be able to reset() if 
necessary.

The patches include an integration test that attempts to use a 
ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
TokenStream. This test fails with an IllegalStateException thrown by 
Tokenizer.ILLEGAL_STATE_READER.

 


> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementations' reset() methods can't reset in that manner because 
> Tokenizer.close() removes the reference to the underlying Reader (a 
> consequence of LUCENE-2387). The catch-22 here is that we don't want to 
> unnecessarily keep a Reader around (memory leak), but we would like to be 
> able to reset() if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746828#comment-16746828
 ] 

Daniel Meehl commented on LUCENE-8650:
--

I do have one more question related to the fix. I noticed that most of the 
other TokenStream implementations end up setting their final offset like the 
code below. Why do they do this? Why set the start and end to the same value? 
In my patch, I set the start to 0 and the end to finalOffset, because this 
seems like the correct thing to do.

{{offsetAtt.setOffset(finalOffset, finalOffset);}}
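
For reference, the conventional shape in question (as in, e.g., 
KeywordTokenizer), which marks a zero-length position just past the last 
character rather than a span over the whole input:

{code:java}
@Override
public void end() throws IOException {
  super.end();
  // finalOffset was recorded while tokenizing; start == end marks the
  // end-of-stream position rather than a span over the input
  offsetAtt.setOffset(finalOffset, finalOffset);
}
{code}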

> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). This separate problem warrants its own ticket, 
> though. However, ultimately KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Updated] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Meehl updated LUCENE-8650:
-
Description: 
All (I think) TokenStream implementations set a "final offset" after calling 
super.end() in their end() methods. ConcatenatingTokenStream fails to do this. 
Because of this, its final offset is not readable and DefaultIndexingChain in 
turn fails to set the lastStartOffset properly. This results in problems with 
indexing which can include unsearchable content or IllegalStateExceptions.

ConcatenatingTokenStream also fails to reset() properly. Specifically, it does 
not set its currentSource and offsetIncrement back to 0. Because of this, 
copyField directives (in the schema) do not work and content becomes 
unsearchable.

I've created a few patches that illustrate the problem and then provide a fix.

The first patch enhances TestConcatenatingTokenStream to check for 
finalOffset, which as you can see ends up being 0.

I created the next patch separately because it includes extra classes used for 
the testing that Lucene may or may not want to merge in. This patch adds an 
integration test that loads some content into the 'text' field. The schema then 
copies it to 'content' using a copyField directive. The test searches in the 
content field for the loaded text and fails to find it even though the field 
does contain the content. Flip the debug flag to see a nicer printout of the 
response and what's in the index. Notice that the added class I alluded to is 
KeywordTokenStream. This class had to be added because of another (ultimately 
unrelated) problem: ConcatenatingTokenStream cannot concatenate Tokenizers. 
This is because Tokenizer violates the contract put forth by 
TokenStream.reset(). This separate problem warrants its own ticket, though. 
However, ultimately KeywordTokenStream may be useful to others and could be 
considered for adding to the repo.
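
For a sense of its shape, here is a hypothetical sketch of such a class (the 
attached patch's version may differ):

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Hypothetical sketch: emits its input String as a single token and, unlike
// a Tokenizer, honors the TokenStream.reset() contract with no Reader involved.
public final class KeywordTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final String value;
  private boolean done = false;

  public KeywordTokenStream(String value) {
    this.value = value;
  }

  @Override
  public boolean incrementToken() {
    if (done) return false;
    clearAttributes();
    termAtt.append(value);
    offsetAtt.setOffset(0, value.length());
    done = true;
    return true;
  }

  @Override
  public void end() throws IOException {
    super.end();
    offsetAtt.setOffset(value.length(), value.length());
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false; // fully reusable, as the reset() contract requires
  }
}
{code}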

The third patch finally fixes ConcatenatingTokenStream by storing and setting a 
finalOffset as the last task in the end() method, and resetting currentSource, 
offsetIncrement and finalOffset when reset() is called.
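
A sketch of that fix, using the names from the description (the committed 
patch may differ in detail):

{code:java}
@Override
public void end() throws IOException {
  super.end();
  // finalOffset is recorded while the last source is consumed (bookkeeping
  // elided); exposing it is the last task of end()
  offsetAtt.setOffset(finalOffset, finalOffset);
}

@Override
public void reset() throws IOException {
  super.reset();
  currentSource = 0;   // replay from the first concatenated source
  offsetIncrement = 0; // offsets start over from 0
  finalOffset = 0;
}
{code}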

  was:
All (I think) TokenStream implementations set a "final offset" after calling 
super.end() in their end() methods. ConcatenatingTokenStream fails to do this. 
Because of this, its final offset is not readable and DefaultIndexingChain in 
turn fails to set the lastStartOffset properly. This results in problems with 
indexing which can include unsearchable content or IllegalStateExceptions.

 

ConcatenatingTokenStream also fails to reset() properly. Specifically, it does 
not set its currentSource and offsetIncrement back to 0. Because of this, 
copyField directives (in the schema) do not work and content becomes 
unsearchable.

I've created a few patches that illustrate the problem and then provide a fix.

The first patch enhances TestConcatenatingTokenStream to check for 
finalOffset, which as you can see ends up being 0.

I created the next patch separately because it includes extra classes used for 
the testing that Lucene may or may not want to merge in. This patch adds an 
integration test that loads some content into the 'text' field. The schema then 
copies it to 'content' using a copyField directive. The test searches in the 
content field for the loaded text and fails to find it even though the field 
does contain the content. Flip the debug flag to see a nicer printout of the 
response and what's in the index. Notice that the added class I alluded to is 
KeywordTokenStream. This class had to be added because of another (ultimately 
unrelated) problem: ConcatenatingTokenStream cannot concatenate Tokenizers. 
This is because Tokenizer violates the contract put forth by 
TokenStream.reset(). This separate problem warrants its own ticket, though. 
However, ultimately KeywordTokenStream may be useful to others and could be 
considered for adding to the repo.

The third patch finally fixes ConcatenatingTokenStream by storing and setting a 
finalOffset as the last task in the end() method, and resetting currentSource, 
offsetIncrement and finalOffset when reset() is called.


> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not 

[jira] [Updated] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Meehl updated LUCENE-8651:
-
Lucene Fields: New,Patch Available  (was: New)

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementations' reset() methods can't reset in that manner because 
> Tokenizer.end() removes the reference to the underlying Reader (a consequence 
> of LUCENE-2387). The catch-22 here is that we don't want to unnecessarily 
> keep a Reader around (memory leak), but we would like to be able to reset() 
> if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  






[jira] [Commented] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746759#comment-16746759
 ] 

Daniel Meehl commented on LUCENE-8651:
--

As a little more of an explanation, all I did here was to replace the 
KeywordTokenStream (from the 1st patch) with a KeywordTokenizer. This causes 
the test to fail with an IllegalStateException, because the KeywordTokenizer 
has its end() and then reset() methods called, which swaps out the previously 
set reader for Tokenizer.ILLEGAL_STATE_READER.

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementations' reset() methods can't reset in that manner because 
> Tokenizer.end() removes the reference to the underlying Reader (a consequence 
> of LUCENE-2387). The catch-22 here is that we don't want to unnecessarily 
> keep a Reader around (memory leak), but we would like to be able to reset() 
> if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  






[jira] [Updated] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Meehl updated LUCENE-8651:
-
Attachment: LUCENE-8650-2.patch

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementations' reset() methods can't reset in that manner because 
> Tokenizer.end() removes the reference to the underlying Reader (a consequence 
> of LUCENE-2387). The catch-22 here is that we don't want to unnecessarily 
> keep a Reader around (memory leak), but we would like to be able to reset() 
> if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  






[jira] [Updated] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Meehl updated LUCENE-8651:
-
Component/s: modules/analysis

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementations' reset() methods can't reset in that manner because 
> Tokenizer.end() removes the reference to the underlying Reader (a consequence 
> of LUCENE-2387). The catch-22 here is that we don't want to unnecessarily 
> keep a Reader around (memory leak), but we would like to be able to reset() 
> if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  






[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746748#comment-16746748
 ] 

Daniel Meehl edited comment on LUCENE-8651 at 1/18/19 10:57 PM:


Since this was related to LUCENE-8650, I piggy-backed on the 2nd patch in that 
ticket to make things easier. I hope that's not a problem. This means that to 
run this test, you should apply both patches: 8650 first then 8651.


was (Author: dmeehl):
Since this was related to LUCENE-8650, I piggybacked on the 2nd patch in that 
ticket to make things easier. I hope that's not a problem.

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementations' reset() methods can't reset in that manner because 
> Tokenizer.end() removes the reference to the underlying Reader (a consequence 
> of LUCENE-2387). The catch-22 here is that we don't want to unnecessarily 
> keep a Reader around (memory leak), but we would like to be able to reset() 
> if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746751#comment-16746751
 ] 

Daniel Meehl commented on LUCENE-8650:
--

[~romseygeek] Filed that ticket here: LUCENE-8651

> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). This separate problem warrants its own ticket, 
> though. However, ultimately KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Comment Edited] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746751#comment-16746751
 ] 

Daniel Meehl edited comment on LUCENE-8650 at 1/18/19 10:52 PM:


[~romseygeek], I filed that ticket here: LUCENE-8651


was (Author: dmeehl):
[~romseygeek] Filed that ticket here: LUCENE-8651

> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). This separate problem warrants its own ticket, 
> though. However, ultimately KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Commented] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746748#comment-16746748
 ] 

Daniel Meehl commented on LUCENE-8651:
--

Since this was related to LUCENE-8650, I piggybacked on the 2nd patch in that 
ticket to make things easier. I hope that's not a problem.

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementations' reset() methods can't reset in that manner because 
> Tokenizer.end() removes the reference to the underlying Reader (a consequence 
> of LUCENE-2387). The catch-22 here is that we don't want to unnecessarily 
> keep a Reader around (memory leak), but we would like to be able to reset() 
> if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  






[jira] [Updated] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Meehl updated LUCENE-8651:
-
Attachment: LUCENE-8651.patch

> Tokenizer implementations can't be reset
> 
>
> Key: LUCENE-8651
> URL: https://issues.apache.org/jira/browse/LUCENE-8651
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() 
> every time before reset() is called. The reason for this is that Tokenizer 
> violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement 
> this method so that they can be reused, just as if they had been created 
> fresh."
> Tokenizer implementations' reset() methods can't reset in that manner because 
> Tokenizer.end() removes the reference to the underlying Reader (a consequence 
> of LUCENE-2387). The catch-22 here is that we don't want to unnecessarily 
> keep a Reader around (memory leak), but we would like to be able to reset() 
> if necessary.
> The patches include an integration test that attempts to use a 
> ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
> TokenStream. This test fails with an IllegalStateException thrown by 
> Tokenizer.ILLEGAL_STATE_READER.
>  






[jira] [Created] (LUCENE-8651) Tokenizer implementations can't be reset

2019-01-18 Thread Daniel Meehl (JIRA)
Daniel Meehl created LUCENE-8651:


 Summary: Tokenizer implementations can't be reset
 Key: LUCENE-8651
 URL: https://issues.apache.org/jira/browse/LUCENE-8651
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Daniel Meehl


The fine print here is that they can't be reset without calling setReader() 
every time before reset() is called. The reason for this is that Tokenizer 
violates the contract put forth by TokenStream.reset() which is the following:

"Resets this stream to a clean state. Stateful implementations must implement 
this method so that they can be reused, just as if they had been created fresh."

Tokenizer implementations' reset() methods can't reset in that manner because 
Tokenizer.end() removes the reference to the underlying Reader (a consequence 
of LUCENE-2387). The catch-22 here is that we don't want to unnecessarily keep 
a Reader around (memory leak), but we would like to be able to reset() if 
necessary.

The patches include an integration test that attempts to use a 
ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer 
TokenStream. This test fails with an IllegalStateException thrown by 
Tokenizer.ILLEGAL_STATE_READER.

 






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746705#comment-16746705
 ] 

Daniel Meehl commented on LUCENE-8650:
--

[~romseygeek] Yes I will. The core issue is that Tokenizer implementations end 
up clearing their Reader when they end() and thus can never reset() without 
setting a new Reader.

> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). This separate problem warrants its own ticket, 
> though. However, ultimately KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Commented] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746700#comment-16746700
 ] 

Daniel Meehl commented on LUCENE-8650:
--

Relates to LUCENE-2387, because that's the root cause of Tokenizer 
implementations not being able to reset() properly.

> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). This separate problem warrants its own ticket, 
> though. However, ultimately KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Updated] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Meehl updated LUCENE-8650:
-
Description: 
All (I think) TokenStream implementations set a "final offset" after calling 
super.end() in their end() methods. ConcatenatingTokenStream fails to do this. 
Because of this, its final offset is not readable and DefaultIndexingChain in 
turn fails to set the lastStartOffset properly. This results in problems with 
indexing which can include unsearchable content or IllegalStateExceptions.

 

ConcatenatingTokenStream also fails to reset() properly. Specifically, it does 
not set its currentSource and offsetIncrement back to 0. Because of this, 
copyField directives (in the schema) do not work and content becomes 
unsearchable.

I've created a few patches that illustrate the problem and then provide a fix.

The first patch enhances TestConcatenatingTokenStream to check for 
finalOffset, which as you can see ends up being 0.

I created the next patch separately because it includes extra classes used for 
the testing that Lucene may or may not want to merge in. This patch adds an 
integration test that loads some content into the 'text' field. The schema then 
copies it to 'content' using a copyField directive. The test searches in the 
content field for the loaded text and fails to find it even though the field 
does contain the content. Flip the debug flag to see a nicer printout of the 
response and what's in the index. Notice that the added class I alluded to is 
KeywordTokenStream. This class had to be added because of another (ultimately 
unrelated) problem: ConcatenatingTokenStream cannot concatenate Tokenizers. 
This is because Tokenizer violates the contract put forth by 
TokenStream.reset(). This separate problem warrants its own ticket, though. 
However, ultimately KeywordTokenStream may be useful to others and could be 
considered for adding to the repo.

The third patch finally fixes ConcatenatingTokenStream by storing and setting a 
finalOffset as the last task in the end() method, and resetting currentSource, 
offsetIncrement and finalOffset when reset() is called.

  was:
All (I think) TokenStream implementations set a "final offset" after calling 
super.end() in their end() methods. ConcatenatingTokenStream fails to do this. 
Because of this, its final offset is not readable and DefaultIndexingChain in 
turn fails to set the lastStartOffset properly. This results in problems with 
indexing which can include unsearchable content or IllegalStateExceptions.

 

ConcatenatingTokenStream also fails to reset() properly. Specifically, it does 
not set its currentSource and offsetIncrement back to 0. Because of this, 
copyField directives (in the schema) do not work and content becomes 
unsearchable.

I've created a few patches that illustrate the problem and then provide a fix.

The first patch enhances TestConcatenatingTokenStream to check for 
finalOffset, which as you can see ends up being 0.

I created the next patch separately because it includes extra classes used for 
the testing that Lucene may or may not want to merge in. This patch adds an 
integration test that loads some content into the 'text' field. The schema then 
copies it to 'content' using a copyField directive. The test searches in the 
content field for the loaded text and fails to find it even though the field 
does contain the content. Flip the debug flag to see a nicer printout of the 
response and what's in the index. Notice that the added class I alluded to is 
KeywordTokenStream. This class had to be added because of another (ultimately 
unrelated) problem: ConcatenatingTokenStream cannot concatenate Tokenizers. 
This is because Tokenizer violates the contract put forth by 
TokenStream.reset(). This separate problem warrants its own ticket, though. 
However, ultimately KeywordTokenStream may be useful to others and could be 
considered for adding to the repo.

The third patch finally fixes ConcatenatingTokenStream by storing and setting a 
finalOffset as the last task in the end() method, and resetting currentSource, 
offsetIncrement and finalOffset when reset() is called.


> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> 

[jira] [Updated] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Meehl updated LUCENE-8650:
-
Attachment: LUCENE-8650-3.patch
LUCENE-8650-2.patch
LUCENE-8650-1.patch

> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: LUCENE-8650-1.patch, LUCENE-8650-2.patch, 
> LUCENE-8650-3.patch
>
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). This separate problem warrants its own ticket, 
> though. However, ultimately KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Updated] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Meehl updated LUCENE-8650:
-
Attachment: ConcatTokenFilterFactory.java

> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Priority: Major
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). This separate problem warrants its own ticket, 
> though. However, ultimately KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Updated] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Meehl updated LUCENE-8650:
-
Attachment: (was: ConcatTokenFilterFactory.java)

> ConcatenatingTokenStream does not end() nor reset() properly
> 
>
> Key: LUCENE-8650
> URL: https://issues.apache.org/jira/browse/LUCENE-8650
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Daniel Meehl
>Priority: Major
>
> All (I think) TokenStream implementations set a "final offset" after calling 
> super.end() in their end() methods. ConcatenatingTokenStream fails to do 
> this. Because of this, its final offset is not readable and 
> DefaultIndexingChain in turn fails to set the lastStartOffset properly. This 
> results in problems with indexing which can include unsearchable content or 
> IllegalStateExceptions.
>  
> ConcatenatingTokenStream also fails to reset() properly. Specifically, it 
> does not set its currentSource and offsetIncrement back to 0. Because of 
> this, copyField directives (in the schema) do not work and content becomes 
> unsearchable.
> I've created a few patches that illustrate the problem and then provide a fix.
> The first patch enhances TestConcatenatingTokenStream to check for 
> finalOffset, which as you can see ends up being 0.
> I created the next patch separately because it includes extra classes used 
> for the testing that Lucene may or may not want to merge in. This patch adds 
> an integration test that loads some content into the 'text' field. The schema 
> then copies it to 'content' using a copyField directive. The test searches in 
> the content field for the loaded text and fails to find it even though the 
> field does contain the content. Flip the debug flag to see a nicer printout 
> of the response and what's in the index. Notice that the added class I 
> alluded to is KeywordTokenStream. This class had to be added because of 
> another (ultimately unrelated) problem: ConcatenatingTokenStream cannot 
> concatenate Tokenizers. This is because Tokenizer violates the contract put 
> forth by TokenStream.reset(). This separate problem warrants its own ticket, 
> though. However, ultimately KeywordTokenStream may be useful to others and 
> could be considered for adding to the repo.
> The third patch finally fixes ConcatenatingTokenStream by storing and setting 
> a finalOffset as the last task in the end() method, and resetting 
> currentSource, offsetIncrement and finalOffset when reset() is called.






[jira] [Created] (LUCENE-8650) ConcatenatingTokenStream does not end() nor reset() properly

2019-01-18 Thread Daniel Meehl (JIRA)
Daniel Meehl created LUCENE-8650:


 Summary: ConcatenatingTokenStream does not end() nor reset() 
properly
 Key: LUCENE-8650
 URL: https://issues.apache.org/jira/browse/LUCENE-8650
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Daniel Meehl


All (I think) TokenStream implementations set a "final offset" after calling 
super.end() in their end() methods. ConcatenatingTokenStream fails to do this. 
Because of this, its final offset is not readable and DefaultIndexingChain in 
turn fails to set the lastStartOffset properly. This results in problems with 
indexing which can include unsearchable content or IllegalStateExceptions.

 

ConcatenatingTokenStream also fails to reset() properly. Specifically, it does 
not set its currentSource and offsetIncrement back to 0. Because of this, 
copyField directives (in the schema) do not work and content becomes 
unsearchable.

I've created a few patches that illustrate the problem and then provide a fix.

The first patch enhances TestConcatenatingTokenStream to check for 
finalOffset, which as you can see ends up being 0.

I created the next patch separately because it includes extra classes used for 
the testing that Lucene may or may not want to merge in. This patch adds an 
integration test that loads some content into the 'text' field. The schema then 
copies it to 'content' using a copyField directive. The test searches in the 
content field for the loaded text and fails to find it even though the field 
does contain the content. Flip the debug flag to see a nicer printout of the 
response and what's in the index. Notice that the added class I alluded to is 
KeywordTokenStream. This class had to be added because of another (ultimately 
unrelated) problem: ConcatenatingTokenStream cannot concatenate Tokenizers. 
This is because Tokenizer violates the contract put forth by 
TokenStream.reset(). This separate problem warrants its own ticket, though. 
However, ultimately KeywordTokenStream may be useful to others and could be 
considered for adding to the repo.

The third patch finally fixes ConcatenatingTokenStream by storing and setting a 
finalOffset as the last task in the end() method, and resetting currentSource, 
offsetIncrement and finalOffset when reset() is called.






[jira] [Updated] (SOLR-12328) Adding graph json facet domain change

2018-05-07 Thread Daniel Meehl (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Meehl updated SOLR-12328:

Attachment: SOLR-12328.patch

> Adding graph json facet domain change
> -
>
> Key: SOLR-12328
> URL: https://issues.apache.org/jira/browse/SOLR-12328
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public (Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 7.3
>Reporter: Daniel Meehl
>Priority: Major
> Attachments: SOLR-12328.patch
>
>
> JSON facets now support join queries via domain changes. I've made a 
> relatively small enhancement to add graph queries to the mix. I'll attach a 
> patch for review. I'm hoping this can be merged into Solr proper. Please let 
> me know if there are any problems/changes/requirements. 






[jira] [Created] (SOLR-12328) Adding graph json facet domain change

2018-05-07 Thread Daniel Meehl (JIRA)
Daniel Meehl created SOLR-12328:
---

 Summary: Adding graph json facet domain change
 Key: SOLR-12328
 URL: https://issues.apache.org/jira/browse/SOLR-12328
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Facet Module
Affects Versions: 7.3
Reporter: Daniel Meehl


JSON facets now support join queries via domain changes. I've made a relatively 
small enhancement to add graph queries to the mix. I'll attach a patch for 
review. I'm hoping this can be merged into Solr proper. Please let me know if 
there are any problems/changes/requirements. 
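
Hypothetically, assuming the patch mirrors the existing {{join}} domain-change 
syntax, a request might look like this (field names are purely illustrative):

{code}
json.facet={
  related_terms: {
    type: terms,
    field: category,
    domain: { graph: { from: parent_id, to: id } }
  }
}
{code}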


