[jira] [Commented] (LUCENE-8945) Allow to change the output file delimiter on Luke "export terms" feature

2019-09-10 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926509#comment-16926509
 ] 

Tomoko Uchida commented on LUCENE-8945:
---

A dropdown box looks fine to me. Should we set the default delimiter to comma?

> Allow to change the output file delimiter on Luke "export terms" feature
> 
>
> Key: LUCENE-8945
> URL: https://issues.apache.org/jira/browse/LUCENE-8945
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Priority: Minor
> Attachments: luke_export_delimiter.png
>
>
> This is a follow-up issue for LUCENE-8764.
> The current delimiter is fixed to "," (comma), but terms can also contain 
> commas, and these are not escaped. It would be better if the delimiter could 
> be changed to a tab or whitespace when exporting.
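
A minimal sketch of the selectable delimiter described above, assuming each exported row is a term followed by its document frequency. The class TermExportSketch, the Delimiter enum and formatLine are hypothetical names for illustration; this is not Luke's actual code.

```java
// Hypothetical sketch; not the Luke implementation.
final class TermExportSketch {

  /** Delimiters a user could pick from a dropdown (names are illustrative). */
  enum Delimiter {
    COMMA(","), TAB("\t"), WHITESPACE(" ");

    final String separator;

    Delimiter(String separator) {
      this.separator = separator;
    }
  }

  /** Formats one row as term, delimiter, document frequency. No escaping is applied. */
  static String formatLine(String term, int docFreq, Delimiter delimiter) {
    return term + delimiter.separator + docFreq;
  }

  public static void main(String[] args) {
    // A term containing a comma is ambiguous with COMMA but not with TAB.
    System.out.println(formatLine("a,b", 3, Delimiter.COMMA));
    System.out.println(formatLine("a,b", 3, Delimiter.TAB));
  }
}
```

With the comma delimiter, a term such as a,b yields a row that cannot be split back into two columns reliably; a tab keeps the columns distinguishable as long as terms contain no tabs.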



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8969) Fix abusive usage of assert in ArrayUtil

2019-09-06 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8969:
--
Summary: Fix abusive usage of assert in ArrayUtil  (was: Fix abusive usage 
of asset in ArrayUtil)

> Fix abusive usage of assert in ArrayUtil
> 
>
> Key: LUCENE-8969
> URL: https://issues.apache.org/jira/browse/LUCENE-8969
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Methods in {{o.a.l.util.ArrayUtil}} use {{assert}} statements for argument 
> checks.
> Would it be more suitable to throw {{IllegalArgumentException}}s instead of 
> assertions here, to improve traceability when violations occur? Sometimes 
> I had difficulty identifying the cause of assertion errors.
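
A minimal sketch contrasting the two styles of argument check discussed above. The class ArgumentCheckSketch and both method names are hypothetical and only mimic the shape of an array-growing helper; they are not the ArrayUtil code.

```java
import java.util.Arrays;

// Hypothetical sketch; not the actual ArrayUtil implementation.
final class ArgumentCheckSketch {

  /**
   * Check with assert: evaluated only when the JVM runs with -ea.
   * With assertions disabled, a too-small newLength silently truncates the array.
   */
  static int[] growWithAssert(int[] array, int newLength) {
    assert newLength >= array.length : "newLength must be >= array.length";
    return Arrays.copyOf(array, newLength);
  }

  /** Check with an exception: always evaluated, and the message names the offending values. */
  static int[] growWithException(int[] array, int newLength) {
    if (newLength < array.length) {
      throw new IllegalArgumentException(
          "newLength (" + newLength + ") must be >= array.length (" + array.length + ")");
    }
    return Arrays.copyOf(array, newLength);
  }
}
```

The exception variant fails the same way in tests and in production, and its message points at the bad argument directly.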






[jira] [Updated] (LUCENE-8969) Fix abusive usage of asset in ArrayUtil

2019-09-06 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8969:
--
Summary: Fix abusive usage of asset in ArrayUtil  (was: Fix abusive usage 
of asset in ArrayUtils)

> Fix abusive usage of asset in ArrayUtil
> ---
>
> Key: LUCENE-8969
> URL: https://issues.apache.org/jira/browse/LUCENE-8969
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Methods in {{o.a.l.util.ArrayUtil}} use {{assert}} statements for argument 
> checks.
> Would it be more suitable to throw {{IllegalArgumentException}}s instead of 
> assertions here, to improve traceability when violations occur? Sometimes 
> I had difficulty identifying the cause of assertion errors.






[jira] [Created] (LUCENE-8969) Fix abusive usage of asset in ArrayUtils

2019-09-06 Thread Tomoko Uchida (Jira)
Tomoko Uchida created LUCENE-8969:
-

 Summary: Fix abusive usage of asset in ArrayUtils
 Key: LUCENE-8969
 URL: https://issues.apache.org/jira/browse/LUCENE-8969
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Tomoko Uchida


Methods in {{o.a.l.util.ArrayUtil}} use {{assert}} statements for argument 
checks.
Would it be more suitable to throw {{IllegalArgumentException}}s instead of 
assertions here, to improve traceability when violations occur? Sometimes I 
had difficulty identifying the cause of assertion errors.






[jira] [Commented] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-09-01 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920533#comment-16920533
 ] 

Tomoko Uchida commented on SOLR-13690:
--

I will fix the test resources.

> Migrate field type configurations in default/example schema files to look up 
> factories by "name"
> 
>
> Key: SOLR-13690
> URL: https://issues.apache.org/jira/browse/SOLR-13690
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13690.patch, SOLR-13690.patch, Screenshot from 
> 2019-08-30 01-09-43.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should migrate all managed-schema files bundled with Solr.
> There are 8 managed-schemas (except for test resources) in solr.
> {code}
> lucene-solr-mirror $ find solr -name "managed-schema" | grep -v test
> solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema
> solr/server/solr/configsets/_default/conf/managed-schema
> solr/example/files/conf/managed-schema
> solr/example/example-DIH/solr/solr/conf/managed-schema
> solr/example/example-DIH/solr/db/conf/managed-schema
> solr/example/example-DIH/solr/atom/conf/managed-schema
> solr/example/example-DIH/solr/mail/conf/managed-schema
> solr/example/example-DIH/solr/tika/conf/managed-schema
> {code}






[jira] [Updated] (LUCENE-8778) Define analyzer SPI names as static final fields and document the names in Javadocs

2019-09-01 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8778:
--
Fix Version/s: 8.3

> Define analyzer SPI names as static final fields and document the names in 
> Javadocs
> ---
>
> Key: LUCENE-8778
> URL: https://issues.apache.org/jira/browse/LUCENE-8778
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8778-koreanNumber.patch, 
> LUCENE-8778-migrate-note.patch, ListAnalysisComponents.java, 
> SPINamesGenerator.java, Screenshot from 2019-04-26 02-17-48.png, Screenshot 
> from 2019-05-25 23-25-24.png, TestSPINames.java
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Each built-in analysis component (factory of tokenizer / char filter / token 
> filter) has an SPI name, but currently this is not documented anywhere.
> The goals of this issue:
>  * Define SPI names as static final field for each analysis component so that 
> users can get the component by name (via {{NAME}} static field.) This also 
> provides compile time safety.
>  * Officially document the SPI names in Javadocs.
>  * Add proper source validation rules to ant {{validate-source-patterns}} 
> target so that we can make sure that all analysis components have correct 
> field definitions and documentation
> and,
>  * Look up SPI names via the new {{NAME}} fields instead of deriving them 
> from class names.
> (Just for quick reference) we now have:
>  * *19* Tokenizers ({{TokenizerFactory.availableTokenizers()}})
>  * *6* CharFilters ({{CharFilterFactory.availableCharFilters()}})
>  * *118* TokenFilters ({{TokenFilterFactory.availableTokenFilters()}})
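
A minimal sketch of the NAME-field idea from the goals above: each factory declares its SPI name as a static final field, and a lookup table is built from those fields rather than from class names. The two nested classes are stand-ins with illustrative NAME values, and SpiNameSketch, nameOf and buildRegistry are hypothetical names; this is not Lucene's SPI loader.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the nested classes only stand in for real analysis factories.
final class SpiNameSketch {

  static final class WhitespaceTokenizerFactory {
    public static final String NAME = "whitespace";
  }

  static final class LowerCaseFilterFactory {
    public static final String NAME = "lowercase";
  }

  /** Reads the public static NAME field of a factory class via reflection. */
  static String nameOf(Class<?> factoryClass) {
    try {
      Field field = factoryClass.getField("NAME");
      return (String) field.get(null);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(
          factoryClass.getName() + " has no accessible NAME field", e);
    }
  }

  /** Builds a name-to-class table from the NAME fields instead of parsing class names. */
  static Map<String, Class<?>> buildRegistry(List<Class<?>> factories) {
    Map<String, Class<?>> registry = new HashMap<>();
    for (Class<?> factory : factories) {
      registry.put(nameOf(factory), factory);
    }
    return registry;
  }
}
```

Code that refers to WhitespaceTokenizerFactory.NAME directly is checked by the compiler, which is the compile-time safety mentioned in the first goal.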






[jira] [Resolved] (SOLR-13691) Add example field type configurations using "name" attributes to Ref Guide

2019-08-31 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved SOLR-13691.
--
  Assignee: Tomoko Uchida
Resolution: Fixed

> Add example field type configurations using "name" attributes to Ref Guide
> --
>
> Key: SOLR-13691
> URL: https://issues.apache.org/jira/browse/SOLR-13691
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13691.patch, SOLR-13691.patch, Screenshot from 
> 2019-08-30 14-19-01.png, Screenshot from 2019-08-30 14-19-09.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should add examples that include "name" instead of "class" (and mark 
> the old examples as "Legacy").






[jira] [Resolved] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-08-31 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved SOLR-13690.
--
  Assignee: Tomoko Uchida
Resolution: Fixed

> Migrate field type configurations in default/example schema files to look up 
> factories by "name"
> 
>
> Key: SOLR-13690
> URL: https://issues.apache.org/jira/browse/SOLR-13690
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13690.patch, SOLR-13690.patch, Screenshot from 
> 2019-08-30 01-09-43.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should migrate all managed-schema files bundled with Solr.
> There are 8 managed-schemas (except for test resources) in solr.
> {code}
> lucene-solr-mirror $ find solr -name "managed-schema" | grep -v test
> solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema
> solr/server/solr/configsets/_default/conf/managed-schema
> solr/example/files/conf/managed-schema
> solr/example/example-DIH/solr/solr/conf/managed-schema
> solr/example/example-DIH/solr/db/conf/managed-schema
> solr/example/example-DIH/solr/atom/conf/managed-schema
> solr/example/example-DIH/solr/mail/conf/managed-schema
> solr/example/example-DIH/solr/tika/conf/managed-schema
> {code}






[jira] [Updated] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-08-31 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13690:
-
Attachment: SOLR-13690.patch

> Migrate field type configurations in default/example schema files to look up 
> factories by "name"
> 
>
> Key: SOLR-13690
> URL: https://issues.apache.org/jira/browse/SOLR-13690
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13690.patch, SOLR-13690.patch, Screenshot from 
> 2019-08-30 01-09-43.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should migrate all managed-schema files bundled with Solr.
> There are 8 managed-schemas (except for test resources) in solr.
> {code}
> lucene-solr-mirror $ find solr -name "managed-schema" | grep -v test
> solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema
> solr/server/solr/configsets/_default/conf/managed-schema
> solr/example/files/conf/managed-schema
> solr/example/example-DIH/solr/solr/conf/managed-schema
> solr/example/example-DIH/solr/db/conf/managed-schema
> solr/example/example-DIH/solr/atom/conf/managed-schema
> solr/example/example-DIH/solr/mail/conf/managed-schema
> solr/example/example-DIH/solr/tika/conf/managed-schema
> {code}






[jira] [Commented] (SOLR-13691) Add example field type configurations using "name" attributes to Ref Guide

2019-08-30 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919672#comment-16919672
 ] 

Tomoko Uchida commented on SOLR-13691:
--

Updated the patch. I will commit it to master shortly.

> Add example field type configurations using "name" attributes to Ref Guide
> --
>
> Key: SOLR-13691
> URL: https://issues.apache.org/jira/browse/SOLR-13691
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13691.patch, SOLR-13691.patch, Screenshot from 
> 2019-08-30 14-19-01.png, Screenshot from 2019-08-30 14-19-09.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should add examples that include "name" instead of "class" (and mark 
> the old examples as "Legacy").






[jira] [Updated] (SOLR-13691) Add example field type configurations using "name" attributes to Ref Guide

2019-08-30 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13691:
-
Attachment: SOLR-13691.patch

> Add example field type configurations using "name" attributes to Ref Guide
> --
>
> Key: SOLR-13691
> URL: https://issues.apache.org/jira/browse/SOLR-13691
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13691.patch, SOLR-13691.patch, Screenshot from 
> 2019-08-30 14-19-01.png, Screenshot from 2019-08-30 14-19-09.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should add examples that include "name" instead of "class" (and mark 
> the old examples as "Legacy").






[jira] [Commented] (SOLR-13691) Add example field type configurations using "name" attributes to Ref Guide

2019-08-30 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919243#comment-16919243
 ] 

Tomoko Uchida commented on SOLR-13691:
--

Here is the patch [^SOLR-13691.patch] including only changes for 
{{analyzers.adoc}}. Other files - {{tokenizers.adoc}}, 
{{filter-descriptions.adoc}}, and so on - would need to be updated in a similar 
way.
Is there a good way to convert them automatically?
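
One possible approach is a regex rewrite, sketched below under a loud assumption: that the SPI name equals the class name without its factory suffix, lower-cased. The official NAME values in the Javadocs may use different casing or differ in other ways, so any output would need to be checked against them. ClassToNameSketch and convert are hypothetical names.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a bulk rewrite; derived names are a guess, not the official NAME values.
final class ClassToNameSketch {

  // Matches class="solr.<Base><Suffix>" for the three kinds of analysis factory.
  // The reluctant \w+? keeps <Base> as short as possible, so "CharFilterFactory"
  // is preferred over "FilterFactory" when both could match.
  private static final Pattern FACTORY_CLASS = Pattern.compile(
      "class=\"solr\\.(\\w+?)(TokenizerFactory|CharFilterFactory|FilterFactory)\"");

  /** Replaces each matching class attribute with a name attribute; other text is untouched. */
  static String convert(String xml) {
    Matcher matcher = FACTORY_CLASS.matcher(xml);
    StringBuffer out = new StringBuffer();
    while (matcher.find()) {
      // Assumption: SPI name == lower-cased base of the class name.
      String name = matcher.group(1).toLowerCase(Locale.ROOT);
      matcher.appendReplacement(out, Matcher.quoteReplacement("name=\"" + name + "\""));
    }
    matcher.appendTail(out);
    return out.toString();
  }
}
```

Attributes such as class="solr.TextField" do not end in a factory suffix and are therefore left as they are.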

> Add example field type configurations using "name" attributes to Ref Guide
> --
>
> Key: SOLR-13691
> URL: https://issues.apache.org/jira/browse/SOLR-13691
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13691.patch, Screenshot from 2019-08-30 
> 14-19-01.png, Screenshot from 2019-08-30 14-19-09.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should add examples that include "name" instead of "class" (and mark 
> the old examples as "Legacy").






[jira] [Updated] (SOLR-13691) Add example field type configurations using "name" attributes to Ref Guide

2019-08-30 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13691:
-
Attachment: SOLR-13691.patch

> Add example field type configurations using "name" attributes to Ref Guide
> --
>
> Key: SOLR-13691
> URL: https://issues.apache.org/jira/browse/SOLR-13691
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13691.patch, Screenshot from 2019-08-30 
> 14-19-01.png, Screenshot from 2019-08-30 14-19-09.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should add examples that include "name" instead of "class" (and mark 
> the old examples as "Legacy").






[jira] [Updated] (SOLR-13691) Add example field type configurations using "name" attributes to Ref Guide

2019-08-29 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13691:
-
Fix Version/s: master (9.0)

> Add example field type configurations using "name" attributes to Ref Guide
> --
>
> Key: SOLR-13691
> URL: https://issues.apache.org/jira/browse/SOLR-13691
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: Screenshot from 2019-08-30 14-19-01.png, Screenshot from 
> 2019-08-30 14-19-09.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should add examples that include "name" instead of "class" (and mark 
> the old examples as "Legacy").






[jira] [Commented] (SOLR-13593) Allow to look-up analyzer components by their SPI names in field type configuration

2019-08-29 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919212#comment-16919212
 ] 

Tomoko Uchida commented on SOLR-13593:
--

I have started working on the Ref Guide (SOLR-13691). Though I'm not an expert on 
AsciiDoc/asciidoctor, it seems we can place "name=" examples for all analyzer 
documentation while keeping "class=" examples as is, with dynamic tabs (please 
see the screenshots on the issue).

> Allow to look-up analyzer components by their SPI names in field type 
> configuration
> ---
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13593-add-spi-ReversedWildcardFilterFactory.patch, 
> SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}
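
A minimal sketch of accepting either attribute, as the description above proposes. Everything here is hypothetical: the class FactoryLookupSketch, the one-entry table, and in particular the choice to reject configurations that give both "name" and "class" are assumptions made for illustration, not Solr's actual plugin-loading behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not Solr's plugin-loading code.
final class FactoryLookupSketch {

  // Illustrative table: SPI name -> legacy class attribute value.
  private static final Map<String, String> CLASS_BY_NAME = new HashMap<>();

  static {
    CLASS_BY_NAME.put("whitespace", "solr.WhitespaceTokenizerFactory");
  }

  /** Returns the class attribute value to load, accepting either "name" or "class". */
  static String resolve(Map<String, String> attributes) {
    String name = attributes.get("name");
    String className = attributes.get("class");
    if (name != null && className != null) {
      // Assumption: giving both is treated as a configuration error.
      throw new IllegalArgumentException("Specify either name or class, not both");
    }
    if (name != null) {
      String resolved = CLASS_BY_NAME.get(name);
      if (resolved == null) {
        throw new IllegalArgumentException("Unknown SPI name: " + name);
      }
      return resolved;
    }
    if (className == null) {
      throw new IllegalArgumentException("Either name or class is required");
    }
    return className;
  }
}
```

Both name="whitespace" and class="solr.WhitespaceTokenizerFactory" resolve to the same factory in this sketch, so existing "class" configurations keep working.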






[jira] [Commented] (SOLR-13691) Add example field type configurations using "name" attributes to Ref Guide

2019-08-29 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919202#comment-16919202
 ] 

Tomoko Uchida commented on SOLR-13691:
--

We can add "name" examples along with "class" examples to the Ref Guide's 
Analyzer section, with dynamic tabs like this:

 

*"name=" example*

 

!Screenshot from 2019-08-30 14-19-01.png!

 

*"class=" example*


 !Screenshot from 2019-08-30 14-19-09.png!

> Add example field type configurations using "name" attributes to Ref Guide
> --
>
> Key: SOLR-13691
> URL: https://issues.apache.org/jira/browse/SOLR-13691
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Priority: Major
> Attachments: Screenshot from 2019-08-30 14-19-01.png, Screenshot from 
> 2019-08-30 14-19-09.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should add examples that include "name" instead of "class" (and mark 
> the old examples as "Legacy").






[jira] [Updated] (SOLR-13691) Add example field type configurations using "name" attributes to Ref Guide

2019-08-29 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13691:
-
Attachment: Screenshot from 2019-08-30 14-19-01.png

> Add example field type configurations using "name" attributes to Ref Guide
> --
>
> Key: SOLR-13691
> URL: https://issues.apache.org/jira/browse/SOLR-13691
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Priority: Major
> Attachments: Screenshot from 2019-08-30 14-19-01.png, Screenshot from 
> 2019-08-30 14-19-09.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should add examples that include "name" instead of "class" (and mark 
> the old examples as "Legacy").






[jira] [Updated] (SOLR-13691) Add example field type configurations using "name" attributes to Ref Guide

2019-08-29 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13691:
-
Attachment: Screenshot from 2019-08-30 14-19-09.png

> Add example field type configurations using "name" attributes to Ref Guide
> --
>
> Key: SOLR-13691
> URL: https://issues.apache.org/jira/browse/SOLR-13691
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Tomoko Uchida
>Priority: Major
> Attachments: Screenshot from 2019-08-30 14-19-01.png, Screenshot from 
> 2019-08-30 14-19-09.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should add examples that include "name" instead of "class" (and mark 
> the old examples as "Legacy").






[jira] [Commented] (SOLR-13593) Allow to look-up analyzer components by their SPI names in field type configuration

2019-08-29 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919126#comment-16919126
 ] 

Tomoko Uchida commented on SOLR-13593:
--

FYI, there is another follow-up task, SOLR-13691, to add examples using 
"name=...". That would be another way to find the "name"s. In other words, the 
way to find each factory's identifier (whether "name" or "class") would be 
the same as before: the Ref Guide or the bundled schema examples. 

> Allow to look-up analyzer components by their SPI names in field type 
> configuration
> ---
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13593-add-spi-ReversedWildcardFilterFactory.patch, 
> SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Commented] (SOLR-13593) Allow to look-up analyzer components by their SPI names in field type configuration

2019-08-29 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919094#comment-16919094
 ] 

Tomoko Uchida commented on SOLR-13593:
--

The (SPI) names for all factories are already documented in the Javadocs (that 
was the motivation for LUCENE-8778). I think we can add a note to the Ref 
Guide on where to find the "name"s.

> Allow to look-up analyzer components by their SPI names in field type 
> configuration
> ---
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13593-add-spi-ReversedWildcardFilterFactory.patch, 
> SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Commented] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-08-29 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918752#comment-16918752
 ] 

Tomoko Uchida commented on SOLR-13690:
--

Solr users will notice the changes when they open the Admin UI Files menu. For 
example:

!Screenshot from 2019-08-30 01-09-43.png!

> Migrate field type configurations in default/example schema files to look up 
> factories by "name"
> 
>
> Key: SOLR-13690
> URL: https://issues.apache.org/jira/browse/SOLR-13690
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13690.patch, Screenshot from 2019-08-30 01-09-43.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should migrate all managed-schema files bundled with Solr.
> There are 8 managed-schemas (except for test resources) in solr.
> {code}
> lucene-solr-mirror $ find solr -name "managed-schema" | grep -v test
> solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema
> solr/server/solr/configsets/_default/conf/managed-schema
> solr/example/files/conf/managed-schema
> solr/example/example-DIH/solr/solr/conf/managed-schema
> solr/example/example-DIH/solr/db/conf/managed-schema
> solr/example/example-DIH/solr/atom/conf/managed-schema
> solr/example/example-DIH/solr/mail/conf/managed-schema
> solr/example/example-DIH/solr/tika/conf/managed-schema
> {code}






[jira] [Updated] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-08-29 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13690:
-
Attachment: Screenshot from 2019-08-30 01-09-43.png

> Migrate field type configurations in default/example schema files to look up 
> factories by "name"
> 
>
> Key: SOLR-13690
> URL: https://issues.apache.org/jira/browse/SOLR-13690
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13690.patch, Screenshot from 2019-08-30 01-09-43.png
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should migrate all managed-schema files bundled with Solr.
> There are 8 managed-schemas (except for test resources) in solr.
> {code}
> lucene-solr-mirror $ find solr -name "managed-schema" | grep -v test
> solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema
> solr/server/solr/configsets/_default/conf/managed-schema
> solr/example/files/conf/managed-schema
> solr/example/example-DIH/solr/solr/conf/managed-schema
> solr/example/example-DIH/solr/db/conf/managed-schema
> solr/example/example-DIH/solr/atom/conf/managed-schema
> solr/example/example-DIH/solr/mail/conf/managed-schema
> solr/example/example-DIH/solr/tika/conf/managed-schema
> {code}
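
For anyone who wants to reproduce the listing above without a shell, here is a minimal Java sketch of the same search. It is only an illustration and not part of any patch: the class name ManagedSchemaFinder and the demoTree() helper are made up for this example. Unlike "find -name", it reports regular files only, and it matches "test" against the path relative to the search root.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ManagedSchemaFinder {

    /**
     * Returns the paths (relative to root, '/'-separated) of all regular files
     * named "managed-schema" whose relative path does not contain "test".
     */
    static List<String> find(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().equals("managed-schema"))
                .map(p -> root.relativize(p).toString().replace('\\', '/'))
                .filter(rel -> !rel.contains("test"))   // same idea as "grep -v test"
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny throwaway tree (hypothetical layout) so the sketch runs without a checkout. */
    static Path demoTree() {
        try {
            Path root = Files.createTempDirectory("schemas");
            String[] dirs = {"configsets/_default/conf", "core/src/test-files/conf"};
            for (String dir : dirs) {
                Path conf = Files.createDirectories(root.resolve(dir));
                Files.createFile(conf.resolve("managed-schema"));
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints: configsets/_default/conf/managed-schema
        find(demoTree()).forEach(System.out::println);
    }
}
```

Against a real checkout, one would pass the "solr" directory as the root instead of the demo tree.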






[jira] [Commented] (SOLR-13593) Allow to look-up analyzer components by their SPI names in field type configuration

2019-08-29 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918759#comment-16918759
 ] 

Tomoko Uchida commented on SOLR-13593:
--

Hi all,
I attached a patch to SOLR-13690 that changes all default/example schemas 
bundled with Solr.
If there are no objections, I will commit it to master shortly (so it 
will ship with Solr 9.0, and users will notice the changes soon after the 
first run).

> Allow to look-up analyzer components by their SPI names in field type 
> configuration
> ---
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13593-add-spi-ReversedWildcardFilterFactory.patch, 
> SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}
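
The idea described above (resolve an analysis component either by its SPI name or by its class) can be sketched roughly as follows. This is not Solr's actual loading code: the class SpiNameLookupSketch, the registry map, and the rule that "name" takes precedence over "class" are assumptions made purely for illustration. Only the pair name="whitespace" / class="solr.WhitespaceTokenizerFactory" is taken from the description.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class SpiNameLookupSketch {

    // Hypothetical registry: concise SPI name -> class attribute value used in schemas.
    private static final Map<String, String> NAME_TO_CLASS = new HashMap<>();
    static {
        NAME_TO_CLASS.put("whitespace", "solr.WhitespaceTokenizerFactory");
    }

    /**
     * Resolves the factory for one schema element from its attributes.
     * Assumption of this sketch: "name" wins if present, otherwise "class" is used as-is.
     */
    static String resolve(Map<String, String> attrs) {
        String name = attrs.get("name");
        if (name != null) {
            String resolved = NAME_TO_CLASS.get(name);
            if (resolved == null) {
                throw new IllegalArgumentException("Unknown SPI name: " + name);
            }
            return resolved;
        }
        String clazz = attrs.get("class");
        if (clazz == null) {
            throw new IllegalArgumentException("Either \"name\" or \"class\" is required");
        }
        return clazz;
    }

    public static void main(String[] args) {
        // Both lines print: solr.WhitespaceTokenizerFactory
        System.out.println(resolve(Collections.singletonMap("name", "whitespace")));
        System.out.println(resolve(Collections.singletonMap("class", "solr.WhitespaceTokenizerFactory")));
    }
}
```

The point of the sketch is only that both spellings end up at the same factory, so a schema can switch to the shorter form without changing behavior.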






[jira] [Commented] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-08-29 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918742#comment-16918742
 ] 

Tomoko Uchida commented on SOLR-13690:
--

This patch [^SOLR-13690.patch] changes all "class=" attributes to "name=" in 
the bundled default/example schemas.
 I tested the schemas by packaging Solr and manually creating fresh cores from 
the configsets.
{code}
# techproducts example
./solr-9.0.0-SNAPSHOT/bin/solr -e techproducts

# _default schema
./solr-9.0.0-SNAPSHOT/bin/solr create -c newcore1

# example/files
./solr-9.0.0-SNAPSHOT/bin/solr create -c newcore2 -d solr-9.0.0-SNAPSHOT/example/files/conf/

# example-DIH (solr)
./solr-9.0.0-SNAPSHOT/bin/solr create -c newcore3 -d solr-9.0.0-SNAPSHOT/example/example-DIH/solr/solr/conf/

# example-DIH (db)
./solr-9.0.0-SNAPSHOT/bin/solr create -c newcore4 -d solr-9.0.0-SNAPSHOT/example/example-DIH/solr/db/conf/

# example-DIH (atom)
./solr-9.0.0-SNAPSHOT/bin/solr create -c newcore5 -d solr-9.0.0-SNAPSHOT/example/example-DIH/solr/atom/conf/

# example-DIH (mail)
./solr-9.0.0-SNAPSHOT/bin/solr create -c newcore6 -d solr-9.0.0-SNAPSHOT/example/example-DIH/solr/mail/conf/

# example-DIH (tika)
./solr-9.0.0-SNAPSHOT/bin/solr create -c newcore7 -d solr-9.0.0-SNAPSHOT/example/example-DIH/solr/tika/conf/
{code}
They all work fine for me.

> Migrate field type configurations in default/example schema files to look up 
> factories by "name"
> 
>
> Key: SOLR-13690
> URL: https://issues.apache.org/jira/browse/SOLR-13690
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13690.patch
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should migrate all managed-schema files bundled with Solr.
> There are 8 managed-schemas (except for test resources) in solr.
> {code}
> lucene-solr-mirror $ find solr -name "managed-schema" | grep -v test
> solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema
> solr/server/solr/configsets/_default/conf/managed-schema
> solr/example/files/conf/managed-schema
> solr/example/example-DIH/solr/solr/conf/managed-schema
> solr/example/example-DIH/solr/db/conf/managed-schema
> solr/example/example-DIH/solr/atom/conf/managed-schema
> solr/example/example-DIH/solr/mail/conf/managed-schema
> solr/example/example-DIH/solr/tika/conf/managed-schema
> {code}






[jira] [Updated] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-08-29 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13690:
-
Attachment: SOLR-13690.patch

> Migrate field type configurations in default/example schema files to look up 
> factories by "name"
> 
>
> Key: SOLR-13690
> URL: https://issues.apache.org/jira/browse/SOLR-13690
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13690.patch
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should migrate all managed-schema files bundled with Solr.
> There are 8 managed-schemas (except for test resources) in solr.
> {code}
> lucene-solr-mirror $ find solr -name "managed-schema" | grep -v test
> solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema
> solr/server/solr/configsets/_default/conf/managed-schema
> solr/example/files/conf/managed-schema
> solr/example/example-DIH/solr/solr/conf/managed-schema
> solr/example/example-DIH/solr/db/conf/managed-schema
> solr/example/example-DIH/solr/atom/conf/managed-schema
> solr/example/example-DIH/solr/mail/conf/managed-schema
> solr/example/example-DIH/solr/tika/conf/managed-schema
> {code}






[jira] [Updated] (SOLR-13593) Allow to look-up analyzer components by their SPI names in field type configuration

2019-08-29 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13593:
-
Attachment: SOLR-13593-add-spi-ReversedWildcardFilterFactory.patch

> Allow to look-up analyzer components by their SPI names in field type 
> configuration
> ---
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13593-add-spi-ReversedWildcardFilterFactory.patch, 
> SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Updated] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-08-28 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13690:
-
Fix Version/s: master (9.0)

> Migrate field type configurations in default/example schema files to look up 
> factories by "name"
> 
>
> Key: SOLR-13690
> URL: https://issues.apache.org/jira/browse/SOLR-13690
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should migrate all managed-schema files bundled with Solr.
> There are 8 managed-schemas (except for test resources) in solr.
> {code}
> lucene-solr-mirror $ find solr -name "managed-schema" | grep -v test
> solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema
> solr/server/solr/configsets/_default/conf/managed-schema
> solr/example/files/conf/managed-schema
> solr/example/example-DIH/solr/solr/conf/managed-schema
> solr/example/example-DIH/solr/db/conf/managed-schema
> solr/example/example-DIH/solr/atom/conf/managed-schema
> solr/example/example-DIH/solr/mail/conf/managed-schema
> solr/example/example-DIH/solr/tika/conf/managed-schema
> {code}






[jira] [Updated] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-08-28 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13690:
-
Description: 
This is a follow-up task for SOLR-13593.

To encourage users to use the "name" attribute in field type configurations, we 
should migrate all managed-schema files bundled with Solr.

There are 8 managed-schemas (except for test resources) in solr.

{code}
lucene-solr-mirror $ find solr -name "managed-schema" | grep -v test
solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema
solr/server/solr/configsets/_default/conf/managed-schema
solr/example/files/conf/managed-schema
solr/example/example-DIH/solr/solr/conf/managed-schema
solr/example/example-DIH/solr/db/conf/managed-schema
solr/example/example-DIH/solr/atom/conf/managed-schema
solr/example/example-DIH/solr/mail/conf/managed-schema
solr/example/example-DIH/solr/tika/conf/managed-schema
{code}

  was:
This is a follow-up task for SOLR-13593.

To encourage users to use the "name" attribute in field type configurations, we 
should migrate all managed-schema files bundled with Solr.


> Migrate field type configurations in default/example schema files to look up 
> factories by "name"
> 
>
> Key: SOLR-13690
> URL: https://issues.apache.org/jira/browse/SOLR-13690
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should migrate all managed-schema files bundled with Solr.
> There are 8 managed-schemas (except for test resources) in solr.
> {code}
> lucene-solr-mirror $ find solr -name "managed-schema" | grep -v test
> solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema
> solr/server/solr/configsets/_default/conf/managed-schema
> solr/example/files/conf/managed-schema
> solr/example/example-DIH/solr/solr/conf/managed-schema
> solr/example/example-DIH/solr/db/conf/managed-schema
> solr/example/example-DIH/solr/atom/conf/managed-schema
> solr/example/example-DIH/solr/mail/conf/managed-schema
> solr/example/example-DIH/solr/tika/conf/managed-schema
> {code}






[jira] [Commented] (LUCENE-8566) Deprecate methods in CustomAnalyzer.Builder which take factory classes

2019-08-28 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917578#comment-16917578
 ] 

Tomoko Uchida commented on LUCENE-8566:
---

The Javadoc example was updated in LUCENE-8957.

> Deprecate methods in CustomAnalyzer.Builder which take factory classes
> --
>
> Key: LUCENE-8566
> URL: https://issues.apache.org/jira/browse/LUCENE-8566
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Uwe Schindler
>Priority: Minor
>
> CustomAnalyzer.Builder has methods which take implementation classes as 
> follows.
>  - withTokenizer(Class<? extends TokenizerFactory> factory, String... params)
>  - withTokenizer(Class<? extends TokenizerFactory> factory, Map<String,String> params)
>  - addTokenFilter(Class<? extends TokenFilterFactory> factory, String... params)
>  - addTokenFilter(Class<? extends TokenFilterFactory> factory, Map<String,String> params)
>  - addCharFilter(Class<? extends CharFilterFactory> factory, String... params)
>  - addCharFilter(Class<? extends CharFilterFactory> factory, Map<String,String> params)
> Since the builder also has methods which take service names, it seems that 
> the above methods are unnecessary and a little misleading. Giving 
> symbolic names is preferable to implementation factory classes, but for now, 
> users can write code depending on implementation classes.
> What do you think about deprecating those methods (adding {{@Deprecated}} 
> annotations) and deleting them in future releases? They are called only by 
> test cases, so deleting them should have no impact on the current lucene/solr 
> codebase.
> If this proposal gains your consent, I will create a patch. (Let me know if I 
> missed some point. I'll close it.)






[jira] [Resolved] (LUCENE-8957) Update examples in CustomAnalyzer Javadocs

2019-08-28 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-8957.
---
Fix Version/s: 8.3
   master (9.0)
   Resolution: Fixed

> Update examples in CustomAnalyzer Javadocs
> --
>
> Key: LUCENE-8957
> URL: https://issues.apache.org/jira/browse/LUCENE-8957
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8957.patch
>
>
> CustomAnalyzer Javadocs need to be updated:
> - Remove {{StandardFilterFactory}} from examples
> - Use {{Factory.NAME}} instead of {{Factory.class}} in the examples






[jira] [Updated] (LUCENE-8957) Update examples in CustomAnalyzer Javadocs

2019-08-27 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8957:
--
Attachment: LUCENE-8957.patch

> Update examples in CustomAnalyzer Javadocs
> --
>
> Key: LUCENE-8957
> URL: https://issues.apache.org/jira/browse/LUCENE-8957
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
> Attachments: LUCENE-8957.patch
>
>
> CustomAnalyzer Javadocs need to be updated:
> - Remove {{StandardFilterFactory}} from examples
> - Use {{Factory.NAME}} instead of {{Factory.class}} in the examples






[jira] [Created] (LUCENE-8957) Update examples in CustomAnalyzer Javadocs

2019-08-27 Thread Tomoko Uchida (Jira)
Tomoko Uchida created LUCENE-8957:
-

 Summary: Update examples in CustomAnalyzer Javadocs
 Key: LUCENE-8957
 URL: https://issues.apache.org/jira/browse/LUCENE-8957
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/analysis
Reporter: Tomoko Uchida
Assignee: Tomoko Uchida


CustomAnalyzer Javadocs need to be updated:

- Remove {{StandardFilterFactory}} from examples
- Use {{Factory.NAME}} instead of {{Factory.class}} in the examples






[jira] [Commented] (LUCENE-8945) Allow to change the output file delimiter on Luke "export terms" feature

2019-08-25 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915410#comment-16915410
 ] 

Tomoko Uchida commented on LUCENE-8945:
---

Sure, go ahead!

> Allow to change the output file delimiter on Luke "export terms" feature
> 
>
> Key: LUCENE-8945
> URL: https://issues.apache.org/jira/browse/LUCENE-8945
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Priority: Minor
>
> This is a follow-up issue for LUCENE-8764.
> The current delimiter is fixed to "," (comma), but terms can also include 
> commas, and they are not escaped. It would be better if the delimiter could be 
> changed to a tab or whitespace when exporting.
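
To illustrate the problem described in the issue, here is a small self-contained Java sketch. It is not Luke's export code: the class ExportDelimiterSketch, the two-column "term, document frequency" layout and the sample term "a,b" are assumptions chosen only to show why an unescaped comma delimiter is ambiguous while a tab is not (as long as terms contain no tabs).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ExportDelimiterSketch {

    /** Writes one export line as term + delimiter + docFreq, without any escaping. */
    static String formatLine(String term, int docFreq, String delimiter) {
        return term + delimiter + docFreq;
    }

    /** Naive reader: splits the line on the literal delimiter and returns all columns. */
    static List<String> parseLine(String line, String delimiter) {
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        String term = "a,b"; // hypothetical term that itself contains a comma

        // Comma delimiter: the line "a,b,3" splits into 3 columns, so the term is broken apart.
        System.out.println(parseLine(formatLine(term, 3, ","), ",").size());   // prints 3

        // Tab delimiter: the line splits into exactly 2 columns, term and count.
        System.out.println(parseLine(formatLine(term, 3, "\t"), "\t").size()); // prints 2
    }
}
```

A selectable delimiter avoids the ambiguity without having to introduce an escaping scheme, which is presumably why the issue proposes it.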






[jira] [Comment Edited] (SOLR-13452) Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.

2019-08-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914216#comment-16914216
 ] 

Tomoko Uchida edited comment on SOLR-13452 at 8/23/19 2:18 PM:
---

When I looked into inter-module dependency problems in IDEA, I noticed that it 
seems difficult to keep our current IDEA module structure with the Gradle 
IDEA plugin.
 I discarded (refactored) the current structure defined in {{.idea/modules.xml}} 
and instead made use of the default structure generated by the Gradle plugin. 
(The Gradle IDEA plugin works well for me, and I think it's much better than 
maintaining ".iml" files for all Lucene/Solr modules on our own.)

With this change, IDEA users will see:
 1. module names have been changed (e.g., "backward-codecs" is now 
"lucene-backward-codecs", "extraction" is "solr-contrib-extraction", and so on) 
 2. main / test modules have been merged into one module (e.g., "lucene-core" 
and "lucene-core-tests" are now packed into "lucene-core", "solr-core" and 
"solr-core-tests" are packed into "solr-core", and so on).
 3. you might also notice IDEA complaining about some circular dependencies that 
haven't been reported so far (they had been cleverly suppressed by the customized 
module structure), but daily development shouldn't be affected by this.

I think unused IDEA configuration files can be removed when we remove all other 
Ant-related files.

For me the IDEA setup is 90% ready now... still not perfect, some 
fixes/improvements are needed.

[~markrmil...@gmail.com]: when I looked through build files, I noticed a typo.

[https://github.com/apache/lucene-solr/blob/jira/SOLR-13452_gradle_5/lucene/analysis/stempel/build.gradle#L22]

I think {{lucene-analyzers-stemple}} should be {{lucene-analyzers-stempel}} :)


was (Author: tomoko uchida):
When I looked into inter-module dependency problems on IDEA, I noticed that it 
seems difficult to manage to keep our current IDEA module structure with Gradle 
IDEA plugin.
 I discarded (refactored) current structure defined in {{.idea/modules.xml}} 
and instead, made use of the default structure generated by the Gradle plugin. 
(Gradle IDEA plugin works well for me and I think it's much better than 
maintaining ".iml" files for all Lucene/Solr modules on our own.)

By this change IDEA users will see:
 1. module names have been changed (e.g., "backward-codecs" is now 
"lucene-backward-codecs", "extraction" is "solr-contrib-extraction", and so on) 
 2. main / test modules have been merged into one module (e.g., "lucene-core" 
and "lucene-core-tests" are now packed into "lucene-core", "solr-core" and 
"solr-core-tests" are packed into "solr-core", and so on).
 And you might notice 3. IDEA complains some circular dependencies those 
haven't been reported so far (they have been cleverly suppressed by customized 
module structure), but daily development shouldn't be affected by this.

I think unused IDEA configuration files can be removed when we remove all other 
Ant related files.

For me IDEA setup is 90% ready now... still not perfect, some 
fixes/improvements are needed.

> Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.
> -
>
> Key: SOLR-13452
> URL: https://issues.apache.org/jira/browse/SOLR-13452
> Project: Solr
>  Issue Type: Improvement
>  Components: Build
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: gradle-build.pdf
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I took some things from the great work that Dat did in 
> [https://github.com/apache/lucene-solr/tree/jira/gradle] and took the ball a 
> little further.
>  
> When working with gradle in sub modules directly, I recommend 
> [https://github.com/dougborg/gdub]
> This gradle branch uses the following plugin for version locking, version 
> configuration and version consistency across modules: 
> [https://github.com/palantir/gradle-consistent-versions]
>  
> https://github.com/apache/lucene-solr/tree/jira/SOLR-13452_gradle_5






[jira] [Comment Edited] (SOLR-13452) Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.

2019-08-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914216#comment-16914216
 ] 

Tomoko Uchida edited comment on SOLR-13452 at 8/23/19 12:42 PM:


When I looked into inter-module dependency problems in IDEA, I noticed that it 
seems difficult to keep our current IDEA module structure with the Gradle 
IDEA plugin.
 I discarded (refactored) the current structure defined in {{.idea/modules.xml}} 
and instead made use of the default structure generated by the Gradle plugin. 
(The Gradle IDEA plugin works well for me, and I think it's much better than 
maintaining ".iml" files for all Lucene/Solr modules on our own.)

With this change, IDEA users will see:
 1. module names have been changed (e.g., "backward-codecs" is now 
"lucene-backward-codecs", "extraction" is "solr-contrib-extraction", and so on) 
 2. main / test modules have been merged into one module (e.g., "lucene-core" 
and "lucene-core-tests" are now packed into "lucene-core", "solr-core" and 
"solr-core-tests" are packed into "solr-core", and so on).
 3. you might also notice IDEA complaining about some circular dependencies that 
haven't been reported so far (they have been cleverly suppressed by the customized 
module structure), but daily development shouldn't be affected by this.

I think unused IDEA configuration files can be removed when we remove all other 
Ant-related files.

For me the IDEA setup is 90% ready now... still not perfect, some 
fixes/improvements are needed.


was (Author: tomoko uchida):
When I looked into inter-module dependency problems on IDEA, I noticed that it 
seems difficult to manage to keep our current IDEA module structure with Gradle 
IDEA plugin.
 I discarded current structure (defined in {{.idea/modules.xml}}) and instead 
made use of the default structure generated by the Gradle plugin. (Gradle IDEA 
plugin works well for me and I think it's much better than maintaining ".iml" 
files for all Lucene/Solr modules on our own.)

By this change IDEA users will see:
 1. module names have been changed (e.g., "backward-codecs" is now 
"lucene-backward-codecs", "extraction" is "solr-contrib-extraction", and so on) 
 2. main / test modules have been merged into one module (e.g., "lucene-core" 
and "lucene-core-tests" are now packed into "lucene-core", "solr-core" and 
"solr-core-tests" are packed into "solr-core", and so on).
 And you might notice 3. IDEA complains some circular dependencies those 
haven't been reported so far (they have been cleverly suppressed by customized 
module structure), but daily development shouldn't be affected by this.

I think unused IDEA configuration files can be removed when we remove all other 
Ant related files.

For me IDEA setup is 90% ready now... still not perfect, some 
fixes/improvements are needed.

> Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.
> -
>
> Key: SOLR-13452
> URL: https://issues.apache.org/jira/browse/SOLR-13452
> Project: Solr
>  Issue Type: Improvement
>  Components: Build
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: gradle-build.pdf
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I took some things from the great work that Dat did in 
> [https://github.com/apache/lucene-solr/tree/jira/gradle] and took the ball a 
> little further.
>  
> When working with gradle in sub modules directly, I recommend 
> [https://github.com/dougborg/gdub]
> This gradle branch uses the following plugin for version locking, version 
> configuration and version consistency across modules: 
> [https://github.com/palantir/gradle-consistent-versions]
>  
> https://github.com/apache/lucene-solr/tree/jira/SOLR-13452_gradle_5






[jira] [Comment Edited] (SOLR-13452) Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.

2019-08-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914216#comment-16914216
 ] 

Tomoko Uchida edited comment on SOLR-13452 at 8/23/19 12:40 PM:


When I looked into inter-module dependency problems in IDEA, I noticed that it 
seems difficult to keep our current IDEA module structure with the Gradle 
IDEA plugin.
 I discarded the current structure (defined in {{.idea/modules.xml}}) and instead 
made use of the default structure generated by the Gradle plugin. (The Gradle IDEA 
plugin works well for me, and I think it's much better than maintaining ".iml" 
files for all Lucene/Solr modules on our own.)

With this change, IDEA users will see:
 1. module names have been changed (e.g., "backward-codecs" is now 
"lucene-backward-codecs", "extraction" is "solr-contrib-extraction", and so on) 
 2. main / test modules have been merged into one module (e.g., "lucene-core" 
and "lucene-core-tests" are now packed into "lucene-core", "solr-core" and 
"solr-core-tests" are packed into "solr-core", and so on).
 3. you might also notice IDEA complaining about some circular dependencies that 
haven't been reported so far (they have been cleverly suppressed by the customized 
module structure), but daily development shouldn't be affected by this.

I think unused IDEA configuration files can be removed when we remove all other 
Ant-related files.

For me the IDEA setup is 90% ready now... still not perfect, some 
fixes/improvements are needed.


was (Author: tomoko uchida):
When I looked into inter-module dependency problems on IDEA, I noticed that it 
seems difficult to manage to keep our current IDEA module structure with Gradle 
IDEA plugin.
 I discarded current IDEA module structure (defined in {{.idea/modules.xml}}) 
and instead made use of the default structure generated by the Gradle plugin. 
(Gradle IDEA plugin works well for me and I think it's much better than 
maintaining ".iml" files for all Lucene/Solr modules.)

By this change IDEA users will see that 
 1. module names have been changed (e.g., "backward-codecs" is now 
"lucene-backward-codecs", "extraction" is "solr-contrib-extraction", and so on) 
 2. separated main / test modules have been merged into one module (e.g., 
"lucene-core" and "lucene-core-tests" are now packed into "lucene-core", 
"solr-core" and "solr-core-tests" are packed into "solr-core", and so on).
 And you might notice 3. IDEA complains some circular dependency problems those 
haven't been reported so far, but daily development shouldn't be affected by 
this.

For me IDEA setup is 90% ready now... still not perfect, some 
fixes/improvements are needed.

> Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.
> -
>
> Key: SOLR-13452
> URL: https://issues.apache.org/jira/browse/SOLR-13452
> Project: Solr
>  Issue Type: Improvement
>  Components: Build
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: gradle-build.pdf
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I took some things from the great work that Dat did in 
> [https://github.com/apache/lucene-solr/tree/jira/gradle] and took the ball a 
> little further.
>  
> When working with gradle in sub modules directly, I recommend 
> [https://github.com/dougborg/gdub]
> This gradle branch uses the following plugin for version locking, version 
> configuration and version consistency across modules: 
> [https://github.com/palantir/gradle-consistent-versions]
>  
> https://github.com/apache/lucene-solr/tree/jira/SOLR-13452_gradle_5






[jira] [Commented] (SOLR-13452) Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.

2019-08-23 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914216#comment-16914216
 ] 

Tomoko Uchida commented on SOLR-13452:
--

When I looked into inter-module dependency problems in IDEA, I noticed that it 
seems difficult to keep our current IDEA module structure with the Gradle 
IDEA plugin.
 I discarded the current IDEA module structure (defined in {{.idea/modules.xml}}) 
and instead made use of the default structure generated by the Gradle plugin. 
(The Gradle IDEA plugin works well for me, and I think it's much better than 
maintaining ".iml" files for all Lucene/Solr modules.)

With this change, IDEA users will see that 
 1. module names have been changed (e.g., "backward-codecs" is now 
"lucene-backward-codecs", "extraction" is "solr-contrib-extraction", and so on) 
 2. separated main / test modules have been merged into one module (e.g., 
"lucene-core" and "lucene-core-tests" are now packed into "lucene-core", 
"solr-core" and "solr-core-tests" are packed into "solr-core", and so on).
 3. you might also notice IDEA complaining about some circular dependency problems 
that haven't been reported so far, but daily development shouldn't be affected by 
this.

For me the IDEA setup is 90% ready now... still not perfect, some 
fixes/improvements are needed.

> Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.
> -
>
> Key: SOLR-13452
> URL: https://issues.apache.org/jira/browse/SOLR-13452
> Project: Solr
>  Issue Type: Improvement
>  Components: Build
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: gradle-build.pdf
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I took some things from the great work that Dat did in 
> [https://github.com/apache/lucene-solr/tree/jira/gradle] and took the ball a 
> little further.
>  
> When working with gradle in sub modules directly, I recommend 
> [https://github.com/dougborg/gdub]
> This gradle branch uses the following plugin for version locking, version 
> configuration and version consistency across modules: 
> [https://github.com/palantir/gradle-consistent-versions]
>  
> https://github.com/apache/lucene-solr/tree/jira/SOLR-13452_gradle_5






[jira] [Commented] (SOLR-13452) Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.

2019-08-18 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910106#comment-16910106
 ] 

Tomoko Uchida commented on SOLR-13452:
--

Hi,
 I pushed the first version of {{buildSrc/idea/idea.gradle}} to the branch 
{{jira/SOLR-13452_gradle_5}}. (I forgot to include the issue # in the commit 
message, so the notification did not come here.)
 
[https://github.com/apache/lucene-solr/commit/da3654411aca3b9b74b1845f90c03de2e7dc6594]

No other file was changed.

There still remain some problems to be fixed (sometimes the "idea" task fails 
with ConcurrentModificationException, and there are several unresolved 
dependencies on Solr modules when opening the project with IDEA). But generally 
"idea" and "cleanIdea" seem to work. I'll continue to look into the problems. 
Let me know if you notice anything about that commit.

> Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.
> -
>
> Key: SOLR-13452
> URL: https://issues.apache.org/jira/browse/SOLR-13452
> Project: Solr
>  Issue Type: Improvement
>  Components: Build
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: gradle-build.pdf
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I took some things from the great work that Dat did in 
> [https://github.com/apache/lucene-solr/tree/jira/gradle] and took the ball a 
> little further.
>  
> When working with gradle in sub modules directly, I recommend 
> [https://github.com/dougborg/gdub]
> This gradle branch uses the following plugin for version locking, version 
> configuration and version consistency across modules: 
> [https://github.com/palantir/gradle-consistent-versions]
>  
> https://github.com/apache/lucene-solr/tree/jira/SOLR-13452_gradle_5



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)




[jira] [Commented] (LUCENE-8566) Deprecate methods in CustomAnalyzer.Builder which take factory classes

2019-08-17 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909708#comment-16909708
 ] 

Tomoko Uchida commented on LUCENE-8566:
---

I didn't intend to proceed with this issue; I just noticed that the 
CustomAnalyzer Javadoc is the only place where we can promote the factories' 
{{NAME}} static fields (with their usage). Also, the examples still contain 
{{StandardFilterFactory}}, which was already removed, so they should be updated 
anyway. 
 I will open an issue to change the Javadoc.

> Deprecate methods in CustomAnalyzer.Builder which take factory classes
> --
>
> Key: LUCENE-8566
> URL: https://issues.apache.org/jira/browse/LUCENE-8566
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Uwe Schindler
>Priority: Minor
>
> CustomAnalyzer.Builder has methods which take implementation classes as 
> follows.
>  - withTokenizer(Class<? extends TokenizerFactory> factory, String... params)
>  - withTokenizer(Class<? extends TokenizerFactory> factory, 
> Map<String,String> params)
>  - addTokenFilter(Class<? extends TokenFilterFactory> factory, String... 
> params)
>  - addTokenFilter(Class<? extends TokenFilterFactory> factory, 
> Map<String,String> params)
>  - addCharFilter(Class<? extends CharFilterFactory> factory, String... params)
>  - addCharFilter(Class<? extends CharFilterFactory> factory, 
> Map<String,String> params)
> Since the builder also has methods which take service names, it seems that 
> the above methods are unnecessary and a little bit misleading. Giving 
> symbolic names is preferable to implementation factory classes, but for now, 
> users can write code depending on implementation classes.
> What do you think about deprecating those methods (adding {{@Deprecated}} 
> annotations) and deleting them in future releases? Those are called only by 
> test cases, so deleting them should have no impact on the current lucene/solr 
> codebase.
> If this proposal gains your consent, I will create a patch. (Let me know if I 
> missed some point. I'll close it.)






[jira] [Commented] (LUCENE-8566) Deprecate methods in CustomAnalyzer.Builder which take factory classes

2019-08-16 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909498#comment-16909498
 ] 

Tomoko Uchida commented on LUCENE-8566:
---

Hi,
 would this be a good time to change all {{CustomAnalyzer}} Javadoc examples to 
ones with SPI names, to encourage users to use those instead of class names?
 I mean, now we can change this
{code:java}
Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
   .withTokenizer(StandardTokenizerFactory.class)
   .addTokenFilter(StandardFilterFactory.class)
   .addTokenFilter(LowerCaseFilterFactory.class)
   .addTokenFilter(StopFilterFactory.class, "ignoreCase", "false", "words", 
"stopwords.txt", "format", "wordset")
   .build();
{code}
to
{code:java}
Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
   .withTokenizer(StandardTokenizerFactory.NAME)
   .addTokenFilter(StandardFilterFactory.NAME)
   .addTokenFilter(LowerCaseFilterFactory.NAME)
   .addTokenFilter(StopFilterFactory.NAME, "ignoreCase", "false", "words", 
"stopwords.txt", "format", "wordset")
   .build();
{code}

> Deprecate methods in CustomAnalyzer.Builder which take factory classes
> --
>
> Key: LUCENE-8566
> URL: https://issues.apache.org/jira/browse/LUCENE-8566
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Uwe Schindler
>Priority: Minor
>
> CustomAnalyzer.Builder has methods which take implementation classes as 
> follows.
>  - withTokenizer(Class<? extends TokenizerFactory> factory, String... params)
>  - withTokenizer(Class<? extends TokenizerFactory> factory, 
> Map<String,String> params)
>  - addTokenFilter(Class<? extends TokenFilterFactory> factory, String... 
> params)
>  - addTokenFilter(Class<? extends TokenFilterFactory> factory, 
> Map<String,String> params)
>  - addCharFilter(Class<? extends CharFilterFactory> factory, String... params)
>  - addCharFilter(Class<? extends CharFilterFactory> factory, 
> Map<String,String> params)
> Since the builder also has methods which take service names, it seems that 
> the above methods are unnecessary and a little bit misleading. Giving 
> symbolic names is preferable to implementation factory classes, but for now, 
> users can write code depending on implementation classes.
> What do you think about deprecating those methods (adding {{@Deprecated}} 
> annotations) and deleting them in future releases? Those are called only by 
> test cases, so deleting them should have no impact on the current lucene/solr 
> codebase.
> If this proposal gains your consent, I will create a patch. (Let me know if I 
> missed some point. I'll close it.)






[jira] [Commented] (SOLR-13452) Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.

2019-08-16 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909070#comment-16909070
 ] 

Tomoko Uchida commented on SOLR-13452:
--

bq. probably not a bad time to start looking at support for intellij? I think 
from that perspective things are in fairly good shape.

Okay, I will try it out this weekend.

> Update the lucene-solr build from Ivy+Ant+Maven (shadow build) to Gradle.
> -
>
> Key: SOLR-13452
> URL: https://issues.apache.org/jira/browse/SOLR-13452
> Project: Solr
>  Issue Type: Improvement
>  Components: Build
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: gradle-build.pdf
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I took some things from the great work that Dat did in 
> [https://github.com/apache/lucene-solr/tree/jira/gradle] and took the ball a 
> little further.
>  
> When working with gradle in sub modules directly, I recommend 
> [https://github.com/dougborg/gdub]
> This gradle branch uses the following plugin for version locking, version 
> configuration and version consistency across modules: 
> [https://github.com/palantir/gradle-consistent-versions]
>  
> https://github.com/apache/lucene-solr/tree/jira/SOLR-13452_gradle_5






[jira] [Resolved] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-08-13 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-8933.
---
   Resolution: Fixed
 Assignee: Tomoko Uchida
Fix Version/s: 8.3
   master (9.0)

I merged the PRs, one for master and one for 8.x.

> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Tomoko Uchida
>Priority: Minor
> Fix For: master (9.0), 8.3
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}






[jira] [Commented] (LUCENE-8873) Improve analyzer factoryies' Javadoc.

2019-08-11 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904779#comment-16904779
 ] 

Tomoko Uchida commented on LUCENE-8873:
---

As a reminder, we cannot have attributes named "class" or "name" - these are 
reserved for Solr field type configuration and implicitly erased in the super 
class constructor. Can we treat this constraint explicitly here by throwing 
exceptions when a factory violates this assumption?  
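
A minimal sketch of the kind of fail-fast check suggested above, assuming a hypothetical helper class ({{ReservedArgCheck}}) and method ({{requireNoReservedKeys}}) that are not part of Lucene or Solr. The argument names passed in are illustrative only; the set of reserved keys is taken from the comment above.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not Lucene API): rejects factories that declare reserved argument names. */
final class ReservedArgCheck {

  /** Keys described above as reserved for Solr field type configuration. */
  static final Set<String> RESERVED_KEYS = Set.of("class", "name");

  /** Throws if any argument name declared by the factory collides with a reserved key. */
  static void requireNoReservedKeys(String factoryName, Collection<String> declaredArgs) {
    for (String key : declaredArgs) {
      if (RESERVED_KEYS.contains(key)) {
        throw new IllegalArgumentException(
            "Factory '" + factoryName + "' declares reserved argument '" + key + "'");
      }
    }
  }

  public static void main(String[] args) {
    // Passes silently: neither argument name is reserved.
    requireNoReservedKeys("exampleFilter", List.of("form", "mode"));

    // Fails fast: "name" is reserved, so the violation is reported instead of
    // the argument being silently dropped.
    try {
      requireNoReservedKeys("exampleFilter", List.of("name"));
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Where such a check would live (for example in a test that scans all registered factories, or in a base-class constructor) is left open here.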

> Improve analyzer factoryies' Javadoc.
> -
>
> Key: LUCENE-8873
> URL: https://issues.apache.org/jira/browse/LUCENE-8873
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Minor
>
> Currently, the documentation for analyzer factories (subclasses of 
> {{TokenizerFactory}}, {{CharFilterFactory}}, {{TokenFilterFactory}}) still 
> includes lots of Solr schema.xml examples and not all properties are 
> documented. From my perspective, the latter is more problematic because 
> users who want to use the factories have to refer to source code to know what 
> properties are defined.
> To improve documentation, XML examples should be removed for cleanup, and 
> instead, *all properties which can be passed to factory constructors should 
> be properly documented*.
> Documentation is often overlooked so some validation rules and 
> standardization effort would be desired (e.g. marking properties by 
> annotations).
>  






[jira] [Resolved] (SOLR-13593) Allow to look-up analyzer components by their SPI names in field type configuration

2019-08-11 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved SOLR-13593.
--
Resolution: Fixed

> Allow to look-up analyzer components by their SPI names in field type 
> configuration
> ---
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Commented] (SOLR-13593) Allow to look-up analyzer components by their SPI names in field type configuration

2019-08-11 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904777#comment-16904777
 ] 

Tomoko Uchida commented on SOLR-13593:
--

I will close (resolve) this issue for now. If someone is interested in or has 
thoughts about backporting to 8.x, please feel free to reopen it. (In short, we 
can expose this feature from 8.x except for the ICU factories.)

> Allow to look-up analyzer components by their SPI names in field type 
> configuration
> ---
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Updated] (SOLR-13593) Allow to look-up analyzer components by their SPI names in field type configuration

2019-08-11 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13593:
-
Fix Version/s: master (9.0)

> Allow to look-up analyzer components by their SPI names in field type 
> configuration
> ---
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Assigned] (SOLR-13593) Allow to look-up analyzer components by their SPI names in field type configuration

2019-08-11 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida reassigned SOLR-13593:


Assignee: Tomoko Uchida

> Allow to look-up analyzer components by their SPI names in field type 
> configuration
> ---
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Updated] (SOLR-13593) Allow to look-up analyzer components by their SPI names in field type configuration

2019-08-11 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13593:
-
Summary: Allow to look-up analyzer components by their SPI names in field 
type configuration  (was: Allow to specify analyzer components by their SPI 
names in schema definition)

> Allow to look-up analyzer components by their SPI names in field type 
> configuration
> ---
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
> Attachments: SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Commented] (SOLR-13593) Allow to specify analyzer components by their SPI names in schema definition

2019-08-11 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904596#comment-16904596
 ] 

Tomoko Uchida commented on SOLR-13593:
--

Here is the final patch: [^SOLR-13593.patch]

I opened follow-up tasks: [SOLR-13690] (migrate default schemas) and 
[SOLR-13691] (add the examples in Ref Guide).

This breaks two ICU factories so I cannot backport to the 8x branch as is :-/
We might be able to backport this with some fixes to keep backwards 
compatibility, but it could introduce other concerns/confusion. I think it 
would be better to leave the 8x branch unchanged - opinions or ideas?

> Allow to specify analyzer components by their SPI names in schema definition
> 
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
> Attachments: SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Created] (SOLR-13691) Add example field type configurations using "name" attributes to Ref Guide

2019-08-11 Thread Tomoko Uchida (JIRA)
Tomoko Uchida created SOLR-13691:


 Summary: Add example field type configurations using "name" 
attributes to Ref Guide
 Key: SOLR-13691
 URL: https://issues.apache.org/jira/browse/SOLR-13691
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
  Components: documentation
Reporter: Tomoko Uchida


This is a follow-up task for SOLR-13593.

To encourage users to use the "name" attribute in field type configurations, we 
should add examples that include "name" instead of "class" (and mark the old 
examples as "Legacy").






[jira] [Updated] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-08-11 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13690:
-
Description: 
This is a follow-up task for SOLR-13593.

To encourage users to use the "name" attribute in field type configurations, we 
should migrate all managed-schema files bundled with Solr.

  was:
This is a follow-up task for SOLR-13593.

To encourage users to use the "name" attribute in field type configurations, we 
should migrate all bundled managed-schema files bundled with Solr.


> Migrate field type configurations in default/example schema files to look up 
> factories by "name"
> 
>
> Key: SOLR-13690
> URL: https://issues.apache.org/jira/browse/SOLR-13690
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should migrate all managed-schema files bundled with Solr.






[jira] [Updated] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-08-11 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13690:
-
Description: 
This is a follow-up task for SOLR-13593.

To encourage users to use the "name" attribute in field type configurations, we 
should migrate all bundled managed-schema files bundled with Solr.

> Migrate field type configurations in default/example schema files to look up 
> factories by "name"
> 
>
> Key: SOLR-13690
> URL: https://issues.apache.org/jira/browse/SOLR-13690
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> This is a follow-up task for SOLR-13593.
> To encourage users to use the "name" attribute in field type configurations, 
> we should migrate all bundled managed-schema files bundled with Solr.






[jira] [Created] (SOLR-13690) Migrate field type configurations in default/example schema files to look up factories by "name"

2019-08-11 Thread Tomoko Uchida (JIRA)
Tomoko Uchida created SOLR-13690:


 Summary: Migrate field type configurations in default/example 
schema files to look up factories by "name"
 Key: SOLR-13690
 URL: https://issues.apache.org/jira/browse/SOLR-13690
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Schema and Analysis
Reporter: Tomoko Uchida









[jira] [Updated] (SOLR-13593) Allow to specify analyzer components by their SPI names in schema definition

2019-08-11 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated SOLR-13593:
-
Attachment: SOLR-13593.patch

> Allow to specify analyzer components by their SPI names in schema definition
> 
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
> Attachments: SOLR-13593.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Commented] (SOLR-13593) Allow to specify analyzer components by their SPI names in schema definition

2019-08-11 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904584#comment-16904584
 ] 

Tomoko Uchida commented on SOLR-13593:
--

ICU factory "name" argument was changed to "form" on the master branch, so the 
factories can be looked up by names (with "form" attributes to specify 
normalization form) like this:
{code:xml}

  


  



  


  

{code}
Corresponding field types using "class" are:
{code:xml}

  


  



  


  

{code}
This works for me and the branch passed the entire test suite. I will merge all 
the changes to the master branch soon.
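
To illustrate why a factory argument called "name" collides with name-based lookup, here is a minimal sketch. It is not Solr's actual implementation: the class {{SpiNameLookupSketch}} and the method {{extractSpiName}} are hypothetical, and the SPI name "icuNormalizer2" is used only as an example value.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not Solr code): "name" selects the factory, the rest are its arguments. */
final class SpiNameLookupSketch {

  /**
   * Removes the "name" attribute and returns it as the SPI name.
   * Whatever remains in the map is what the factory would receive as arguments,
   * so a factory can no longer have its own argument called "name".
   */
  static String extractSpiName(Map<String, String> attrs) {
    String spiName = attrs.remove("name");
    if (spiName == null) {
      throw new IllegalArgumentException("missing \"name\" attribute");
    }
    return spiName;
  }

  public static void main(String[] args) {
    Map<String, String> attrs = new HashMap<>();
    attrs.put("name", "icuNormalizer2"); // selects the factory (example value)
    attrs.put("form", "nfkc_cf");        // normalization form, passed to the factory

    String spiName = extractSpiName(attrs);
    System.out.println("factory: " + spiName + ", args: " + attrs);
  }
}
```

With the normalization form carried by "form", the "name" attribute is free to identify the factory itself.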

> Allow to specify analyzer components by their SPI names in schema definition
> 
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Resolved] (LUCENE-8948) Change "name" argument in ICU factories to "form"

2019-08-10 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-8948.
---
   Resolution: Fixed
 Assignee: Tomoko Uchida
Fix Version/s: master (9.0)

> Change "name" argument in ICU factories to "form"
> -
>
> Key: LUCENE-8948
> URL: https://issues.apache.org/jira/browse/LUCENE-8948
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
> Fix For: master (9.0)
>
> Attachments: LUCENE-8948.patch
>
>
> {{o.a.l.a.icu.ICUNormalizer2CharFilterFactory}} and 
> {{o.a.l.a.icu.ICUNormalizer2FilterFactory}} have "name" arguments to specify 
> Unicode Normalization Form. The "name" is vague and it causes a problem with 
> SOLR-13593.
> "form" would be suitable here instead of "name".






[jira] [Commented] (LUCENE-8948) Change "name" argument in ICU factories to "form"

2019-08-10 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904562#comment-16904562
 ] 

Tomoko Uchida commented on LUCENE-8948:
---

OK, in the [ICU factory 
documentation|https://lucene.apache.org/core/8_2_0/analyzers-icu/org/apache/lucene/analysis/icu/ICUNormalizer2FilterFactory.html],
 it's explicitly documented as follows:
{quote}name: A Unicode Normalization Form, one of 'nfc','nfkc', 'nfkc_cf'. 
Default is nfkc_cf.
{quote}
So it seems there is no need to worry about changing the parameter to "form" :)

Here is the patch that also includes tests and Javadoc changes: 
[^LUCENE-8948.patch]

> Change "name" argument in ICU factories to "form"
> -
>
> Key: LUCENE-8948
> URL: https://issues.apache.org/jira/browse/LUCENE-8948
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Minor
> Attachments: LUCENE-8948.patch
>
>
> {{o.a.l.a.icu.ICUNormalizer2CharFilterFactory}} and 
> {{o.a.l.a.icu.ICUNormalizer2FilterFactory}} have "name" arguments to specify 
> Unicode Normalization Form. The "name" is vague and it causes a problem with 
> SOLR-13593.
> "form" would be suitable here instead of "name".






[jira] [Updated] (LUCENE-8948) Change "name" argument in ICU factories to "form"

2019-08-10 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8948:
--
Attachment: LUCENE-8948.patch

> Change "name" argument in ICU factories to "form"
> -
>
> Key: LUCENE-8948
> URL: https://issues.apache.org/jira/browse/LUCENE-8948
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Minor
> Attachments: LUCENE-8948.patch
>
>
> {{o.a.l.a.icu.ICUNormalizer2CharFilterFactory}} and 
> {{o.a.l.a.icu.ICUNormalizer2FilterFactory}} have "name" arguments to specify 
> Unicode Normalization Form. The "name" is vague and it causes a problem with 
> SOLR-13593.
> "form" would be suitable here instead of "name".






[jira] [Commented] (LUCENE-8948) Change "name" argument in ICU factories to "form"

2019-08-10 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904548#comment-16904548
 ] 

Tomoko Uchida commented on LUCENE-8948:
---

I looked into the details of the parameter naming.

The factories' "name" parameter presumably comes from the corresponding 
parameter of the ICU4J Normalizer2 factory method:

[http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/Normalizer2.html#getInstance-java.io.InputStream-java.lang.String-com.ibm.icu.text.Normalizer2.Mode-]
{quote}data - the binary, big-endian normalization (.nrm file) data, or null 
for ICU data
 name - "nfc" or "nfkc" or "nfkc_cf" or name of custom data file
{quote}
Strictly speaking, the ICU4J normalizer's "name" is not the same thing as the 
"Unicode normalization form": it has a wider meaning, since it can also be the 
name of a custom data file. Nonetheless, "data" is always null when the Lucene 
ICU factories instantiate the normalizer, so changing the parameter to "form" 
looks okay to me from the standpoint of understandability.
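
To make the "form" terminology concrete, here is a minimal sketch of how such an argument selects a Unicode normalization form. It uses the JDK's java.text.Normalizer rather than ICU4J, so "nfkc_cf" is not covered. The class and method names are hypothetical and are not part of Lucene or ICU4J.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration only: maps a "form" argument to a JDK normalization form.
final class NormalizationFormSketch {

  // Translates the user-facing "form" value into the JDK enum.
  static Normalizer.Form parseForm(String form) {
    switch (form.toLowerCase(Locale.ROOT)) {
      case "nfc":
        return Normalizer.Form.NFC;
      case "nfkc":
        return Normalizer.Form.NFKC;
      default:
        // "nfkc_cf" (the Lucene default) has no JDK equivalent, so it is rejected here.
        throw new IllegalArgumentException("Unsupported form: " + form);
    }
  }

  // Normalizes the text with the form selected by the "form" argument.
  static String normalize(String text, String form) {
    return Normalizer.normalize(text, parseForm(form));
  }

  public static void main(String[] args) {
    // U+FB01 is the "fi" ligature: NFKC expands it, NFC leaves it unchanged.
    System.out.println(normalize("\uFB01", "nfkc"));
    System.out.println(normalize("\uFB01", "nfc"));
  }
}
```

The point is only that the value names a normalization form, which is why "form" reads more naturally than "name" for this argument.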

Just in case, [~thetaphi]: does that make sense to you?

> Change "name" argument in ICU factories to "form"
> -
>
> Key: LUCENE-8948
> URL: https://issues.apache.org/jira/browse/LUCENE-8948
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Minor
>
> {{o.a.l.a.icu.ICUNormalizer2CharFilterFactory}} and 
> {{o.a.l.a.icu.ICUNormalizer2FilterFactory}} have "name" arguments to specify 
> Unicode Normalization Form. The "name" is vague and it causes a problem with 
> SOLR-13593.
> "form" would be suitable here instead of "name".






[jira] [Commented] (SOLR-13593) Allow to specify analyzer components by their SPI names in schema definition

2019-08-10 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904475#comment-16904475
 ] 

Tomoko Uchida commented on SOLR-13593:
--

When I grepped the entire Lucene codebase, I found only two (ICU) factories that 
have a "name" attribute. I opened LUCENE-8948 as a blocker for this issue.

> Allow to specify analyzer components by their SPI names in schema definition
> 
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Updated] (LUCENE-8948) Change "name" argument in ICU factories to "form"

2019-08-10 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8948:
--
Issue Type: Improvement  (was: Task)

> Change "name" argument in ICU factories to "form"
> -
>
> Key: LUCENE-8948
> URL: https://issues.apache.org/jira/browse/LUCENE-8948
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Minor
>
> {{o.a.l.a.icu.ICUNormalizer2CharFilterFactory}} and 
> {{o.a.l.a.icu.ICUNormalizer2FilterFactory}} have "name" arguments to specify 
> Unicode Normalization Form. The "name" is vague and it causes a problem with 
> SOLR-13593.
> "form" would be suitable here instead of "name".






[jira] [Created] (LUCENE-8948) Change "name" argument in ICU factories to "form"

2019-08-10 Thread Tomoko Uchida (JIRA)
Tomoko Uchida created LUCENE-8948:
-

 Summary: Change "name" argument in ICU factories to "form"
 Key: LUCENE-8948
 URL: https://issues.apache.org/jira/browse/LUCENE-8948
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/analysis
Reporter: Tomoko Uchida


{{o.a.l.a.icu.ICUNormalizer2CharFilterFactory}} and 
{{o.a.l.a.icu.ICUNormalizer2FilterFactory}} have "name" arguments to specify 
Unicode Normalization Form. The "name" is vague and it causes a problem with 
SOLR-13593.

"form" would be suitable here instead of "name".






[jira] [Commented] (SOLR-13593) Allow to specify analyzer components by their SPI names in schema definition

2019-08-10 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904460#comment-16904460
 ] 

Tomoko Uchida commented on SOLR-13593:
--

FYI, the "name" presumably comes from the ICU4J library's method signature: 
{code}
public static Normalizer2 getInstance(InputStream data, String name, 
Normalizer2.Mode mode)
{code}

Anyway, I would also like to change the factory argument ("form", named after 
"Unicode normalization form", might be suitable). I will open an issue for this.

> Allow to specify analyzer components by their SPI names in schema definition
> 
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Commented] (SOLR-13593) Allow to specify analyzer components by their SPI names in schema definition

2019-08-10 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904430#comment-16904430
 ] 

Tomoko Uchida commented on SOLR-13593:
--

When running the entire test suite, I encountered a TokenFilterFactory that has 
a "name" argument: 
[https://lucene.apache.org/core/8_2_0/analyzers-icu/org/apache/lucene/analysis/icu/ICUNormalizer2FilterFactory.html]

So a field type definition that includes this filter looks like this:
{code:xml}
  

  
  

  
{code}
It's incompatible with the changes here of course...

There are a few options.

1. Keep allowing "class" and "name" together as is (only when the "name" is not 
an SPI name) and use "class" to look up the factory in that case.
 2. Forbid a "name" argument in factories and rename the existing "name" 
arguments to something else.
 3. Rethink the attribute name used to look up factories, because "name" is 
already taken.

I don't like option 1: it seems too confusing and makes it impossible to 
discard the "class" attribute in future releases. I also don't think we should 
take option 3 just because of a few anomalous classes.
 Option 2 makes sense to me. Can we fix the "name" args in the existing 
factories (maybe another LUCENE issue is needed) before proceeding? We may also 
need to delay exposing this feature until Solr 9.0 because it breaks backwards 
compatibility.

[~thetaphi]: Do you have any ideas about that?

> Allow to specify analyzer components by their SPI names in schema definition
> 
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Commented] (SOLR-13593) Allow to specify analyzer components by their SPI names in schema definition

2019-08-06 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901689#comment-16901689
 ] 

Tomoko Uchida commented on SOLR-13593:
--

I updated the pull request. If both "name" and "class" appear on the same 
element, a SolrException is thrown and error logs are emitted.

I've also tested this manually: (1) started a local Solr core with a manually 
modified managed-schema that has field types including the "name" property, and 
(2) added types including "name" via the REST API as well. This works for me and 
does not affect existing field types (those having "class"). The core can also 
be restarted without any problems after adding the types having "name", so the 
regenerated and saved managed-schema works fine.

I also created the service provider file for Solr's custom filters (there was 
none so far) so that they can be looked up by name.

// META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory
{code:java}
org.apache.solr.rest.schema.analysis.ManagedStopFilterFactory
org.apache.solr.rest.schema.analysis.ManagedSynonymFilterFactory
org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory
{code}
Let me know if there are any other things that would block this issue - I'd 
like to wait until this weekend and merge the changes into the ASF repo, if 
there are no objections.

> Allow to specify analyzer components by their SPI names in schema definition
> 
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Commented] (SOLR-13593) Allow to specify analyzer components by their SPI names in schema definition

2019-08-05 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900066#comment-16900066
 ] 

Tomoko Uchida commented on SOLR-13593:
--

Thanks [~dsmiley], I agree with you.

I will update the PR so that the property check works as follows:
 * when neither "name" nor "class" is passed: an exception is thrown.
 * when both "name" and "class" are passed: an exception is thrown.
 * when only "name" is passed: it's accepted.
 * when only "class" is passed: it's accepted for backwards compatibility 
(maybe we should deprecate this and emit some warnings after the default schema 
is changed).
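
The four cases could be sketched roughly as below. This is only an illustration, not the code in the PR: the class, enum and method names are hypothetical, the attributes are modelled as a plain Map, and IllegalArgumentException stands in for whatever exception Solr actually throws.

```java
import java.util.Map;

// Hypothetical sketch of the "name" / "class" attribute check described above.
final class FactoryAttributeCheck {

  /** Which attribute will be used to look up the analysis factory. */
  enum Lookup { BY_NAME, BY_CLASS }

  // Returns the lookup strategy, or throws if the attributes are missing or ambiguous.
  static Lookup check(Map<String, String> attrs) {
    boolean hasName = attrs.containsKey("name");
    boolean hasClass = attrs.containsKey("class");
    if (!hasName && !hasClass) {
      throw new IllegalArgumentException("Neither 'name' nor 'class' is specified: " + attrs);
    }
    if (hasName && hasClass) {
      throw new IllegalArgumentException("Specify either 'name' or 'class', not both: " + attrs);
    }
    // Only one of the two is present: "class" is still accepted for backwards compatibility.
    return hasName ? Lookup.BY_NAME : Lookup.BY_CLASS;
  }

  public static void main(String[] args) {
    System.out.println(check(Map.of("name", "whitespace")));
    System.out.println(check(Map.of("class", "solr.WhitespaceTokenizerFactory")));
  }
}
```

Rejecting the ambiguous case outright keeps the door open for removing "class" later without changing how existing schemas are interpreted.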

> Allow to specify analyzer components by their SPI names in schema definition
> 
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Commented] (LUCENE-8778) Define analyzer SPI names as static final fields and document the names in Javadocs

2019-08-04 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899643#comment-16899643
 ] 

Tomoko Uchida commented on LUCENE-8778:
---

Here is an additional patch for updating MIGRATE.txt: 
[^LUCENE-8778-migrate-note.patch]

> Define analyzer SPI names as static final fields and document the names in 
> Javadocs
> ---
>
> Key: LUCENE-8778
> URL: https://issues.apache.org/jira/browse/LUCENE-8778
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
> Fix For: master (9.0)
>
> Attachments: LUCENE-8778-koreanNumber.patch, 
> LUCENE-8778-migrate-note.patch, ListAnalysisComponents.java, 
> SPINamesGenerator.java, Screenshot from 2019-04-26 02-17-48.png, Screenshot 
> from 2019-05-25 23-25-24.png, TestSPINames.java
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Each built-in analysis component (factory of tokenizer / char filter / token 
> filter) has an SPI name, but currently this is not documented anywhere.
> The goals of this issue:
>  * Define SPI names as static final field for each analysis component so that 
> users can get the component by name (via {{NAME}} static field.) This also 
> provides compile time safety.
>  * Officially document the SPI names in Javadocs.
>  * Add proper source validation rules to ant {{validate-source-patterns}} 
> target so that we can make sure that all analysis components have correct 
> field definitions and documentation
> and,
>  * Look up SPI names via the new {{NAME}} fields instead of deriving them from 
> class names.
> (Just for quick reference) we now have:
>  * *19* Tokenizers ({{TokenizerFactory.availableTokenizers()}})
>  * *6* CharFilters ({{CharFilterFactory.availableCharFilters()}})
>  * *118* TokenFilters ({{TokenFilterFactory.availableTokenFilters()}})






[jira] [Updated] (LUCENE-8778) Define analyzer SPI names as static final fields and document the names in Javadocs

2019-08-04 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8778:
--
Attachment: LUCENE-8778-migrate-note.patch

> Define analyzer SPI names as static final fields and document the names in 
> Javadocs
> ---
>
> Key: LUCENE-8778
> URL: https://issues.apache.org/jira/browse/LUCENE-8778
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
> Fix For: master (9.0)
>
> Attachments: LUCENE-8778-koreanNumber.patch, 
> LUCENE-8778-migrate-note.patch, ListAnalysisComponents.java, 
> SPINamesGenerator.java, Screenshot from 2019-04-26 02-17-48.png, Screenshot 
> from 2019-05-25 23-25-24.png, TestSPINames.java
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Each built-in analysis component (factory of tokenizer / char filter / token 
> filter) has an SPI name, but currently this is not documented anywhere.
> The goals of this issue:
>  * Define SPI names as static final field for each analysis component so that 
> users can get the component by name (via {{NAME}} static field.) This also 
> provides compile time safety.
>  * Officially document the SPI names in Javadocs.
>  * Add proper source validation rules to ant {{validate-source-patterns}} 
> target so that we can make sure that all analysis components have correct 
> field definitions and documentation
> and,
>  * Look up SPI names via the new {{NAME}} fields instead of deriving them from 
> class names.
> (Just for quick reference) we now have:
>  * *19* Tokenizers ({{TokenizerFactory.availableTokenizers()}})
>  * *6* CharFilters ({{CharFilterFactory.availableCharFilters()}})
>  * *118* TokenFilters ({{TokenFilterFactory.availableTokenFilters()}})






[jira] [Commented] (SOLR-13593) Allow to specify analyzer components by their SPI names in schema definition

2019-08-04 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899602#comment-16899602
 ] 

Tomoko Uchida commented on SOLR-13593:
--

I have updated the pull request.

1. If both "name" and "class" are specified, this redundancy does not cause 
any error, but warnings are emitted when loading the schema. In this case "name" 
is given priority over "class". (In a future release "class" could be 
deprecated, so this behaviour makes sense to me. Any comments?)
 2. Added unit tests for loading field types from schema.xml and for creating 
them via the REST API.

LUCENE-8778 was backported with proper backwards compatibility (LUCENE-8911), 
so I think we can expose this feature in 8.x minor releases. After the pull 
request gets reviewed, I'd like to commit the changes to the master and 8x 
branches, then migrate the default schema file(s) and the examples in the Ref 
Guide.

> Allow to specify analyzer components by their SPI names in schema definition
> 
>
> Key: SOLR-13593
> URL: https://issues.apache.org/jira/browse/SOLR-13593
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Tomoko Uchida
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now each analysis factory has an explicitly documented SPI name, which is stored 
> in the static "NAME" field (LUCENE-8778).
>  Solr uses factories' simple class name in schema definition (like 
> class="solr.WhitespaceTokenizerFactory"), but we should be able to also use 
> more concise SPI names (like name="whitespace").
> e.g.:
> {code:xml}
> 
>   
> 
>  />
> 
>   
> 
> {code}
> would be
> {code:xml}
> 
>   
> 
> 
> 
>   
> 
> {code}






[jira] [Commented] (LUCENE-8764) Add "export all terms" feature to Luke

2019-08-03 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899432#comment-16899432
 ] 

Tomoko Uchida commented on LUCENE-8764:
---

FYI, I opened a follow-up issue LUCENE-8945.

> Add "export all terms" feature to Luke
> --
>
> Key: LUCENE-8764
> URL: https://issues.apache.org/jira/browse/LUCENE-8764
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Labels: beginner
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8764.patch, LUCENE-8764.patch, LUCENE-8764.patch, 
> LUCENE-8764.patch, Screenshot 2019-07-23 12.29.06.png, Screenshot 2019-07-24 
> 12.35.48.png, Screenshot 2019-07-24 12.36.00.png, Screenshot 2019-07-24 
> 12.36.27.png, Screenshot 2019-07-25 13.20.40.png, Screenshot 2019-07-25 
> 13.20.48.png, Screenshot 2019-07-25 13.21.03.png, Screenshot 2019-07-25 
> 13.25.23.png
>
>
> This is a migrated issue from previous Luke project in GitHub: 
> [https://github.com/DmitryKey/luke/issues/3] (There are users' requests so I 
> moved this from GitHub to Jira)
> You can browse terms in arbitrary field via Luke GUI, but in some cases 
> "exporting all terms (and optionally docids) to a file" feature would be 
> useful for further inspection. It might be similar to Solr's terms component.
> As for the user interface, "Export terms" button should be located in 
> Overview tab and/or Documents tab.
>  






[jira] [Created] (LUCENE-8945) Allow to change the output file delimiter on Luke "export terms" feature

2019-08-03 Thread Tomoko Uchida (JIRA)
Tomoko Uchida created LUCENE-8945:
-

 Summary: Allow to change the output file delimiter on Luke "export 
terms" feature
 Key: LUCENE-8945
 URL: https://issues.apache.org/jira/browse/LUCENE-8945
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/luke
Reporter: Tomoko Uchida


This is a follow-up issue for LUCENE-8764.
The delimiter is currently fixed to "," (comma), but terms can also contain 
commas, and they are not escaped. It would be better if the delimiter could be 
changed to a tab or whitespace when exporting.
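
To illustrate the problem, here is a minimal sketch. It assumes each exported line has the shape term + delimiter + docFreq with no escaping; the class and method names are hypothetical and are not taken from the Luke code.

```java
// Hypothetical sketch: shows why an unescaped comma delimiter is ambiguous.
final class TermExportSketch {

  // Joins a term and its document frequency with the given delimiter, without escaping.
  static String formatLine(String term, int docFreq, String delimiter) {
    return term + delimiter + docFreq;
  }

  // Counts the columns a naive reader would see when splitting on the delimiter.
  static int countColumns(String line, String delimiter) {
    return line.split(java.util.regex.Pattern.quote(delimiter)).length;
  }

  public static void main(String[] args) {
    String term = "1,000"; // a term that itself contains a comma

    String commaLine = formatLine(term, 3, ",");
    String tabLine = formatLine(term, 3, "\t");

    // The comma-delimited line splits into 3 columns instead of the intended 2.
    System.out.println(countColumns(commaLine, ","));
    // The tab-delimited line keeps the term intact and splits into 2 columns.
    System.out.println(countColumns(tabLine, "\t"));
  }
}
```

A tab only moves the ambiguity to terms that contain tabs, which is why a selectable delimiter (or proper escaping) is the more general fix.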






[jira] [Resolved] (LUCENE-8764) Add "export all terms" feature to Luke

2019-08-03 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-8764.
---
   Resolution: Fixed
 Assignee: Tomoko Uchida
Fix Version/s: 8.3
   master (9.0)

> Add "export all terms" feature to Luke
> --
>
> Key: LUCENE-8764
> URL: https://issues.apache.org/jira/browse/LUCENE-8764
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>  Labels: beginner
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8764.patch, LUCENE-8764.patch, LUCENE-8764.patch, 
> LUCENE-8764.patch, Screenshot 2019-07-23 12.29.06.png, Screenshot 2019-07-24 
> 12.35.48.png, Screenshot 2019-07-24 12.36.00.png, Screenshot 2019-07-24 
> 12.36.27.png, Screenshot 2019-07-25 13.20.40.png, Screenshot 2019-07-25 
> 13.20.48.png, Screenshot 2019-07-25 13.21.03.png, Screenshot 2019-07-25 
> 13.25.23.png
>
>
> This is a migrated issue from previous Luke project in GitHub: 
> [https://github.com/DmitryKey/luke/issues/3] (There are users' requests so I 
> moved this from GitHub to Jira)
> You can browse terms in arbitrary field via Luke GUI, but in some cases 
> "exporting all terms (and optionally docids) to a file" feature would be 
> useful for further inspection. It might be similar to Solr's terms component.
> As for the user interface, "Export terms" button should be located in 
> Overview tab and/or Documents tab.
>  






[jira] [Updated] (LUCENE-8764) Add "export all terms" feature to Luke

2019-08-03 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8764:
--
Attachment: LUCENE-8764.patch

> Add "export all terms" feature to Luke
> --
>
> Key: LUCENE-8764
> URL: https://issues.apache.org/jira/browse/LUCENE-8764
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Priority: Major
>  Labels: beginner
> Attachments: LUCENE-8764.patch, LUCENE-8764.patch, LUCENE-8764.patch, 
> LUCENE-8764.patch, Screenshot 2019-07-23 12.29.06.png, Screenshot 2019-07-24 
> 12.35.48.png, Screenshot 2019-07-24 12.36.00.png, Screenshot 2019-07-24 
> 12.36.27.png, Screenshot 2019-07-25 13.20.40.png, Screenshot 2019-07-25 
> 13.20.48.png, Screenshot 2019-07-25 13.21.03.png, Screenshot 2019-07-25 
> 13.25.23.png
>
>
> This is a migrated issue from previous Luke project in GitHub: 
> [https://github.com/DmitryKey/luke/issues/3] (There are users' requests so I 
> moved this from GitHub to Jira)
> You can browse terms in arbitrary field via Luke GUI, but in some cases 
> "exporting all terms (and optionally docids) to a file" feature would be 
> useful for further inspection. It might be similar to Solr's terms component.
> As for the user interface, "Export terms" button should be located in 
> Overview tab and/or Documents tab.
>  






[jira] [Commented] (LUCENE-8764) Add "export all terms" feature to Luke

2019-08-03 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899429#comment-16899429
 ] 

Tomoko Uchida commented on LUCENE-8764:
---

Here is the final patch:  [^LUCENE-8764.patch] 

> Add "export all terms" feature to Luke
> --
>
> Key: LUCENE-8764
> URL: https://issues.apache.org/jira/browse/LUCENE-8764
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Priority: Major
>  Labels: beginner
> Attachments: LUCENE-8764.patch, LUCENE-8764.patch, LUCENE-8764.patch, 
> LUCENE-8764.patch, Screenshot 2019-07-23 12.29.06.png, Screenshot 2019-07-24 
> 12.35.48.png, Screenshot 2019-07-24 12.36.00.png, Screenshot 2019-07-24 
> 12.36.27.png, Screenshot 2019-07-25 13.20.40.png, Screenshot 2019-07-25 
> 13.20.48.png, Screenshot 2019-07-25 13.21.03.png, Screenshot 2019-07-25 
> 13.25.23.png
>
>
> This is a migrated issue from previous Luke project in GitHub: 
> [https://github.com/DmitryKey/luke/issues/3] (There are users' requests so I 
> moved this from GitHub to Jira)
> You can browse terms in arbitrary field via Luke GUI, but in some cases 
> "exporting all terms (and optionally docids) to a file" feature would be 
> useful for further inspection. It might be similar to Solr's terms component.
> As for the user interface, "Export terms" button should be located in 
> Overview tab and/or Documents tab.
>  






[jira] [Commented] (LUCENE-8764) Add "export all terms" feature to Luke

2019-08-03 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899427#comment-16899427
 ] 

Tomoko Uchida commented on LUCENE-8764:
---

I made small changes to the patch:
- Fix precommit failures 
- Improve error handling
- Improve messages, adjust component layouts in the GUI

The revised patch was committed to the master and 8x branches.
Thanks [~lmenezes]!

> Add "export all terms" feature to Luke
> --
>
> Key: LUCENE-8764
> URL: https://issues.apache.org/jira/browse/LUCENE-8764
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Priority: Major
>  Labels: beginner
> Attachments: LUCENE-8764.patch, LUCENE-8764.patch, LUCENE-8764.patch, 
> Screenshot 2019-07-23 12.29.06.png, Screenshot 2019-07-24 12.35.48.png, 
> Screenshot 2019-07-24 12.36.00.png, Screenshot 2019-07-24 12.36.27.png, 
> Screenshot 2019-07-25 13.20.40.png, Screenshot 2019-07-25 13.20.48.png, 
> Screenshot 2019-07-25 13.21.03.png, Screenshot 2019-07-25 13.25.23.png
>
>
> This is a migrated issue from the previous Luke project on GitHub: 
> [https://github.com/DmitryKey/luke/issues/3] (There are user requests, so I 
> moved this from GitHub to Jira.)
> You can browse terms in an arbitrary field via the Luke GUI, but in some cases 
> an "export all terms (and optionally docids) to a file" feature would be 
> useful for further inspection. It might be similar to Solr's terms component.
> As for the user interface, an "Export terms" button should be located in 
> the Overview tab and/or the Documents tab.
>  






[jira] [Updated] (LUCENE-8937) Avoid agressive stemming on numbers in the FrenchMinimalStemmer

2019-07-30 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8937:
--
Component/s: modules/analysis

> Avoid agressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Adrien Gallou
>Assignee: Tomoko Uchida
>Priority: Minor
> Fix For: master (9.0)
>
> Attachments: 
> 0001-LUCENE-8937-Avoid-agressive-stemming-on-numbers-in-t.patch, 
> LUCENE-8937.patch
>
>
> Here is the discussion on the mailing list : 
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]
> The light stemmer removes the last character of a word if the last two
>  characters are identical.
>  We can see that here:
>  
> https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
>  In this light stemmer, there is a check to avoid altering the token if the
>  token is a number.
> The minimal stemmer also removes the last character of a word if the last
>  two characters are identical.
>  We can see that here:
>  
> https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> But in this minimal stemmer there is no check to see if the character is a
>  letter or not.
>  So when we have numeric tokens with the last two characters identical they
>  are altered.
> For example "1234567899" will be stemmed as "123456789".
> It would be great if it were not altered.
> Here is the same issue for the LightStemmer : 
> https://issues.apache.org/jira/browse/LUCENE-4063
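
The rule described in the issue above can be illustrated with a small standalone sketch. This is not the FrenchMinimalStemmer source; the class and method names are hypothetical, and only the final "drop a doubled last character" step is modelled. The lettersOnly flag stands in for the letter check the issue asks for.

```java
// Hypothetical sketch of the "doubled last character" rule from the issue.
// It does not reproduce the real FrenchMinimalStemmer; names are assumptions.
final class DoubledSuffixSketch {

  /** Returns the new token length after optionally dropping a doubled final character. */
  static int dropDoubledLastChar(char[] s, int len, boolean lettersOnly) {
    if (len < 2) {
      return len; // nothing to compare against
    }
    char last = s[len - 1];
    boolean doubled = last == s[len - 2];
    // With lettersOnly set, digits and other non-letters are left untouched.
    if (doubled && (!lettersOnly || Character.isLetter(last))) {
      return len - 1;
    }
    return len;
  }

  /** Convenience wrapper that works on strings. */
  static String apply(String token, boolean lettersOnly) {
    char[] buf = token.toCharArray();
    int newLen = dropDoubledLastChar(buf, buf.length, lettersOnly);
    return new String(buf, 0, newLen);
  }

  public static void main(String[] args) {
    System.out.println(apply("1234567899", false)); // 123456789 (behaviour reported in the issue)
    System.out.println(apply("1234567899", true));  // 1234567899 (number left intact)
    System.out.println(apply("abcdeff", true));     // abcdef (letters are still deduplicated)
  }
}
```

With the letter check enabled, numeric tokens such as "1234567899" pass through unchanged while tokens ending in a doubled letter are still shortened.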






[jira] [Resolved] (LUCENE-8937) Avoid agressive stemming on numbers in the FrenchMinimalStemmer

2019-07-30 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida resolved LUCENE-8937.
---
   Resolution: Fixed
 Assignee: Tomoko Uchida
Fix Version/s: master (9.0)

> Avoid agressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Gallou
>Assignee: Tomoko Uchida
>Priority: Minor
> Fix For: master (9.0)
>
> Attachments: 
> 0001-LUCENE-8937-Avoid-agressive-stemming-on-numbers-in-t.patch, 
> LUCENE-8937.patch
>
>
> Here is the discussion on the mailing list : 
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]
> The light stemmer removes the last character of a word if the last two
>  characters are identical.
>  We can see that here:
>  
> https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
>  In this light stemmer, there is a check to avoid altering the token if the
>  token is a number.
> The minimal stemmer also removes the last character of a word if the last
>  two characters are identical.
>  We can see that here:
>  
> https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> But in this minimal stemmer there is no check to see if the character is a
>  letter or not.
>  So when we have numeric tokens with the last two characters identical they
>  are altered.
> For example "1234567899" will be stemmed as "123456789".
> It would be great if it were not altered.
> Here is the same issue for the LightStemmer : 
> https://issues.apache.org/jira/browse/LUCENE-4063






[jira] [Commented] (LUCENE-8937) Avoid agressive stemming on numbers in the FrenchMinimalStemmer

2019-07-30 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896097#comment-16896097
 ] 

Tomoko Uchida commented on LUCENE-8937:
---

Committed to the master. Thank you [~agallou]!

> Avoid agressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Gallou
>Priority: Minor
> Attachments: 
> 0001-LUCENE-8937-Avoid-agressive-stemming-on-numbers-in-t.patch, 
> LUCENE-8937.patch
>
>
> Here is the discussion on the mailing list : 
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]
> The light stemmer removes the last character of a word if the last two
>  characters are identical.
>  We can see that here:
>  
> https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
>  In this light stemmer, there is a check to avoid altering the token if the
>  token is a number.
> The minimal stemmer also removes the last character of a word if the last
>  two characters are identical.
>  We can see that here:
>  
> https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> But in this minimal stemmer there is no check to see if the character is a
>  letter or not.
>  So when we have numeric tokens with the last two characters identical they
>  are altered.
> For example "1234567899" will be stemmed as "123456789".
> It would be great if it were not altered.
> Here is the same issue for the LightStemmer : 
> https://issues.apache.org/jira/browse/LUCENE-4063






[jira] [Commented] (LUCENE-8937) Avoid agressive stemming on numbers in the FrenchMinimalStemmer

2019-07-30 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895916#comment-16895916
 ] 

Tomoko Uchida commented on LUCENE-8937:
---

+1

I will commit it to master shortly. This changes the behaviour of the 
stemmer, so I won't backport it to 8.x. (It's not a bug fix but a design change; 
see the discussion on LUCENE-4063.)



> Avoid agressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Gallou
>Priority: Minor
> Attachments: 
> 0001-LUCENE-8937-Avoid-agressive-stemming-on-numbers-in-t.patch, 
> LUCENE-8937.patch
>
>
> Here is the discussion on the mailing list : 
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]
> The light stemmer removes the last character of a word if the last two
>  characters are identical.
>  We can see that here:
>  
> https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
>  In this light stemmer, there is a check to avoid altering the token if the
>  token is a number.
> The minimal stemmer also removes the last character of a word if the last
>  two characters are identical.
>  We can see that here:
>  
> https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> But in this minimal stemmer there is no check to see if the character is a
>  letter or not.
>  So when we have numeric tokens with the last two characters identical they
>  are altered.
> For example "1234567899" will be stemmed as "123456789".
> It would be great if it were not altered.
> Here is the same issue for the LightStemmer : 
> https://issues.apache.org/jira/browse/LUCENE-4063






[jira] [Commented] (LUCENE-8764) Add "export all terms" feature to Luke

2019-07-29 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895006#comment-16895006
 ] 

Tomoko Uchida commented on LUCENE-8764:
---

[~lmenezes]: Could you tell me your e-mail address so that I can credit you, 
with name and e-mail, as the author of this commit? (I cannot find it in JIRA.)

> Add "export all terms" feature to Luke
> --
>
> Key: LUCENE-8764
> URL: https://issues.apache.org/jira/browse/LUCENE-8764
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Priority: Major
>  Labels: beginner
> Attachments: LUCENE-8764.patch, LUCENE-8764.patch, LUCENE-8764.patch, 
> Screenshot 2019-07-23 12.29.06.png, Screenshot 2019-07-24 12.35.48.png, 
> Screenshot 2019-07-24 12.36.00.png, Screenshot 2019-07-24 12.36.27.png, 
> Screenshot 2019-07-25 13.20.40.png, Screenshot 2019-07-25 13.20.48.png, 
> Screenshot 2019-07-25 13.21.03.png, Screenshot 2019-07-25 13.25.23.png
>
>
> This is a migrated issue from the previous Luke project on GitHub: 
> [https://github.com/DmitryKey/luke/issues/3] (There are user requests, so I 
> moved this from GitHub to Jira.)
> You can browse terms in an arbitrary field via the Luke GUI, but in some cases 
> an "export all terms (and optionally docids) to a file" feature would be 
> useful for further inspection. It might be similar to Solr's terms component.
> As for the user interface, an "Export terms" button should be located in 
> the Overview tab and/or the Documents tab.
>  






[jira] [Commented] (LUCENE-8764) Add "export all terms" feature to Luke

2019-07-29 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895000#comment-16895000
 ] 

Tomoko Uchida commented on LUCENE-8764:
---

I have looked into the code.

+1 to the patch.
I'd like to make a few small changes to the UI (e.g., adjust component sizes and 
add some text about this feature) and commit it shortly.


> Add "export all terms" feature to Luke
> --
>
> Key: LUCENE-8764
> URL: https://issues.apache.org/jira/browse/LUCENE-8764
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/luke
>Reporter: Tomoko Uchida
>Priority: Major
>  Labels: beginner
> Attachments: LUCENE-8764.patch, LUCENE-8764.patch, LUCENE-8764.patch, 
> Screenshot 2019-07-23 12.29.06.png, Screenshot 2019-07-24 12.35.48.png, 
> Screenshot 2019-07-24 12.36.00.png, Screenshot 2019-07-24 12.36.27.png, 
> Screenshot 2019-07-25 13.20.40.png, Screenshot 2019-07-25 13.20.48.png, 
> Screenshot 2019-07-25 13.21.03.png, Screenshot 2019-07-25 13.25.23.png
>
>
> This is a migrated issue from the previous Luke project on GitHub: 
> [https://github.com/DmitryKey/luke/issues/3] (There are user requests, so I 
> moved this from GitHub to Jira.)
> You can browse terms in an arbitrary field via the Luke GUI, but in some cases 
> an "export all terms (and optionally docids) to a file" feature would be 
> useful for further inspection. It might be similar to Solr's terms component.
> As for the user interface, an "Export terms" button should be located in 
> the Overview tab and/or the Documents tab.
>  






[jira] [Updated] (LUCENE-8937) Avoid agressive stemming on numbers in the FrenchMinimalStemmer

2019-07-28 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8937:
--
Priority: Minor  (was: Major)

> Avoid agressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Gallou
>Priority: Minor
> Attachments: 0001-adds-test-cases-on-french-minimal-stemmer.patch, 
> 0002-check-if-the-last-character-is-a-letter-before-remov.patch, 
> SOLR-8937.patch
>
>
> Here is the discussion on the mailing list : 
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]
> The light stemmer removes the last character of a word if the last two
>  characters are identical.
>  We can see that here:
>  
> [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263]
>  In this light stemmer, there is a check to avoid altering the token if the
>  token is a number.
> The minimal stemmer also removes the last character of a word if the last
>  two characters are identical.
>  We can see that here:
>  
> [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77]
> But in this minimal stemmer there is no check to see if the character is a
>  letter or not.
>  So when we have numeric tokens with the last two characters identical they
>  are altered.
> For example "1234567899" will be stemmed as "123456789".
> It would be great if it were not altered.
> Here is the same issue for the LightStemmer : 
> https://issues.apache.org/jira/browse/LUCENE-4063






[jira] [Updated] (LUCENE-8937) Avoid agressive stemming on numbers in the FrenchMinimalStemmer

2019-07-28 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8937:
--
Issue Type: Improvement  (was: Bug)

> Avoid agressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Gallou
>Priority: Major
> Attachments: 0001-adds-test-cases-on-french-minimal-stemmer.patch, 
> 0002-check-if-the-last-character-is-a-letter-before-remov.patch, 
> SOLR-8937.patch
>
>
> Here is the discussion on the mailing list : 
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]
> The light stemmer removes the last character of a word if the last two
>  characters are identical.
>  We can see that here:
>  
> [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263]
>  In this light stemmer, there is a check to avoid altering the token if the
>  token is a number.
> The minimal stemmer also removes the last character of a word if the last
>  two characters are identical.
>  We can see that here:
>  
> [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77]
> But in this minimal stemmer there is no check to see if the character is a
>  letter or not.
>  So when we have numeric tokens with the last two characters identical they
>  are altered.
> For example "1234567899" will be stemmed as "123456789".
> It would be great if it were not altered.
> Here is the same issue for the LightStemmer : 
> https://issues.apache.org/jira/browse/LUCENE-4063






[jira] [Commented] (LUCENE-8937) Avoid agressive stemming on numbers in the FrenchMinimalStemmer

2019-07-28 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894944#comment-16894944
 ] 

Tomoko Uchida commented on LUCENE-8937:
---

Hi [~agallou],
the added isLetter() check looks okay to me.

Can you please merge the two patches (0001-.patch and 0002-.patch) into one 
patch? "{{LUCENE-8937.patch}}" is the correct file name here. And please remove 
"SOLR-8937.patch" to avoid confusion.
Also, can you add a few more tests for regressions and edge cases? I think the 
same kind of tests as for LUCENE-4063 would be needed.

[~steve_rowe] and [~rcmuir], do you have any thoughts or comments about this 
change?

> Avoid agressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Gallou
>Priority: Major
> Attachments: 0001-adds-test-cases-on-french-minimal-stemmer.patch, 
> 0002-check-if-the-last-character-is-a-letter-before-remov.patch, 
> SOLR-8937.patch
>
>
> Here is the discussion on the mailing list : 
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]
> The light stemmer removes the last character of a word if the last two
>  characters are identical.
>  We can see that here:
>  
> [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263]
>  In this light stemmer, there is a check to avoid altering the token if the
>  token is a number.
> The minimal stemmer also removes the last character of a word if the last
>  two characters are identical.
>  We can see that here:
>  
> [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77]
> But in this minimal stemmer there is no check to see if the character is a
>  letter or not.
>  So when we have numeric tokens with the last two characters identical they
>  are altered.
> For example "1234567899" will be stemmed as "123456789".
> It would be great if it were not altered.
> Here is the same issue for the LightStemmer : 
> https://issues.apache.org/jira/browse/LUCENE-4063






[jira] [Updated] (LUCENE-8936) Add SpanishMinimalStemFilter

2019-07-28 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8936:
--
Component/s: modules/analysis

> Add SpanishMinimalStemFilter
> 
>
> Key: LUCENE-8936
> URL: https://issues.apache.org/jira/browse/LUCENE-8936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: vinod kumar
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8936.patch, LUCENE-8936.patch
>
>
> SpanishMinimalStemmerFilter is a less aggressive stemmer than 
> SpanishLightStemmerFilter.
> Ex:
> input tokens -> output tokens
>  1. camiseta niños -> *camiseta* and *nino*
>  2. camisas -> camisa
> *camisetas* and *camisas* are t-shirts and shirts respectively.
>  Stemming both tokens to *camis* makes them match each other, so a query for 
> camisas (shirts) returns both t-shirts and shirts. 
> SpanishMinimalStemmerFilter will help handle these cases.
> More importantly, it preserves the gender context of tokens.
> Ex: *niños*, *niñas*, *chicos* and *chicas* are stemmed to *nino*, *nina*, 
> *chico* and *chica*.






[jira] [Updated] (LUCENE-8936) Add SpanishMinimalStemFilter

2019-07-28 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8936:
--
   Resolution: Fixed
Fix Version/s: 8.3
   master (9.0)
   Status: Resolved  (was: Patch Available)

I moved the change log entry to the 8.3.0 section, since this will be shipped 
with the next 8.3.0 release.

Thanks [~vinod1812]!

> Add SpanishMinimalStemFilter
> 
>
> Key: LUCENE-8936
> URL: https://issues.apache.org/jira/browse/LUCENE-8936
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: vinod kumar
>Assignee: Tomoko Uchida
>Priority: Major
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8936.patch, LUCENE-8936.patch
>
>
> SpanishMinimalStemmerFilter is a less aggressive stemmer than 
> SpanishLightStemmerFilter.
> Ex:
> input tokens -> output tokens
>  1. camiseta niños -> *camiseta* and *nino*
>  2. camisas -> camisa
> *camisetas* and *camisas* are t-shirts and shirts respectively.
>  Stemming both tokens to *camis* makes them match each other, so a query for 
> camisas (shirts) returns both t-shirts and shirts. 
> SpanishMinimalStemmerFilter will help handle these cases.
> More importantly, it preserves the gender context of tokens.
> Ex: *niños*, *niñas*, *chicos* and *chicas* are stemmed to *nino*, *nina*, 
> *chico* and *chica*.






[jira] [Updated] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-28 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8920:
--
Status: Open  (was: Patch Available)

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?
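
The two ideas quoted above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is purely illustrative and assumes a simplified cost model (a fixed number of bytes per arc, one slot per label in the covered range); the class, method, and parameter names are hypothetical and are not part of Lucene's FST code.

```java
// Hypothetical cost model for one FST node; not taken from Lucene.
final class DirectAddressingHeuristicSketch {

  /** Bytes used when only the arcs that actually exist are stored. */
  static long listBytes(int numArcs, int bytesPerArc) {
    return (long) numArcs * bytesPerArc;
  }

  /** Bytes used when every label in [minLabel, maxLabel] gets a full arc slot (gaps included). */
  static long directBytes(int minLabel, int maxLabel, int bytesPerArc) {
    return (long) (maxLabel - minLabel + 1) * bytesPerArc;
  }

  /** First idea: apply the encoding only if the size increase stays under a chosen factor. */
  static boolean useDirectAddressing(
      int numArcs, int minLabel, int maxLabel, int bytesPerArc, double maxExpansion) {
    return directBytes(minLabel, maxLabel, bytesPerArc)
        <= maxExpansion * listBytes(numArcs, bytesPerArc);
  }

  /** Second idea: a small per-label id table (label -> id), then densely packed arcs (id -> arc). */
  static long indirectBytes(
      int numArcs, int minLabel, int maxLabel, int bytesPerArc, int bytesPerId) {
    long idTable = (long) (maxLabel - minLabel + 1) * bytesPerId;
    return idTable + listBytes(numArcs, bytesPerArc);
  }

  public static void main(String[] args) {
    // Sparse node: 4 arcs spread over labels 'a'..'z', 16 bytes of metadata per arc.
    System.out.println(listBytes(4, 16));                           // 64
    System.out.println(directBytes('a', 'z', 16));                  // 416
    System.out.println(useDirectAddressing(4, 'a', 'z', 16, 4.0));  // false
    System.out.println(indirectBytes(4, 'a', 'z', 16, 1));          // 90
    // Dense node: 20 arcs over the same label range.
    System.out.println(useDirectAddressing(20, 'a', 'z', 16, 4.0)); // true
  }
}
```

Under these assumed numbers, the sparse node grows 6.5x with plain direct addressing, while the indirection table adds only 26 bytes on top of the packed arcs. The threshold of 4.0 is an arbitrary choice for the example.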






[jira] [Commented] (LUCENE-8936) Add SpanishMinimalStemFilter

2019-07-28 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894708#comment-16894708
 ] 

Tomoko Uchida commented on LUCENE-8936:
---

Hi [~vinod1812],

could you tell me your e-mail address so that I can credit you, with name and 
e-mail, as the author of the commit? (I cannot find it on the mailing list or in Jira.)

> Add SpanishMinimalStemFilter
> 
>
> Key: LUCENE-8936
> URL: https://issues.apache.org/jira/browse/LUCENE-8936
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: vinod kumar
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-8936.patch, LUCENE-8936.patch
>
>
> SpanishMinimalStemmerFilter is a less aggressive stemmer than 
> SpanishLightStemmerFilter.
> Ex:
> input tokens -> output tokens
>  1. camiseta niños -> *camiseta* and *nino*
>  2. camisas -> camisa
> *camisetas* and *camisas* are t-shirts and shirts respectively.
>  Stemming both tokens to *camis* makes them match each other, so a query for 
> camisas (shirts) returns both t-shirts and shirts. 
> SpanishMinimalStemmerFilter will help handle these cases.
> More importantly, it preserves the gender context of tokens.
> Ex: *niños*, *niñas*, *chicos* and *chicas* are stemmed to *nino*, *nina*, 
> *chico* and *chica*.






[jira] [Commented] (LUCENE-8936) Add SpanishMinimalStemFilter

2019-07-27 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894421#comment-16894421
 ] 

Tomoko Uchida commented on LUCENE-8936:
---

+1 to the patch.

Let us wait one day or so, then commit the changes to the master and 8.x branches.

> Add SpanishMinimalStemFilter
> 
>
> Key: LUCENE-8936
> URL: https://issues.apache.org/jira/browse/LUCENE-8936
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: vinod kumar
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-8936.patch, LUCENE-8936.patch
>
>
> SpanishMinimalStemmerFilter is a less aggressive stemmer than 
> SpanishLightStemmerFilter.
> Ex:
> input tokens -> output tokens
>  1. camiseta niños -> *camiseta* and *nino*
>  2. camisas -> camisa
> *camisetas* and *camisas* are t-shirts and shirts respectively.
>  Stemming both tokens to *camis* makes them match each other, so a query for 
> camisas (shirts) returns both t-shirts and shirts. 
> SpanishMinimalStemmerFilter will help handle these cases.
> More importantly, it preserves the gender context of tokens.
> Ex: *niños*, *niñas*, *chicos* and *chicas* are stemmed to *nino*, *nina*, 
> *chico* and *chica*.






[jira] [Commented] (LUCENE-8936) Add SpanishMinimalStemFilter

2019-07-27 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894402#comment-16894402
 ] 

Tomoko Uchida commented on LUCENE-8936:
---

[~vinod1812]: I noticed your name is credited in the {{SpanishMinimalStemmer}} 
Javadocs. Lucene/Solr source code doesn't carry {{@author}} tags or the names 
of people who donated code. Credits appear only in the commit log and 
CHANGES. Can you please remove it?

> Add SpanishMinimalStemFilter
> 
>
> Key: LUCENE-8936
> URL: https://issues.apache.org/jira/browse/LUCENE-8936
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: vinod kumar
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-8936.patch
>
>
> SpanishMinimalStemmerFilter is a less aggressive stemmer than 
> SpanishLightStemmerFilter.
> Ex:
> input tokens -> output tokens
>  1. camiseta niños -> *camiseta* and *nino*
>  2. camisas -> camisa
> *camisetas* and *camisas* are t-shirts and shirts respectively.
>  Stemming both tokens to *camis* makes them match each other, so a query for 
> camisas (shirts) returns both t-shirts and shirts. 
> SpanishMinimalStemmerFilter will help handle these cases.
> More importantly, it preserves the gender context of tokens.
> Ex: *niños*, *niñas*, *chicos* and *chicas* are stemmed to *nino*, *nina*, 
> *chico* and *chica*.






[jira] [Assigned] (LUCENE-8936) Add SpanishMinimalStemFilter

2019-07-27 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida reassigned LUCENE-8936:
-

Assignee: Tomoko Uchida

> Add SpanishMinimalStemFilter
> 
>
> Key: LUCENE-8936
> URL: https://issues.apache.org/jira/browse/LUCENE-8936
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: vinod kumar
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-8936.patch
>
>
> SpanishMinimalStemmerFilter is a less aggressive stemmer than 
> SpanishLightStemmerFilter.
> Ex:
> input tokens -> output tokens
>  1. camiseta niños -> *camiseta* and *nino*
>  2. camisas -> camisa
> *camisetas* and *camisas* are t-shirts and shirts respectively.
>  Stemming both tokens to *camis* makes them match each other, so a query for 
> camisas (shirts) returns both t-shirts and shirts. 
> SpanishMinimalStemmerFilter will help handle these cases.
> More importantly, it preserves the gender context of tokens.
> Ex: *niños*, *niñas*, *chicos* and *chicas* are stemmed to *nino*, *nina*, 
> *chico* and *chica*.






[jira] [Commented] (LUCENE-8936) Add SpanishMinimalStemFilter

2019-07-27 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894394#comment-16894394
 ] 

Tomoko Uchida commented on LUCENE-8936:
---

Hi [~vinod1812],

the patch looks fine. Actually I cannot review the {{SpanishMinimalStemmer}} 
class (I don't understand Spanish), but the other parts look okay to me. And this 
passed {{ant precommit}} (thanks!).

I will commit it after waiting 24 hours if there are no other comments.

 

About the GitHub PR: only the Lucene/Solr committers have write permission 
to the apache/lucene-solr repo, so you would have to fork the repo and open a 
pull request from the fork. But this time a patch has been provided, so you do 
not need to do so.

> Add SpanishMinimalStemFilter
> 
>
> Key: LUCENE-8936
> URL: https://issues.apache.org/jira/browse/LUCENE-8936
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: vinod kumar
>Priority: Major
> Attachments: LUCENE-8936.patch
>
>
> SpanishMinimalStemmerFilter is a less aggressive stemmer than 
> SpanishLightStemmerFilter.
> Ex:
> input tokens -> output tokens
>  1. camiseta niños -> *camiseta* and *nino*
>  2. camisas -> camisa
> *camisetas* and *camisas* are t-shirts and shirts respectively.
>  Stemming both tokens to *camis* makes them match each other, so a query for 
> camisas (shirts) returns both t-shirts and shirts. 
> SpanishMinimalStemmerFilter will help handle these cases.
> More importantly, it preserves the gender context of tokens.
> Ex: *niños*, *niñas*, *chicos* and *chicas* are stemmed to *nino*, *nina*, 
> *chico* and *chica*.






[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-27 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894349#comment-16894349
 ] 

Tomoko Uchida commented on LUCENE-8933:
---

I opened pull requests. Could you review them?

For master: [https://github.com/apache/lucene-solr/pull/809]

For 8x branch: [https://github.com/apache/lucene-solr/pull/810]

 

> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}






[jira] [Comment Edited] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-27 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894298#comment-16894298
 ] 

Tomoko Uchida edited comment on LUCENE-8933 at 7/27/19 6:33 AM:


{quote}Should we go further and check that the concatenation of the segments is 
equal to the surface form?
{quote}
+1

I will open two pull requests.
 - For master branch: Add an equality check between the surface form and its 
segmentation.
 - For 8x branch: Add a length check to avoid annoying runtime exceptions; in 
other words, "if the (concatenated) segment is longer than its surface form, 
throw an exception when loading a user dictionary".


was (Author: tomoko uchida):
{quote}Should we go further and check that the concatenation of the segments is 
equal to the surface form?
{quote}
+1

I will open two pull requests.
 - For master branch: Add an equality check between the surface form and its 
segmentation.
 - For 8x branch: Add a length check to avoid annoying runtime exceptions; in 
other words, "if the segmentation is longer than the surface form, throw an exception 
when loading a user dictionary".




[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-27 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894298#comment-16894298
 ] 

Tomoko Uchida commented on LUCENE-8933:
---

{quote}Should we go further and check that the concatenation of the segments is 
equal to the surface form?
{quote}
+1

I will open two pull requests.
 - For master branch: Add an equality check between the surface form and its 
segmentation.
 - For 8x branch: Add a length check to avoid annoying runtime exceptions; in 
other words, "if the segmentation is longer than the surface form, throw an exception 
when loading a user dictionary".
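
The difference between the two checks can be sketched as follows. This is only 
an illustration, not the code from either pull request: the class and method 
names are hypothetical, and the segments are assumed to be already split into 
an array.

{code:java}
// Hypothetical helper contrasting the two validation policies described above.
class SegmentationChecks {

  /** Strict policy (master): the concatenated segments must equal the surface form. */
  static boolean passesEqualityCheck(String surface, String... segments) {
    return String.join("", segments).equals(surface);
  }

  /** Lenient policy (8x): only reject segmentations longer than the surface form. */
  static boolean passesLengthCheck(String surface, String... segments) {
    return String.join("", segments).length() <= surface.length();
  }

  public static void main(String[] args) {
    System.out.println(passesEqualityCheck("aabbcc", "aa", "bb", "cc")); // true
    System.out.println(passesEqualityCheck("aabbcc", "a", "b", "c"));    // false
    System.out.println(passesLengthCheck("aabbcc", "a", "b", "c"));      // true
    System.out.println(passesLengthCheck("aabbcc", "aabb", "ccdd"));     // false
  }
}
{code}

Under these assumptions, the lenient check only prevents the out-of-bounds 
copy, while the strict check also rejects entries that are too short or that 
use different characters.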




[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-25 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893288#comment-16893288
 ] 

Tomoko Uchida commented on LUCENE-8933:
---

Just for clarification, let me summarize the problem here.
 - JapaneseTokenizer has a "search mode", which breaks up dictionary tokens 
(surface forms) into smaller segments and matches the input text against the 
segments (to increase search recall).
 - The user dictionary of JapaneseTokenizer allows users to specify arbitrary 
segmentation rules in addition to adding custom tokens.
 - e.g.: If a user entry {{"aabbcc,aa bb cc,aa bb cc,pos_tag"}} is given, the 
token stream for {{"aabbcc"}} should generate three tokens, {{"aa"}} {{"bb"}} 
{{"cc"}}.
 - The sum of the segment lengths is expected to be exactly equal to the length 
of the corresponding surface form (as [~jim.ferenczi] explained). If the 
concatenated segments are longer than the surface form, this assumption is 
violated, which causes an AIOOB when copying the region of the surface form.

For the purpose of format validation, I think it would be better to check whether 
the sum of the segment lengths is equal to the length of the surface form.
 I.e., we should also reject an entry such as {{"aabbcc,a b c,aa bb cc,pos_tag"}}, 
even though it does not cause any exception.
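
A minimal sketch of such a validation, applied to a whole dictionary line, 
could look like this. It is not the actual UserDictionary code: the class and 
method names are hypothetical, and the line is split on plain commas, whereas 
a real parser may need to handle quoted fields.

{code:java}
// Hypothetical validator for one user dictionary line of the form
// "surface,segmentation,readings,pos_tag".
class UserDictEntryCheck {

  /** Returns true if the segment lengths add up to the surface form length. */
  static boolean segmentLengthsMatch(String entry) {
    String[] columns = entry.split(",");
    if (columns.length < 2) {
      throw new IllegalArgumentException("Expected at least 2 columns: " + entry);
    }
    String surface = columns[0];
    // Segments are separated by whitespace in the second column.
    String[] segments = columns[1].trim().split("\\s+");
    int total = 0;
    for (String segment : segments) {
      total += segment.length();
    }
    return total == surface.length();
  }

  public static void main(String[] args) {
    System.out.println(segmentLengthsMatch("aabbcc,aa bb cc,aa bb cc,pos_tag")); // true
    System.out.println(segmentLengthsMatch("aabbcc,a b c,aa bb cc,pos_tag"));    // false
  }
}
{code}

With this rule the second entry is rejected at load time even though it would 
not trigger the exception at analysis time.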




[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-25 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892941#comment-16892941
 ] 

Tomoko Uchida commented on LUCENE-8933:
---

[~danmuzi]: thanks for the confirmation. Sorry, this relates only to 
JapaneseTokenizer's search mode and its user dictionary format. KoreanTokenizer 
and its user dictionary should not have the same problem.




[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-25 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892709#comment-16892709
 ] 

Tomoko Uchida commented on LUCENE-8933:
---

If there are no other opinions or objections, I'd like to create a patch that 
adds a validation rule to the UserDictionary.




[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-25 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892708#comment-16892708
 ] 

Tomoko Uchida commented on LUCENE-8933:
---

Thanks for your explanation and investigation; I agree with this policy.

bq. You don't need emojis or surrogate pairs to break this, just provide a rule 
where the length of the segmentation is greater than the input minus the 
whitespaces:

Just for confirmation, this entry works without any problem. (Here, the same emoji 
character appears in both the first and second columns. I think this should be 
allowed because some surrogate-pair Kanji are often used in specific 
situations such as person names.)

{code:java}
UserDictionary dict = UserDictionary.open(new 
StringReader("アメカン航空,アメカン航空,アメリカンコウクウ,カスタム用語"));
JapaneseTokenizer tok = new JapaneseTokenizer(dict, true, Mode.NORMAL);
tok.setReader(new StringReader("アメリカン航空"));
tok.reset();
tok.incrementToken();
{code}




[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-25 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892574#comment-16892574
 ] 

Tomoko Uchida commented on LUCENE-8933:
---

The surrogate-pair emoji character included in the user dictionary is 
problematic. Interestingly, it does not cause the error if it appears in the 
first column (for "normal" mode segmentation).
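
As background, the sketch below shows how Java counts a surrogate-pair 
character: it occupies two {{char}} units but is one code point, so lengths 
measured in chars and in characters can differ. The emoji from the report is 
not reproduced here; U+1F600 is an arbitrary stand-in, and the class name is 
hypothetical. This only illustrates the counting and is not a claim about the 
root cause.

{code:java}
// Hypothetical demo of UTF-16 length vs. code point count in Java strings.
class SurrogatePairLength {

  /** Number of UTF-16 char units, as returned by String.length(). */
  static int utf16Units(String s) {
    return s.length();
  }

  /** Number of Unicode code points in the string. */
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    String emoji = "\uD83D\uDE00"; // U+1F600, encoded as a surrogate pair
    String kanji = "\u822A\u7A7A"; // two characters in the Basic Multilingual Plane
    System.out.println(utf16Units(emoji) + " " + codePoints(emoji)); // 2 1
    System.out.println(utf16Units(kanji) + " " + codePoints(kanji)); // 2 2
  }
}
{code}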

Could KoreanTokenizer have the same issue?



