Re: Text search in Arabic

2021-05-20 Thread Walter Underwood
I recommend normalizing all characters with a compatibility transformation, 
whether they are Arabic or not. 

We use this charFilter as the first step in every query and indexing analysis 
chain.



You’ll also need to include the ICU library, which really ought to be included 
by default. In fact, the compatibility normalization ought to be done by 
default, too. That transform was designed specifically for string matching and 
search.
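
To make the effect of compatibility normalization concrete, here is a minimal sketch in Python. It assumes the transformation meant above is Unicode NFKC, and it uses only the standard library rather than ICU or Solr. The helper name `normalize_for_search` is hypothetical, and case folding is deliberately left out.

```python
import unicodedata


def normalize_for_search(text):
    """Apply Unicode compatibility normalization (NFKC) to text.

    Assumption: NFKC stands in for the compatibility transformation
    recommended in this message; no case folding is applied.
    """
    return unicodedata.normalize("NFKC", text)


# Sample inputs: an Arabic presentation-form ligature (LAM WITH ALEF,
# U+FEFB), a Latin "fi" ligature (U+FB01), and fullwidth digits.
samples = ["\uFEFB", "\uFB01nd", "\uFF11\uFF12\uFF13"]
for sample in samples:
    print(repr(sample), "->", repr(normalize_for_search(sample)))
```

Applying the same function to both the indexed text and the query text is what lets variant encodings of the same word match.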

We have this in every solrconfig.xml.

  
  
  

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Text search in Arabic

2021-05-20 Thread Uwe Schindler
The Arabic analyzer is only for the Arabic language, not for other languages 
written in the Arabic script.

If you don't know the language and just want to help people search across 
different scripts (for example, searching Arabic text with Latin letters), see 
my other answer.

Uwe


--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Text search in Arabic

2021-05-20 Thread Uwe Schindler
Hi,

In answer to your question about character substitutions: the ICU library does 
this with ICU transforms. A transform can, for example, convert all Cyrillic 
text to Latin during indexing and search. This greatly helps people find 
things.

A great example of such a transform is in Elasticsearch's documentation. I 
regularly use it when the language of the text is unknown and the text can 
only be tokenized: 
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-transform.html

The example there transliterates any text to Latin characters, then decomposes 
umlauts and accents, strips the accents after the decomposition, and 
recomposes the remaining characters. After that you have tokens in mostly 
Latin script without any accents.

You can also use this in Solr or in pure Lucene (ICUTransformFilter).
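
The decompose, strip, and recompose steps described above can be sketched with Python's standard library. This is only an illustration under stated assumptions: it is not the ICU transform itself, the transliteration-to-Latin step is omitted because the standard library has no equivalent, and the helper name `strip_marks` is hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Decompose (NFD), drop nonspacing marks (category Mn), recompose (NFC).

    Assumption: this mirrors only the accent-stripping part of the
    transform described above; no script conversion is performed.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


# Sample inputs: Latin text with an umlaut and an accent, the dotted
# capital I (U+0130), and an Arabic word written with vowel marks.
samples = [
    "M\u00FCller caf\u00E9",
    "\u0130",
    "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F",
]
for sample in samples:
    print(repr(sample), "->", repr(strip_marks(sample)))
```

Because Arabic vowel marks are nonspacing marks as well, the same three steps also remove them, so vocalized and unvocalized spellings of a word end up as the same token.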

Uwe


--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Text search in Arabic

2021-05-20 Thread Mete Kural
Hello Michael,

Thank you very much for this information.

I will also try java-u...@lucene.apache.org.

By the way, is the Arabic analyzer referenced here 
(https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar) 
just for the Arabic language, or for all languages written in the Arabic script?

Thank you,
Mete





Re: Text search in Arabic

2021-05-20 Thread Michael Wechner

Hi Mete

You might also want to try the java-u...@lucene.apache.org mailing list

https://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg

Regarding languages other than English, you might find more information at

https://cwiki.apache.org/confluence/display/lucene/LuceneFAQ#LuceneFAQ-CanIuseLucenetoindextextinChinese,Japanese,Korean,andothermulti-bytecharactersets?

However, I just noticed that the following link no longer works:

https://lucene.apache.org/core/lucene-sandbox/

Are these analyzers now inside

https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis
https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar

?

Thanks

Michael






-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene 9.0 snapshot names

2021-05-20 Thread Uwe Schindler
The default value of this system property is "SNAPSHOT", and the timestamp then 
comes from Maven's internal logic; this cannot be changed.

By overriding the suffix explicitly (as mentioned before, and as Jenkins does), 
you turn the build into an official "release" in Maven's sense, and it is no 
longer a snapshot. You are then free to choose the version string.

Uwe


--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Lucene 9.0 snapshot names

2021-05-20 Thread Uwe Schindler
Jenkins does this already: 
https://ci-builds.apache.org/job/Lucene/job/Lucene-Artifacts-main/242/

It uses the build number!

The system property `version.suffix` is responsible for this and is set by 
Jenkins. See the command line:

[Lucene-Artifacts-main] $ /home/jenkins/jenkins-slave/workspace/Lucene/Lucene-Artifacts-main/gradlew -Dlucene.javadoc.url=https://ci-builds.apache.org/job/Lucene/job/Lucene-Artifacts-main/javadoc -Dversion.suffix=jenkins242 assemble
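
For the git SHA case raised in this thread, the same property could be set from a small script. The following is a sketch only. It assumes that `mavenToLocalFolder` honours `version.suffix` the same way `assemble` does in the Jenkins command above, which is not confirmed here. The helper names `current_sha` and `build_command` are hypothetical.

```python
import subprocess


def current_sha(repo_dir="."):
    """Return the abbreviated SHA of HEAD in repo_dir (needs git on PATH)."""
    result = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def build_command(suffix, task="mavenToLocalFolder"):
    """Build the gradlew argument list with an explicit version suffix.

    Assumption: the chosen task reads the version.suffix system property.
    """
    if not suffix or any(ch.isspace() for ch in suffix):
        raise ValueError("suffix must be a non-empty token without whitespace")
    return ["./gradlew", f"-Dversion.suffix={suffix}", task]
```

Calling `build_command(current_sha())` from a Lucene checkout yields the argument list to hand to `subprocess.run`. The sketch does not start a build itself.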

Uwe


--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Text search in Arabic

2021-05-20 Thread Mete Kural
Hello Lucene Community,

I hope this finds you all well. I would like to ask whether this is the right 
medium to discuss some matters surrounding text search in relation to variant 
Unicode encodings of words in Arabic and other Arabic-script languages. This is 
not a great example, but these matters are similar to those around Latin-script 
searches, where the letter “İ” needs to be substituted with “I”, and so forth. 
Would this mailing list be the best medium to discuss such matters? If not, 
would you mind recommending a medium for discussion on this?

Kind regards,
Mete Kural



Re: Lucene 9.0 snapshot names

2021-05-20 Thread Michael Sokolov
In principle it makes sense, but is there any chance the build artifact
could vary for the same SHA? We hope not, I think, but stranger things have
happened. It is probably an edge case not worth worrying about, though, and
relying on the build server's clock doesn't seem great, so +1 from me,
although I don't use these, so my interest is mostly theoretical.



Lucene 9.0 snapshot names

2021-05-20 Thread Alan Woodward
Hi all,

I’m preparing a local lucene 9.0 snapshot build and I notice that the jar files 
generated by `./gradlew mavenToLocalFolder` are called something like 
`lucene-suggest-9.0.0-20210520.111833-1-javadoc.jar` - in other words, they are 
including a timestamp.  For my setup I’d like to replace this with the git SHA 
of the commit the snapshot is based on.  So I have two questions:

1) Is there a simple override or gradle property that I can pass on the command 
line that will change the output names of artefacts?
2) I think in general commit SHAs are better than timestamps for snapshot names 
- two identical snapshots taken from identical sources at different times 
shouldn’t really have different names.  Should we look at changing the existing 
snapshot generation code to switch to using SHAs?

- Alan