Hi Emir,

An example of the string in Chinese is 预支款管理及账务处理办法
The number of characters is 12, but the expected length is 36, i.e. the
size in bytes at 3 bytes per character.

Regards,
Edwin

On 4 January 2018 at 16:21, Emir Arnautović <emir.arnauto...@sematext.com> wrote:

> Hi Edwin,
> I don’t have enough knowledge of East Asian languages to know what number
> is expected when you ask for string length. Maybe you can try some of the
> regex Unicode settings and see if you get what you need: try setting the
> Unicode flag with (?U), or try using regex groups and ranges. If you
> provide an example string and the expected length, maybe we could provide
> you with a regex.
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
> > On 4 Jan 2018, at 04:37, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >
> > Hi Emir,
> >
> > So this would likely be different from what the operating system counts,
> > as the operating system may consider each Chinese character as 3 to 4
> > bytes. Which is probably why I could not find any record with
> > subject:/.{255,}.*/
> >
> > Are there other tools that we can use to query the length of data that
> > is already indexed and is not in standard English? (E.g. Chinese,
> > Japanese, etc.)
> >
> > Regards,
> > Edwin
> >
> > On 3 January 2018 at 23:51, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Edwin,
> >> I do not know, but my guess would be that each character is counted as 1
> >> in regex regardless of how many bytes it takes in the encoding used.
> >>
> >> Regards,
> >> Emir
> >>
> >>> On 3 Jan 2018, at 16:43, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >>>
> >>> Thanks for the reply.
> >>>
> >>> I am doing the search on existing data that has already been indexed, and
> >>> it is likely to be a one-time thing.
> >>>
> >>> This subject:/.{255,}.*/ works for English characters. However, there
> >>> are Chinese characters in some of the records. The length seems to be
> >>> more than 255, but they do not show up in the results.
> >>>
> >>> Do you know how the length for Chinese characters and other languages
> >>> is determined?
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>> On 3 January 2018 at 23:01, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> >>>
> >>>> Do that during indexing as Emir suggested. Specifically, use an
> >>>> UpdateRequestProcessor chain, probably with the Clone and FieldLength
> >>>> processors:
> >>>> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
> >>>>
> >>>> Regards,
> >>>> Alex.
> >>>>
> >>>> On 31 December 2017 at 22:00, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I would like to check whether it is possible to query a field whose
> >>>>> data is longer than a certain length.
> >>>>>
> >>>>> For example, I want to query for records where the field subject has
> >>>>> more than 255 bytes. Is it possible?
> >>>>>
> >>>>> I am currently using Solr 6.5.1.
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
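
The gap between the two counts discussed in this thread (12 characters versus an expected length of 36) can be reproduced outside Solr. The following is a minimal Python sketch, not anything Solr itself runs. It assumes the stored text is UTF-8 encoded, which the thread does not state, and the helper names char_length, utf8_byte_length and exceeds_byte_limit are made up for illustration.

```python
# Example string quoted in the newest message of this thread.
subject = "预支款管理及账务处理办法"


def char_length(text: str) -> int:
    """Count Unicode code points: one per character, regardless of encoding."""
    return len(text)


def utf8_byte_length(text: str) -> int:
    """Count the bytes the text occupies when encoded as UTF-8 (assumed encoding)."""
    return len(text.encode("utf-8"))


def exceeds_byte_limit(text: str, limit: int = 255) -> bool:
    """Return True when the UTF-8 size of the text is more than `limit` bytes."""
    return utf8_byte_length(text) > limit


if __name__ == "__main__":
    print(char_length(subject))       # 12
    print(utf8_byte_length(subject))  # 36

    # Eight repetitions give 96 characters but 288 bytes, so a test based on
    # character count (>= 255) misses a value that is over 255 bytes.
    long_subject = subject * 8
    print(char_length(long_subject) >= 255)   # False
    print(exceeds_byte_limit(long_subject))   # True
```

If the regex really does count one per character, as guessed above, this would explain why a value can be over 255 bytes and still not match .{255,}. A character-count threshold cannot reliably stand in for a byte limit on mixed text, because UTF-8 uses between 1 and 4 bytes per character.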