Hi Edwin,
Unfortunately, I was not able to find a regex that would work in your case.
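For reference, the gap between the two numbers can be checked outside Solr. The following is a minimal Python sketch, not something Solr runs: it assumes the expected length is the UTF-8 byte length, and the helper name `utf8_len` is made up for illustration. It compares the character count, which is presumably what `.` counts in the regex, with the encoded byte count.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8.

    Hypothetical helper for illustration; not part of Solr or Lucene.
    """
    return len(text.encode("utf-8"))


# Example string from this thread.
subject = "预支款管理及账务处理办法"

char_count = len(subject)       # number of Unicode code points
byte_count = utf8_len(subject)  # number of UTF-8 bytes (assumed encoding)

print(char_count, byte_count)   # prints: 12 36
```

Under that assumption, a subject made only of 3-byte characters passes 255 bytes at 86 characters, well below the 255 characters that `.{255,}` requires, which would explain why those records do not match.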
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 1 Feb 2018, at 05:42, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
> Hi,
>
> Have you managed to get the regex for this string in Chinese: 预支款管理及账务处理办法 ?
>
> Regards,
> Edwin
>
>
> On 4 January 2018 at 18:04, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
>> Hi Emir,
>>
>> An example of the string in Chinese is 预支款管理及账务处理办法
>>
>> The number of characters is 12, but the expected length should be 36.
>>
>> Regards,
>> Edwin
>>
>>
>> On 4 January 2018 at 16:21, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
>>
>>> Hi Edwin,
>>> I don’t have enough knowledge of eastern languages to know what the
>>> expected number is when you ask for string length. Maybe you can try some
>>> of the regex unicode settings and see if you’ll get what you need: try
>>> setting the unicode flag with (?U) or try using regex groups and ranges.
>>> If you provide an example string and the expected length, maybe we could
>>> provide you with a regex.
>>>
>>> Thanks,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>>
>>>> On 4 Jan 2018, at 04:37, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>>>
>>>> Hi Emir,
>>>>
>>>> So this would likely be different from what the operating system counts,
>>>> as the operating system may consider each Chinese character as 3 to 4
>>>> bytes. Which is probably why I could not find any record with
>>>> subject:/.{255,}.*/
>>>>
>>>> Are there other tools that we can use to query the length for data that
>>>> is already indexed and is not in the standard English language? (e.g.
>>>> Chinese, Japanese, etc.)
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>> On 3 January 2018 at 23:51, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
>>>>
>>>>> Hi Edwin,
>>>>> I do not know, but my guess would be that each character is counted as
>>>>> 1 in regex regardless of how many bytes it takes in the encoding used.
>>>>>
>>>>> Regards,
>>>>> Emir
>>>>> --
>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>
>>>>>
>>>>>> On 3 Jan 2018, at 16:43, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>>>>>
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> I am doing the search on existing data that has already been indexed,
>>>>>> and it is likely to be a one-time thing.
>>>>>>
>>>>>> This subject:/.{255,}.*/ works for English characters. However, there
>>>>>> are Chinese characters in some of the records. The length seems to be
>>>>>> more than 255, but they do not show up in the results.
>>>>>>
>>>>>> Do you know how the length for Chinese characters and other languages
>>>>>> is being determined?
>>>>>>
>>>>>> Regards,
>>>>>> Edwin
>>>>>>
>>>>>>
>>>>>> On 3 January 2018 at 23:01, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>>>
>>>>>>> Do that during indexing as Emir suggested. Specifically, use an
>>>>>>> UpdateRequestProcessor chain, probably with the Clone and FieldLength
>>>>>>> processors:
>>>>>>> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
>>>>>>>
>>>>>>> Regards,
>>>>>>> Alex.
>>>>>>>
>>>>>>> On 31 December 2017 at 22:00, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I would like to check if it is possible to query a field which has
>>>>>>>> data of more than a certain length.
>>>>>>>>
>>>>>>>> For example, I want to query the field subject where it has more
>>>>>>>> than 255 bytes. Is it possible?
>>>>>>>>
>>>>>>>> I am currently using Solr 6.5.1.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Edwin