Hi Emir,

An example of the string in Chinese is 预支款管理及账务处理办法

The number of characters is 12, but the expected length is 36, i.e. the
length in bytes at 3 bytes per character.
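
To make the two numbers concrete, here is a minimal Python sketch (not Solr code) that counts the same string both ways. It assumes the data is encoded as UTF-8, which is consistent with 36 bytes for 12 characters; the helper name exceeds_byte_limit is made up for illustration.

```python
# Sketch only: character count vs. UTF-8 byte count for the example subject.
# Assumption: the "expected length" of 36 is the size in bytes under UTF-8.
subject = "预支款管理及账务处理办法"

char_count = len(subject)                  # number of Unicode code points
byte_count = len(subject.encode("utf-8"))  # number of bytes once UTF-8 encoded


def exceeds_byte_limit(text: str, limit: int = 255) -> bool:
    """Hypothetical helper: True if text needs more than `limit` bytes in UTF-8."""
    return len(text.encode("utf-8")) > limit


print(char_count, byte_count)  # 12 36
```

A check like exceeds_byte_limit could be run on the client side against stored field values, whereas a regex such as .{255,} appears to count characters rather than bytes.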

Regards,
Edwin


On 4 January 2018 at 16:21, Emir Arnautović <emir.arnauto...@sematext.com>
wrote:

> Hi Edwin,
> I don’t know enough about East Asian languages to say what number is
> expected when you ask for string length. Maybe you can try some of the
> regex Unicode settings and see if you get what you need: try setting the
> Unicode flag with (?U), or try using regex groups and ranges. If you provide
> an example string and its expected length, maybe we could provide a regex.
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 4 Jan 2018, at 04:37, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> > Hi Emir,
> >
> > So this would likely be different from what the operating system counts,
> > as the operating system may consider each Chinese character as 3 to 4
> > bytes. That is probably why I could not find any record with
> > subject:/.{255,}.*/
> >
> > Are there other tools that we can use to query the length of data that is
> > already indexed and is not in standard English? (E.g. Chinese, Japanese,
> > etc.)
> >
> > Regards,
> > Edwin
> >
> > On 3 January 2018 at 23:51, Emir Arnautović <emir.arnauto...@sematext.com>
> > wrote:
> >
> >> Hi Edwin,
> >> I do not know, but my guess would be that each character is counted as 1
> >> in regex, regardless of how many bytes it takes in the encoding used.
> >>
> >> Regards,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 3 Jan 2018, at 16:43, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>> wrote:
> >>>
> >>> Thanks for the reply.
> >>>
> >>> I am doing the search on existing data that has already been indexed,
> >>> and it is likely to be a one-time thing.
> >>>
> >>> This  subject:/.{255,}.*/  works for English characters. However, there
> >>> are Chinese characters in some of the records. Their length seems to be
> >>> more than 255, but they do not show up in the results.
> >>>
> >>> Do you know how the length for Chinese characters and other languages
> >>> is determined?
> >>>
> >>> Regards,
> >>> Edwin
> >>>
> >>>
> >>> On 3 January 2018 at 23:01, Alexandre Rafalovitch <arafa...@gmail.com>
> >>> wrote:
> >>>
> >>>> Do that during indexing as Emir suggested. Specifically, use an
> >>>> UpdateRequestProcessor chain, probably with the Clone and FieldLength
> >>>> processors:
> >>>> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
> >>>>
> >>>> Regards,
> >>>>  Alex.
> >>>>
> >>>> On 31 December 2017 at 22:00, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>>> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I would like to check whether it is possible to query a field that has
> >>>>> data of more than a certain length.
> >>>>>
> >>>>> For example, I want to find records where the field subject has more
> >>>>> than 255 bytes. Is that possible?
> >>>>>
> >>>>> I am currently using Solr 6.5.1.
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
> >>>>
> >>
> >>
>
>
