Hi,

Thanks for the reply.

Does this mean we have to write this custom QParser ourselves?

Regards,
Edwin


On 3 February 2018 at 03:28, Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> : Have you managed to get the regex for this string in Chinese:
> : 预支款管理及账务处理办法 ?
>         ...
> : > An example of the string in Chinese is 预支款管理及账务处理办法
> : >
> : > The number of characters is 12, but the expected length should be 36.
>         ...
> : >> > So this would likely be different from what the operating system counts,
> : >> > as the operating system may consider each Chinese character as 3 to 4
> : >> > bytes. Which is probably why I could not find any record with
> : >> > subject:/.{255,}.*/
>
> Java regexes operate on unicode strings, so "." matches any *character*.
> There is no regex syntax to match an arbitrary "byte", so a regex-based
> approach is never going to be viable.
>
> Your best bet is to check the byte count when indexing -- but even then
> you'd need some custom code since things like
> FieldLengthUpdateProcessorFactory are well behaved and count the
> *characters* of the unicode strings.
>
> If you absolutely can't reindex, then you'd need a custom QParser that
> produced a custom Query object that iterated over the TermEnum looking at
> the buffers and counting the bytes in each term -- matching each doc
> associated with those terms.
>
>
>
> -Hoss
> http://www.lucidworks.com/
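
To make the character-versus-byte distinction discussed above concrete, here is a minimal sketch in plain Java. It is not Solr or Lucene code: the class name ByteLengthDemo and its helper methods are hypothetical, java.util.regex is used only as a stand-in for the regex query in the thread, and UTF-8 is assumed as the encoding whose byte count matters. A check like utf8Length() is the kind of thing that would have to run at index time, in custom code, to record a byte length.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class ByteLengthDemo {

    // Number of UTF-16 code units, which is what String.length() reports.
    static int charLength(String s) {
        return s.length();
    }

    // Number of bytes the string occupies when encoded as UTF-8 (assumed encoding).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the whole string has at least minChars characters, using the
    // same pattern shape as subject:/.{255,}.*/ from the thread.
    static boolean regexAtLeast(String s, int minChars) {
        return Pattern.compile(".{" + minChars + ",}.*").matcher(s).matches();
    }

    public static void main(String[] args) {
        String subject = "预支款管理及账务处理办法";
        System.out.println(charLength(subject));       // 12 characters
        System.out.println(utf8Length(subject));       // 36 bytes in UTF-8
        System.out.println(regexAtLeast(subject, 12)); // true: "." counts characters
        System.out.println(regexAtLeast(subject, 36)); // false: the regex never sees bytes
    }
}
```

The regex matches on the 12 characters and cannot be made to see the 36 bytes, which is why a pattern such as .{255,} finds nothing for strings that are long in bytes but short in characters.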
