Hi,

Thanks for the reply.
Does this mean we have to write this custom QParser ourselves?

Regards,
Edwin

On 3 February 2018 at 03:28, Chris Hostetter <hossman_luc...@fucit.org> wrote:

>
> : Have you managed to get the regex for this string in Chinese:
> : 预支款管理及账务处理办法 ?
>         ...
> : > An example of the string in Chinese is 预支款管理及账务处理办法
> : >
> : > The number of characters is 12, but the expected length should be 36.
>         ...
> : >> > So this would likely be different from what the operating system
> : >> > counts, as the operating system may consider each Chinese
> : >> > character as 3 to 4 bytes.
> : >> > Which is probably why I could not find any record with
> : >> > subject:/.{255,}.*/
>
> Java regexes operate on unicode strings, so "." matches any *character*.
> There is no regex syntax to match any "byte", so a regex-based approach
> is never going to be viable.
>
> Your best bet is to check the byte count when indexing -- but even then
> you'd need some custom code, since things like
> FieldLengthUpdateProcessorFactory are well behaved and count the
> *characters* of the unicode strings.
>
> If you absolutely can't reindex, then you'd need a custom QParser that
> produced a custom Query object that iterated over the TermEnum, looking at
> the buffers and counting the bytes in each term -- matching each doc
> associated with those terms.
>
>
>
> -Hoss
> http://www.lucidworks.com/
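
For reference, here is a minimal sketch in plain Java (no Solr classes) of the character-versus-byte difference discussed above. It assumes the length of interest is the UTF-8 encoded size; the class name ByteLengthDemo and the helpers utf8Length and exceedsByteLimit are made up for illustration and are not part of Solr or Lucene. The source file is assumed to be compiled as UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class ByteLengthDemo {

    // Number of bytes the string occupies when encoded as UTF-8
    // (assumption: UTF-8 is the encoding whose size matters here).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical helper: true if the UTF-8 encoding is longer than maxBytes.
    static boolean exceedsByteLimit(String s, int maxBytes) {
        return utf8Length(s) > maxBytes;
    }

    public static void main(String[] args) {
        String subject = "预支款管理及账务处理办法";

        // Unicode code points, which is what a regex quantifier counts.
        int chars = subject.codePointCount(0, subject.length());
        System.out.println("characters:  " + chars);
        System.out.println("UTF-8 bytes: " + utf8Length(subject));

        // The regex asks for at least 13 characters, so it does not match,
        // even though the encoded string is longer than 13 bytes.
        System.out.println("regex .{13,} matches: "
                + Pattern.matches(".{13,}", subject));
        System.out.println("exceeds 13 bytes:     "
                + exceedsByteLimit(subject, 13));
    }
}
```

A check like exceedsByteLimit could in principle be run on each field value at indexing time, but wiring it into Solr would still need the custom code mentioned in the reply; that part is not shown here.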