Re: Field.setStringValue
It's fixed now in JCC's trunk.

Andi..

> On Oct 10, 2019, at 05:18, Marc Jeurissen wrote:
>
> Ok thank you Andi.
> I’ll use the sidepath with the bytes for the moment.
> Hope it will get solved soon though.
RE: Field.setStringValue
Ok thank you Andi.
I’ll use the sidepath with the bytes for the moment.
Hope it will get solved soon though.

Kind regards,
Marc Jeurissen

Bibliotheek UAntwerpen
Stadscampus – Ve35.303
Venusstraat 35 – 2000 Antwerpen
marc.jeuris...@uantwerpen.be
T +32 3 265 49 71
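The bytes workaround mentioned in this message can be sketched as a small helper. This is only a sketch: `to_jcc_text` is a hypothetical name, not part of PyLucene or JCC, and it relies on the behaviour reported in this thread, namely that JCC accepts 'bytes' wherever a java.lang.String is expected and that UTF-8 encoded bytes arrived intact.

```python
def to_jcc_text(value):
    """Hypothetical helper: return UTF-8 bytes for a str, pass bytes through.

    Assumption (from the thread): an affected JCC build converts 'bytes'
    correctly but loses characters when given a 'str'.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode("utf-8")
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


value = '«Volgende facturen werden verstuurd aan de financiële dienst.»'
encoded = to_jcc_text(value)

# With an affected PyLucene build, the call would then be:
#   field.setStringValue(encoded)
```

Once a JCC build containing the fix is installed, the helper should no longer be needed and the str can be passed directly.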
Re: Field.setStringValue
On Wed, 9 Oct 2019, Andi Vajda wrote:

> There must be a bug in the Python 'str' -> Java 'String' conversion code.
> Any Java API such as field.setStringValue() that expects a java.lang.String()
> can be passed a 'str' or 'bytes', JCC auto-converts as needed. This is very
> likely where the bug is.

Digging a bit further, it doesn't seem to be a problem when using Python 2.
I'm not implying this is a python bug, strings are just very different
between python 2 and 3.

Andi..
Re: Field.setStringValue
On Wed, 9 Oct 2019, Marc Jeurissen wrote:

> Good day to you,
>
> I have the following issue when setting the value of a field, value
> containing a character > 160 (Pylucene 8.1.1, Python 3.7.2)
>
>     ...
>     (Pdb) field
>     [...]indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS>
>     (Pdb) value = '«Volgende facturen werden verstuurd aan de financiële dienst.»'
>     (Pdb) type(value)
>     <class 'str'>
>     (Pdb) field.setStringValue(value)
>     (Pdb) field
>     [...]stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS[...]facturen werden verstuurd aan de financiële dienst>>
>
> The field value has lost 2 characters.
>
> But when I encode value:
>
>     (Pdb) value = value.encode('utf-8')
>     (Pdb) value
>     b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xable dienst.\xc2\xbb'
>     (Pdb) field.setStringValue(value)
>     (Pdb) field
>     [...]stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS[...]facturen werden verstuurd aan de financiële dienst.»>>
>
> The field value is correct.
>
> So what does field.setStringValue expect: a string (as says the Lucene
> documentation) or a byte sequence?

Indeed, there is a problem. I was able to reproduce it with just
StringBuffer, no lucene involved at all:

    >>> from lucene import initVM
    >>> initVM()
    >>> b=b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xabledienst.\xc2\xbb'
    >>> a=b.decode('utf-8')
    >>> from java.lang import StringBuffer
    >>> StringBuffer(b)
    [...]financiëledienst.»>
    >>> StringBuffer(a)
    [...]
    >>> StringBuffer(a).length()
    59
    >>> StringBuffer(b).length()
    61
    >>> type(a)
    <class 'str'>
    >>> type(b)
    <class 'bytes'>

There must be a bug in the Python 'str' -> Java 'String' conversion code.
Any Java API such as field.setStringValue() that expects a java.lang.String()
can be passed a 'str' or 'bytes', JCC auto-converts as needed. This is very
likely where the bug is.

Andi..

> Thank you very much.
>
> Kind regards,
> Marc Jeurissen
>
> Bibliotheek UAntwerpen
> Stadscampus – Ve35.303
> Venusstraat 35 – 2000 Antwerpen
> marc.jeuris...@uantwerpen.be
> T +32 3 265 49 71
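To see which of the two lengths in the reproduction is the expected one, the counts can be checked in plain Python, with no JCC or Lucene involved. This sketch uses only built-in str and bytes operations. Java's String.length() counts UTF-16 code units, so the number of UTF-16 units is computed as well; the variable names other than `a` and `b` are illustrative.

```python
# Same byte string as in the reproduction (split only for line length).
b = (b'\xc2\xabVolgende facturen werden verstuurd aan de '
     b'financi\xc3\xabledienst.\xc2\xbb')
a = b.decode('utf-8')

char_count = len(a)                            # Python characters (code points)
byte_count = len(b)                            # UTF-8 bytes
utf16_units = len(a.encode('utf-16-le')) // 2  # what Java's length() would count
non_ascii = [c for c in a if ord(c) > 127]     # characters outside ASCII

print(char_count, byte_count, utf16_units, non_ascii)
# -> 61 64 61 ['«', 'ë', '»']
```

A correct conversion should therefore yield a Java string of length 61, which matches the result reported for StringBuffer(b); the 59 reported for StringBuffer(a) is two characters short.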