Re: Field.setStringValue

2019-10-10 Thread Andi Vajda
It's fixed now in JCC's trunk.

Andi..
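
One way to check whether an installed JCC build includes the fix is to repeat the comparison from the StringBuffer reproduction in this thread: hand the same text to a Java API once as 'str' and once as UTF-8 'bytes', and compare the lengths that come back. The sketch below is not part of JCC or PyLucene. The names conversion_is_consistent and java_length are hypothetical, and the two stand-in callables only imitate a correct and a faulty conversion so the example runs without a JVM.

```python
def conversion_is_consistent(java_length, text):
    """Return True if the str path and the UTF-8 bytes path agree on length.

    java_length is a hypothetical callable supplied by the caller. With
    PyLucene it could be `lambda s: StringBuffer(s).length()` once
    lucene.initVM() has been called (assumption, not tested here).
    """
    return java_length(text) == java_length(text.encode('utf-8'))


def healthy_length(value):
    # Stand-in for a correct conversion: bytes are decoded as UTF-8 first.
    if isinstance(value, bytes):
        value = value.decode('utf-8')
    return len(value)


def truncating_length(value):
    # Stand-in that imitates only the reported symptom:
    # two characters go missing on the str path.
    if isinstance(value, bytes):
        return len(value.decode('utf-8'))
    return len(value) - 2


sample = '«Volgende facturen werden verstuurd aan de financiëledienst.»'
print(conversion_is_consistent(healthy_length, sample))
print(conversion_is_consistent(truncating_length, sample))
```

Passing the real Java call in place of the stand-ins turns this into a quick regression check for a given installation.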



RE: Field.setStringValue

2019-10-10 Thread Marc Jeurissen
OK, thank you, Andi.
I’ll use the workaround of passing bytes for the moment.
I hope it gets fixed soon, though.


Kind regards,
Marc Jeurissen

Bibliotheek UAntwerpen
Stadscampus – Ve35.303
Venusstraat 35 – 2000 Antwerpen
marc.jeuris...@uantwerpen.be
T +32 3 265 49 71
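
The workaround mentioned in this message, passing UTF-8 bytes instead of a 'str', can be wrapped in a small helper so that it is easy to remove once a fixed JCC is installed. This is a minimal sketch: to_java_text is a hypothetical name, not a PyLucene or JCC function, and it relies only on the behaviour reported in this thread that 'bytes' arguments are converted correctly.

```python
def to_java_text(value):
    """Encode str values as UTF-8 bytes; pass bytes through unchanged."""
    if isinstance(value, str):
        return value.encode('utf-8')
    if isinstance(value, bytes):
        return value
    raise TypeError('expected str or bytes, got %s' % type(value).__name__)


value = '«Volgende facturen werden verstuurd aan de financiële dienst.»'
encoded = to_java_text(value)
print(encoded)
# With PyLucene, the call would then look like (assumption, needs a JVM):
#     field.setStringValue(to_java_text(value))
```

Keeping the encoding in one place means call sites do not need to change again when the helper is dropped.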






Re: Field.setStringValue

2019-10-09 Thread Andi Vajda


On Wed, 9 Oct 2019, Andi Vajda wrote:



> There must be a bug in the Python 'str' -> Java 'String' conversion code.
> Any Java API such as field.setStringValue() that expects a java.lang.String
> can be passed a 'str' or 'bytes'; JCC auto-converts as needed. This is very
> likely where the bug is.


Digging a bit further, it doesn't seem to be a problem when using Python 2.
I'm not implying this is a Python bug; strings are just very different
between Python 2 and 3.


Andi..





Re: Field.setStringValue

2019-10-09 Thread Andi Vajda


On Wed, 9 Oct 2019, Marc Jeurissen wrote:


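> Good day to you,
>
> I have the following issue when setting the value of a field to a string
> that contains a character with a code point above 160 (PyLucene 8.1.1,
> Python 3.7.2):
>
> ...
> (Pdb) field
> [...]indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS>
> (Pdb) value = '«Volgende facturen werden verstuurd aan de financiële dienst.»'
> (Pdb) type(value)
> <class 'str'>
> (Pdb) field.setStringValue(value)
> (Pdb) field
> [...]stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS[...]facturen werden verstuurd aan de financiële dienst>>
>
> The field value has lost 2 characters.
>
> But when I encode the value:
>
> (Pdb) value = value.encode('utf-8')
> (Pdb) value
> b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xable dienst.\xc2\xbb'
>
> (Pdb) field.setStringValue(value)
> (Pdb) field
> [...]stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS[...]facturen werden verstuurd aan de financiële dienst.»>>
>
> The field value is correct.
>
> So what does field.setStringValue expect: a string (as the Lucene
> documentation says) or a byte sequence?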


Indeed, there is a problem. I was able to reproduce it with just
StringBuffer, no Lucene involved at all:

>>> from lucene import initVM
>>> initVM()
>>> b = b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xabledienst.\xc2\xbb'
>>> a = b.decode('utf-8')
>>> from java.lang import StringBuffer
>>> StringBuffer(b)
[...]financiëledienst.»>
>>> StringBuffer(a)
[...]
>>> StringBuffer(a).length()
59
>>> StringBuffer(b).length()
61
>>> type(a)
<class 'str'>
>>> type(b)
<class 'bytes'>


There must be a bug in the Python 'str' -> Java 'String' conversion code.
Any Java API such as field.setStringValue() that expects a
java.lang.String can be passed a 'str' or 'bytes'; JCC auto-converts as
needed. This is very likely where the bug is.
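
For reference, the length a correct conversion should report can be worked out in plain Python, without a JVM. Java's String.length() counts UTF-16 code units, so the sketch below computes that number for the test string and contrasts it with the code point and UTF-8 byte counts. The helper name expected_java_length is hypothetical and exists only for this illustration.

```python
def expected_java_length(text):
    """Number of UTF-16 code units, which is what Java's String.length() counts."""
    return len(text.encode('utf-16-le')) // 2


a = '«Volgende facturen werden verstuurd aan de financiëledienst.»'
code_points = len(a)                  # Python 3 str length, in code points
utf8_bytes = len(a.encode('utf-8'))   # three characters need two bytes each
utf16_units = expected_java_length(a)
print(code_points, utf8_bytes, utf16_units)  # 61 64 61
```

The value 61 matches what StringBuffer(b).length() returned in the session above, whereas the 'str' path returned 59.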


Andi..



> Thank you very much.
>
> Kind regards,
> Marc Jeurissen
>
> Bibliotheek UAntwerpen
> Stadscampus – Ve35.303
> Venusstraat 35 – 2000 Antwerpen
> marc.jeuris...@uantwerpen.be
> T +32 3 265 49 71