Re: C API PyObject_CallFunctionObjArgs returns incorrect result

2022-03-07 Thread Jen Kris via Python-list
Thanks to MRAB and Chris Angelico for your help.  Here is how I implemented the 
string conversion, and it works correctly now for a library call that needs a 
list converted to a string (error handling not shown):

// Join the list items with a space: the C equivalent of " ".join(sentence).
PyObject* separator = PyUnicode_FromString(" ");
PyObject* str_join = PyUnicode_Join(separator, pSentence);
Py_DECREF(separator);
// Look up nltk.word_tokenize and call it on the joined string.
PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr, "word_tokenize");
PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_join, NULL);

That produces what I need (this is the REPR of pWTok):

"['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

Thanks again to both of you. 

Jen
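
For readers more at home in Python, the C calls in this message map roughly 
onto the Python-level steps below. This is only a sketch: fake_nltk is a 
hypothetical stand-in so the example runs without nltk installed, and 
str.split is not nltk.word_tokenize (it only splits on whitespace).

```python
from types import SimpleNamespace

# Hypothetical stand-in for the nltk module (assumption, not the real library).
# str.split only splits on whitespace; nltk.word_tokenize does much more.
fake_nltk = SimpleNamespace(word_tokenize=str.split)

# The sentence from this thread, as a list of strings (pSentence in the C code).
sentence = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

separator = " "                                  # PyUnicode_FromString(" ")
str_join = separator.join(sentence)              # PyUnicode_Join(separator, pSentence)
tokenize = getattr(fake_nltk, "word_tokenize")   # PyObject_GetAttrString(..., "word_tokenize")
tokens = tokenize(str_join)                      # PyObject_CallFunctionObjArgs(func, str_join, NULL)

print(str_join)
print(len(tokens))
```

The point of the mapping is that the tokenizer receives the joined text, not 
the printable form of the list.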


Mar 7, 2022, 11:03 by pyt...@mrabarnett.plus.com:

> On 2022-03-07 17:05, Jen Kris wrote:
>
>> Thank you MRAB for your reply.
>>
>> Regarding your first question, pSentence is a list.  In the nltk library, 
>> nltk.word_tokenize takes a string, so we convert sentence to string before 
>> we call nltk.word_tokenize:
>>
>> >>> sentence = " ".join(sentence)
>> >>> pt = nltk.word_tokenize(sentence)
>> >>> print(sentence)
>> [ Emma by Jane Austen 1816 ]
>>
>> But with the C API it looks like this:
>>
>> PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
>> PyObject* str_sentence = PyObject_Str(pSentence); // Convert to string
>>
>> // See what str_sentence looks like:
>> PyObject* repr_str = PyObject_Repr(str_sentence);
>> PyObject* str_str = PyUnicode_AsEncodedString(repr_str, "utf-8", "strict");
>> const char *bytes_str = PyBytes_AS_STRING(str_str);
>> printf("REPR_String: %s\n", bytes_str);
>>
>> REPR_String: "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>>
>> So the two string representations are not the same, or at least the
>> PyUnicode_AsEncodedString output is not, as each item is surrounded by
>> single quotes.
>>
>> Assuming that the conversion to bytes object for the REPR is an accurate 
>> representation of str_sentence, it looks like I need to strip the quotes 
>> from str_sentence before “PyObject* pWTok = 
>> PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0).”
>>
>> So my questions now are (1) is there a C API function that will convert a 
>> list to a string exactly the same way as ‘’.join, and if not then (2) how 
>> can I strip characters from a string object in the C API?
>>
> Your Python code is joining the list with a space as the separator.
>
> The equivalent using the C API is:
>
>     PyObject* separator;
>     PyObject* joined;
>
>     separator = PyUnicode_FromString(" ");
>     joined = PyUnicode_Join(separator, pSentence);
>     Py_DECREF(separator);
>
>>
>> Mar 6, 2022, 17:42 by pyt...@mrabarnett.plus.com:
>>
>>  On 2022-03-07 00:32, Jen Kris via Python-list wrote:
>>
>>  I am using the C API in Python 3.8 with the nltk library, and
>>  I have a problem with the return from a library call
>>  implemented with PyObject_CallFunctionObjArgs.
>>
>>  This is the relevant Python code:
>>
>>  import nltk
>>  from nltk.corpus import gutenberg
>>  fileids = gutenberg.fileids()
>>  sentences = gutenberg.sents(fileids[0])
>>  sentence = sentences[0]
>>  sentence = " ".join(sentence)
>>  pt = nltk.word_tokenize(sentence)
>>
>>  I run this at the Python command prompt to show how it works:
>>
>>  >>> sentence = " ".join(sentence)
>>  >>> pt = nltk.word_tokenize(sentence)
>>  >>> print(pt)
>>  ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>  >>> type(pt)
>>  <class 'list'>
>>
>>  This is the relevant part of the C API code:
>>
>>  PyObject* str_sentence = PyObject_Str(pSentence);
>>  // nltk.word_tokenize(sentence)
>>  PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr,
>>  "word_tokenize");
>>  PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok,
>>  str_sentence, 0);
>>
>>  (where pModule_mstr is the nltk library).
>>
>>  That should produce a list with a length of 7 that looks like
>>  it does on the command line version shown above:
>>
>>  ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>>  But instead the C API produces a list with a length of 24, and
>>  the REPR looks like this:
>>
>>  '[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\',
>>  "\'by", "\'", \',\', "\'Jane", "\'", \',\', "\'Austen", "\'",
>>  \',\', "\'1816", "\'", \',\', "\'", \']\', "\'", \']\']'
>>
>>  I also tried this with PyObject_CallMethodObjArgs and
>>  PyObject_Call without success.
>>
>>  Thanks for any help on this.
>>
>>  What is pSentence? Is it what you think it is?
>>  To me it looks like it's either the list:
>>
>>  ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>>  or that list as a string:
>>
>>  "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>>
>>  and that's what you're tokenising.
>>  -- https://mail.python.org/mailman/listinfo/python-list
>>
> -- 
> https://mail.python.org/mailman/listinfo/python-list
>

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: C API PyObject_CallFunctionObjArgs returns incorrect result

2022-03-07 Thread MRAB

On 2022-03-07 17:05, Jen Kris wrote:

Thank you MRAB for your reply.

Regarding your first question, pSentence is a list.  In the nltk 
library, nltk.word_tokenize takes a string, so we convert sentence to 
string before we call nltk.word_tokenize:


>>> sentence = " ".join(sentence)
>>> pt = nltk.word_tokenize(sentence)
>>> print(sentence)
[ Emma by Jane Austen 1816 ]

But with the C API it looks like this:

PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
PyObject* str_sentence = PyObject_Str(pSentence); // Convert to string

// See what str_sentence looks like:
PyObject* repr_str = PyObject_Repr(str_sentence);
PyObject* str_str = PyUnicode_AsEncodedString(repr_str, "utf-8", "strict");
const char *bytes_str = PyBytes_AS_STRING(str_str);
printf("REPR_String: %s\n", bytes_str);

REPR_String: "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

So the two string representations are not the same, or at least the
PyUnicode_AsEncodedString output is not, as each item is surrounded
by single quotes.


Assuming that the conversion to bytes object for the REPR is an 
accurate representation of str_sentence, it looks like I need to strip 
the quotes from str_sentence before “PyObject* pWTok = 
PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0).”


So my questions now are (1) is there a C API function that will 
convert a list to a string exactly the same way as ‘’.join, and if not 
then (2) how can I strip characters from a string object in the C API?



Your Python code is joining the list with a space as the separator.

The equivalent using the C API is:

    PyObject* separator;
    PyObject* joined;

    separator = PyUnicode_FromString(" ");
    joined = PyUnicode_Join(separator, pSentence);
    Py_DECREF(separator);



Mar 6, 2022, 17:42 by pyt...@mrabarnett.plus.com:

On 2022-03-07 00:32, Jen Kris via Python-list wrote:

I am using the C API in Python 3.8 with the nltk library, and
I have a problem with the return from a library call
implemented with PyObject_CallFunctionObjArgs.

This is the relevant Python code:

import nltk
from nltk.corpus import gutenberg
fileids = gutenberg.fileids()
sentences = gutenberg.sents(fileids[0])
sentence = sentences[0]
sentence = " ".join(sentence)
pt = nltk.word_tokenize(sentence)

I run this at the Python command prompt to show how it works:

>>> sentence = " ".join(sentence)
>>> pt = nltk.word_tokenize(sentence)
>>> print(pt)
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>> type(pt)
<class 'list'>


This is the relevant part of the C API code:

PyObject* str_sentence = PyObject_Str(pSentence);
// nltk.word_tokenize(sentence)
PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr,
"word_tokenize");
PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok,
str_sentence, 0);

(where pModule_mstr is the nltk library).

That should produce a list with a length of 7 that looks like
it does on the command line version shown above:

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

But instead the C API produces a list with a length of 24, and
the REPR looks like this:

'[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\',
"\'by", "\'", \',\', "\'Jane", "\'", \',\', "\'Austen", "\'",
\',\', "\'1816", "\'", \',\', "\'", \']\', "\'", \']\']'

I also tried this with PyObject_CallMethodObjArgs and
PyObject_Call without success.

Thanks for any help on this.

What is pSentence? Is it what you think it is?
To me it looks like it's either the list:

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

or that list as a string:

"['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

and that's what you're tokenising.


Re: C API PyObject_CallFunctionObjArgs returns incorrect result

2022-03-07 Thread Chris Angelico
On Tue, 8 Mar 2022 at 04:13, Jen Kris  wrote:
>
>
> The PyObject str_sentence is a string representation of a list.  I need to 
> convert the list to a string like "".join because that's what the library 
> call takes.
>

What you're doing is the equivalent of str(sentence), not
"".join(sentence). Since the join method is part of the string
protocol, you'll find it here:

https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_Join
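
To make the difference concrete, here is a small Python sketch. The list 
literal is simply the sentence quoted in this thread; nothing here calls nltk, 
and the comments naming C functions are the rough Python-level equivalents.

```python
# The first Gutenberg sentence from this thread, as a list of strings.
sentence = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

# str() returns the printable form of the list itself, brackets and quotes
# included. PyObject_Str(pSentence) does the same thing at the C level.
as_str = str(sentence)

# " ".join() concatenates the items with a single space between them.
# PyUnicode_Join(separator, pSentence) does the same thing at the C level.
joined = " ".join(sentence)

print(as_str)   # ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
print(joined)   # [ Emma by Jane Austen 1816 ]
```

Only the second string is running text that a tokenizer can sensibly split 
into the original seven items.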

ChrisA


Re: C API PyObject_CallFunctionObjArgs returns incorrect result

2022-03-07 Thread Jen Kris via Python-list

The PyObject str_sentence is a string representation of a list.  I need to 
convert the list to a string like "".join because that's what the library call 
takes.  


Mar 7, 2022, 09:09 by ros...@gmail.com:

> On Tue, 8 Mar 2022 at 04:06, Jen Kris via Python-list
>  wrote:
>
>> But with the C API it looks like this:
>>
>> PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
>> PyObject* str_sentence = PyObject_Str(pSentence);  // Convert to string
>>
>> PyObject* repr_str = PyObject_Repr(str_sentence);
>>
>
> You convert it to a string, then take the representation of that. Is
> that what you intended?
>
> ChrisA


Re: C API PyObject_CallFunctionObjArgs returns incorrect result

2022-03-07 Thread Chris Angelico
On Tue, 8 Mar 2022 at 04:06, Jen Kris via Python-list
 wrote:
> But with the C API it looks like this:
>
> PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
> PyObject* str_sentence = PyObject_Str(pSentence);  // Convert to string
>
> PyObject* repr_str = PyObject_Repr(str_sentence);

You convert it to a string, then take the representation of that. Is
that what you intended?
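
A short Python sketch of that layering, assuming pSentence is the list shown 
elsewhere in this thread (str() and repr() here are the Python-level 
counterparts of PyObject_Str and PyObject_Repr):

```python
sentence = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

as_str = str(sentence)    # like PyObject_Str(pSentence): the list's printable form
as_repr = repr(as_str)    # like PyObject_Repr(str_sentence): that string, quoted again

# repr() of a str adds an outer pair of quotes. Because the text already
# contains single quotes (and no double quotes), Python picks double quotes.
print(as_str)    # ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
print(as_repr)   # "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
```

So the outer double quotes in the REPR_String output come from the extra 
repr() step, while the single quotes around each item come from str() of the 
list.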

ChrisA


Re: C API PyObject_CallFunctionObjArgs returns incorrect result

2022-03-07 Thread Jen Kris via Python-list
Thank you MRAB for your reply.

Regarding your first question, pSentence is a list.  In the nltk library, 
nltk.word_tokenize takes a string, so we convert sentence to string before we 
call nltk.word_tokenize:

>>> sentence = " ".join(sentence)
>>> pt = nltk.word_tokenize(sentence)
>>> print(sentence)
[ Emma by Jane Austen 1816 ]

But with the C API it looks like this:

PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
PyObject* str_sentence = PyObject_Str(pSentence);  // Convert to string

// See what str_sentence looks like:
PyObject* repr_str = PyObject_Repr(str_sentence);
PyObject* str_str = PyUnicode_AsEncodedString(repr_str, "utf-8", "strict");
const char *bytes_str = PyBytes_AS_STRING(str_str);
printf("REPR_String: %s\n", bytes_str); 

REPR_String: "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

So the two string representations are not the same, or at least the
PyUnicode_AsEncodedString output is not, as each item is surrounded by single
quotes.

Assuming that the conversion to bytes object for the REPR is an accurate 
representation of str_sentence, it looks like I need to strip the quotes from 
str_sentence before “PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, 
str_sentence, 0).”   

So my questions now are (1) is there a C API function that will convert a list 
to a string exactly the same way as ‘’.join, and if not then (2) how can I 
strip characters from a string object in the C API? 

Thanks.



Mar 6, 2022, 17:42 by pyt...@mrabarnett.plus.com:

> On 2022-03-07 00:32, Jen Kris via Python-list wrote:
>
>> I am using the C API in Python 3.8 with the nltk library, and I have a 
>> problem with the return from a library call implemented with 
>> PyObject_CallFunctionObjArgs.
>>
>> This is the relevant Python code:
>>
>> import nltk
>> from nltk.corpus import gutenberg
>> fileids = gutenberg.fileids()
>> sentences = gutenberg.sents(fileids[0])
>> sentence = sentences[0]
>> sentence = " ".join(sentence)
>> pt = nltk.word_tokenize(sentence)
>>
>> I run this at the Python command prompt to show how it works:
>>
>> >>> sentence = " ".join(sentence)
>> >>> pt = nltk.word_tokenize(sentence)
>> >>> print(pt)
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>> >>> type(pt)
>> <class 'list'>
>>
>> This is the relevant part of the C API code:
>>
>> PyObject* str_sentence = PyObject_Str(pSentence);
>> // nltk.word_tokenize(sentence)
>> PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr, "word_tokenize");
>> PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0);
>>
>> (where pModule_mstr is the nltk library).
>>
>> That should produce a list with a length of 7 that looks like it does on the 
>> command line version shown above:
>>
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>> But instead the C API produces a list with a length of 24, and the REPR 
>> looks like this:
>>
>> '[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\', "\'by", "\'", 
>> \',\', "\'Jane", "\'", \',\', "\'Austen", "\'", \',\', "\'1816", "\'", 
>> \',\', "\'", \']\', "\'", \']\']'
>>
>> I also tried this with PyObject_CallMethodObjArgs and PyObject_Call without 
>> success.
>>
>> Thanks for any help on this.
>>
> What is pSentence? Is it what you think it is?
> To me it looks like it's either the list:
>
>  ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>
> or that list as a string:
>
>  "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>
> and that's what you're tokenising.


Re: C API PyObject_CallFunctionObjArgs returns incorrect result

2022-03-06 Thread MRAB

On 2022-03-07 00:32, Jen Kris via Python-list wrote:

I am using the C API in Python 3.8 with the nltk library, and I have a problem 
with the return from a library call implemented with 
PyObject_CallFunctionObjArgs.

This is the relevant Python code:

import nltk
from nltk.corpus import gutenberg
fileids = gutenberg.fileids()
sentences = gutenberg.sents(fileids[0])
sentence = sentences[0]
sentence = " ".join(sentence)
pt = nltk.word_tokenize(sentence)

I run this at the Python command prompt to show how it works:

>>> sentence = " ".join(sentence)
>>> pt = nltk.word_tokenize(sentence)
>>> print(pt)
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>> type(pt)
<class 'list'>


This is the relevant part of the C API code:

PyObject* str_sentence = PyObject_Str(pSentence);
// nltk.word_tokenize(sentence)
PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr, "word_tokenize");
PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0);

(where pModule_mstr is the nltk library).

That should produce a list with a length of 7 that looks like it does on the 
command line version shown above:

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

But instead the C API produces a list with a length of 24, and the REPR looks 
like this:

'[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\', "\'by", "\'", \',\', "\'Jane", "\'", \',\', "\'Austen", "\'", 
\',\', "\'1816", "\'", \',\', "\'", \']\', "\'", \']\']'

I also tried this with PyObject_CallMethodObjArgs and PyObject_Call without 
success.

Thanks for any help on this.


What is pSentence? Is it what you think it is?
To me it looks like it's either the list:

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

or that list as a string:

"['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

and that's what you're tokenising.


C API PyObject_CallFunctionObjArgs returns incorrect result

2022-03-06 Thread Jen Kris via Python-list
I am using the C API in Python 3.8 with the nltk library, and I have a problem 
with the return from a library call implemented with 
PyObject_CallFunctionObjArgs.  

This is the relevant Python code:

import nltk
from nltk.corpus import gutenberg
fileids = gutenberg.fileids()
sentences = gutenberg.sents(fileids[0])
sentence = sentences[0]
sentence = " ".join(sentence)
pt = nltk.word_tokenize(sentence)

I run this at the Python command prompt to show how it works:
>>> sentence = " ".join(sentence)
>>> pt = nltk.word_tokenize(sentence)
>>> print(pt)
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>> type(pt)
<class 'list'>

This is the relevant part of the C API code:

PyObject* str_sentence = PyObject_Str(pSentence);  
// nltk.word_tokenize(sentence)  
PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr, "word_tokenize");
PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0);

(where pModule_mstr is the nltk library). 

That should produce a list with a length of 7 that looks like it does on the 
command line version shown above:

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

But instead the C API produces a list with a length of 24, and the REPR looks 
like this:

'[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\', "\'by", "\'", \',\', 
"\'Jane", "\'", \',\', "\'Austen", "\'", \',\', "\'1816", "\'", \',\', "\'", 
\']\', "\'", \']\']'

I also tried this with PyObject_CallMethodObjArgs and PyObject_Call without 
success. 

Thanks for any help on this. 

Jen
