Re: C API PyObject_CallFunctionObjArgs returns incorrect result
Thanks to MRAB and Chris Angelico for your help. Here is how I implemented the string conversion. It now works correctly for a library call that needs a list converted to a string (error handling not shown):

    PyObject* str_sentence = PyObject_Str(pSentence);  // left over from the earlier attempt; not used below
    PyObject* separator = PyUnicode_FromString(" ");
    PyObject* str_join = PyUnicode_Join(separator, pSentence);
    Py_DECREF(separator);
    PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr, "word_tokenize");
    PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_join, NULL);

That produces what I need (this is the REPR of pWTok):

    "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

Thanks again to both of you.

Jen

Mar 7, 2022, 11:03 by pyt...@mrabarnett.plus.com:

> Your Python code is joining the list with a space as the separator.
>
> The equivalent using the C API is:
>
>     PyObject* separator;
>     PyObject* joined;
>
>     separator = PyUnicode_FromString(" ");
>     joined = PyUnicode_Join(separator, pSentence);
>     Py_DECREF(separator);

--
https://mail.python.org/mailman/listinfo/python-list
Re: C API PyObject_CallFunctionObjArgs returns incorrect result
On 2022-03-07 17:05, Jen Kris wrote:

> So my questions now are (1) is there a C API function that will convert a
> list to a string exactly the same way as ''.join, and if not then (2) how
> can I strip characters from a string object in the C API?

Your Python code is joining the list with a space as the separator.

The equivalent using the C API is:

    PyObject* separator;
    PyObject* joined;

    separator = PyUnicode_FromString(" ");
    joined = PyUnicode_Join(separator, pSentence);
    Py_DECREF(separator);
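To see the difference between the two C API calls discussed in this thread without compiling anything, here is a minimal sketch that calls PyObject_Str and PyUnicode_Join from Python through ctypes.pythonapi. It assumes CPython (ctypes.pythonapi is CPython-specific), and the names api, words, via_str and via_join are only illustrative.

```python
import ctypes

# ctypes.pythonapi exposes the C API of the running CPython interpreter.
api = ctypes.pythonapi

# Declare the C signatures so ctypes passes and returns PyObject* values.
# PyObject* PyObject_Str(PyObject *o)
api.PyObject_Str.argtypes = [ctypes.py_object]
api.PyObject_Str.restype = ctypes.py_object
# PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
api.PyUnicode_Join.argtypes = [ctypes.py_object, ctypes.py_object]
api.PyUnicode_Join.restype = ctypes.py_object

# The token list from the thread.
words = ["[", "Emma", "by", "Jane", "Austen", "1816", "]"]

# Same result as str(words): the printable form of the list itself.
via_str = api.PyObject_Str(words)

# Same result as " ".join(words): the elements separated by a space.
via_join = api.PyUnicode_Join(" ", words)

print(via_str)
print(via_join)
```

Only the second result is the plain sentence that nltk.word_tokenize expects; the first still carries the list's brackets, commas and quotes.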
Re: C API PyObject_CallFunctionObjArgs returns incorrect result
On Tue, 8 Mar 2022 at 04:13, Jen Kris wrote:

> The PyObject str_sentence is a string representation of a list. I need to
> convert the list to a string like "".join because that's what the library
> call takes.

What you're doing is the equivalent of str(sentence), not "".join(sentence). Since the join method is part of the string protocol, you'll find it here:

https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_Join

ChrisA
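The distinction between str(sentence) and a join can be checked at the Python level. This is a small sketch using the token list quoted elsewhere in the thread; the variable names are illustrative.

```python
# The token list from the thread: the first sentence of "Emma".
words = ["[", "Emma", "by", "Jane", "Austen", "1816", "]"]

# str() renders the list as a whole: brackets, commas, and a quoted
# repr of every element.
as_str = str(words)

# " ".join() concatenates the elements themselves, separated by a space.
joined = " ".join(words)

print(as_str)   # ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
print(joined)   # [ Emma by Jane Austen 1816 ]
```

Passing the first string to a tokenizer means tokenizing the punctuation of the list's printed form as well, which is consistent with the much longer result reported in the original question.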
Re: C API PyObject_CallFunctionObjArgs returns incorrect result
The PyObject str_sentence is a string representation of a list. I need to convert the list to a string like "".join because that's what the library call takes.

Mar 7, 2022, 09:09 by ros...@gmail.com:

> You convert it to a string, then take the representation of that. Is
> that what you intended?
>
> ChrisA
Re: C API PyObject_CallFunctionObjArgs returns incorrect result
On Tue, 8 Mar 2022 at 04:06, Jen Kris via Python-list wrote:

> But with the C API it looks like this:
>
>     PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
>     PyObject* str_sentence = PyObject_Str(pSentence); // Convert to string
>
>     PyObject* repr_str = PyObject_Repr(str_sentence);

You convert it to a string, then take the representation of that. Is that what you intended?

ChrisA
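What "convert to a string, then take the representation of that" does can be sketched in Python, again with the thread's token list and illustrative variable names. The repr step only adds the outer pair of double quotes; the single quotes around each item are already part of str(list).

```python
words = ["[", "Emma", "by", "Jane", "Austen", "1816", "]"]

# Step 1: str() of the list (what PyObject_Str returns for a list).
as_str = str(words)

# Step 2: repr() of that string (what PyObject_Repr returns for a str).
# The text contains single quotes and no double quotes, so repr() wraps
# it in double quotes without escaping anything.
shown = repr(as_str)

print(shown)

# Stripping the outer delimiters gives back the str() text unchanged,
# single quotes around each item included.
inner = shown[1:-1]
quote_count = as_str.count("'")
```

So stripping quotes from the printed REPR would not turn str_sentence into the space-joined sentence; the list has to be joined instead.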
Re: C API PyObject_CallFunctionObjArgs returns incorrect result
Thank you MRAB for your reply.

Regarding your first question, pSentence is a list. In the nltk library, nltk.word_tokenize takes a string, so we convert sentence to a string before we call nltk.word_tokenize:

    >>> sentence = " ".join(sentence)
    >>> pt = nltk.word_tokenize(sentence)
    >>> print(sentence)
    [ Emma by Jane Austen 1816 ]

But with the C API it looks like this:

    PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
    PyObject* str_sentence = PyObject_Str(pSentence); // Convert to string

    // See what str_sentence looks like:
    PyObject* repr_str = PyObject_Repr(str_sentence);
    PyObject* str_str = PyUnicode_AsEncodedString(repr_str, "utf-8", "~E~");
    const char *bytes_str = PyBytes_AS_STRING(str_str);
    printf("REPR_String: %s\n", bytes_str);

    REPR_String: "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

So the two string representations are not the same, or at least the output of PyUnicode_AsEncodedString is not the same, as each item is surrounded by single quotes.

Assuming that the conversion to a bytes object for the REPR is an accurate representation of str_sentence, it looks like I need to strip the quotes from str_sentence before calling:

    PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, NULL);

So my questions now are (1) is there a C API function that will convert a list to a string exactly the same way as ''.join, and if not, then (2) how can I strip characters from a string object in the C API?

Thanks.

Mar 6, 2022, 17:42 by pyt...@mrabarnett.plus.com:

> What is pSentence? Is it what you think it is?
> To me it looks like it's either the list:
>
>     ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>
> or that list as a string:
>
>     "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>
> and that's what you're tokenising.
Re: C API PyObject_CallFunctionObjArgs returns incorrect result
On 2022-03-07 00:32, Jen Kris via Python-list wrote:

> This is the relevant part of the C API code:
>
>     PyObject* str_sentence = PyObject_Str(pSentence);
>
> But instead the C API produces a list with a length of 24

What is pSentence? Is it what you think it is?

To me it looks like it's either the list:

    ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

or that list as a string:

    "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

and that's what you're tokenising.
C API PyObject_CallFunctionObjArgs returns incorrect result
I am using the C API in Python 3.8 with the nltk library, and I have a problem with the return from a library call implemented with PyObject_CallFunctionObjArgs.

This is the relevant Python code:

    import nltk
    from nltk.corpus import gutenberg
    fileids = gutenberg.fileids()
    sentences = gutenberg.sents(fileids[0])
    sentence = sentences[0]
    sentence = " ".join(sentence)
    pt = nltk.word_tokenize(sentence)

I run this at the Python command prompt to show how it works:

    >>> sentence = " ".join(sentence)
    >>> pt = nltk.word_tokenize(sentence)
    >>> print(pt)
    ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
    >>> type(pt)
    <class 'list'>

This is the relevant part of the C API code:

    PyObject* str_sentence = PyObject_Str(pSentence);
    // nltk.word_tokenize(sentence)
    PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr, "word_tokenize");
    PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, NULL);

(where pModule_mstr is the nltk library).

That should produce a list with a length of 7 that looks like it does in the command-line version shown above:

    ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

But instead the C API produces a list with a length of 24, and the REPR looks like this:

    '[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\', "\'by", "\'", \',\', "\'Jane", "\'", \',\', "\'Austen", "\'", \',\', "\'1816", "\'", \',\', "\'", \']\', "\'", \']\']'

I also tried this with PyObject_CallMethodObjArgs and PyObject_Call without success.

Thanks for any help on this.

Jen