Hi, I have a JSON file in the following structure:
+--------------------+-------------------+
|           full_text|                 id|
+--------------------+-------------------+

I want to tokenize each sentence into (word, id) pairs.

For example, given the record ("Hi, How are you?", id), I want to convert
the dataframe to:
hi, id
how, id
are, id
you, id
?, id
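
To make the target concrete, here is a rough plain-Python sketch of the per-record step I am after, without Spark. The names simple_tokenize and to_pairs are made up for illustration, and the regex tokenizer is only a stand-in I am assuming in place of word_tokenize so the snippet runs without NLTK. Its output may differ from NLTK's on other inputs.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize:
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def to_pairs(record):
    # record is (full_text, id); emit one (word, id) pair per token.
    full_text, rec_id = record
    return [(word, rec_id) for word in simple_tokenize(full_text)]

# 42 is a placeholder id, not real data.
for pair in to_pairs(("Hi, How are you?", 42)):
    print(pair)
```

Note that with this stand-in tokenizer the comma after "Hi" also comes out as its own token, paired with the same id.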

So I tried:

data.rdd.map(lambda data: (data[0], data[1]))\
   .flatMap(lambda row: (word_tokenize(row[0].lower()), row[1]))

but it converts the dataframe to:
[hi, how, are, you, ?]

How can I do the desired transformation?
