Hi, my input data is a set of text files. In the processing logic, I need
to know which file each word comes from. Specifically, I need to tokenize
the text and create <fileName, word> pairs. The naive solution is to call
sc.textFile for each file, keep the file name in a variable, and create
the pairs, but this is not efficient, and I got a StackOverflow error as
the dataset grew.
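
To make the naive approach concrete, here is a rough sketch of it, written in PySpark purely for illustration. The helper names (tokenize, pairs_for_line, naive_pairs) are hypothetical, the regex tokenizer is a placeholder, and combining the per-file RDDs with a chain of union calls is an assumption about how the pairs end up in one dataset.

```python
import os
import re


def tokenize(line):
    # Placeholder tokenizer (an assumption): runs of word characters.
    return re.findall(r"\w+", line)


def pairs_for_line(file_name, line):
    # Emit one (fileName, word) pair per token in the line.
    return [(file_name, word) for word in tokenize(line)]


def naive_pairs(sc, paths):
    """One sc.textFile call per file, unioned one RDD at a time.

    `sc` is expected to be a SparkContext; `paths` is a list of file paths.
    """
    result = None
    for path in paths:
        file_name = os.path.basename(path)
        # Bind file_name as a default argument so every closure keeps
        # the name of its own file rather than the last one in the loop.
        rdd = sc.textFile(path).flatMap(
            lambda line, name=file_name: pairs_for_line(name, line)
        )
        # Assumption: the per-file RDDs are merged with chained unions,
        # so the lineage grows by one level for every file.
        result = rdd if result is None else result.union(rdd)
    return result
```

With many files, this creates one RDD per file and a lineage that gets deeper with every file, which may be related to the error I am seeing.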
So my question is: supposing all the files are in one directory and I read
them using sc.textFile("path/*"), how can I tell which file each record
came from?
Is it possible (and necessary) to customize the textFile method?