You can create your own data source that does exactly this. Why is the file name important if the file content is the same?
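
If a custom data source is more than you need, a lighter option may be sc.wholeTextFiles, which returns one (file path, file content) record per file. The sketch below is only an illustration under assumptions: it uses PySpark, the helper names file_word_pairs and build_pairs are hypothetical, words are split with a simple \w+ regex, and only the base name of each path is kept.

```python
import os
import re


def file_word_pairs(path, content):
    """Turn one (path, content) record into a list of (fileName, word) pairs."""
    # wholeTextFiles reports paths as URIs such as "file:/data/a.txt";
    # keep only the last path component as the file name.
    name = os.path.basename(path)
    # Assumption: a "word" is any run of letters, digits or underscores.
    return [(name, word) for word in re.findall(r"\w+", content)]


def build_pairs(sc, pattern):
    """Hypothetical helper: sc is an existing SparkContext, pattern e.g. "path/*"."""
    # One record per file, then one output pair per word in that file.
    return sc.wholeTextFiles(pattern).flatMap(
        lambda kv: file_word_pairs(kv[0], kv[1])
    )
```

Calling build_pairs(sc, "path/*") would give a single RDD of pairs for the whole directory, so there is no need to call sc.textFile once per file. Note that wholeTextFiles loads each file as one record, so it fits many small files better than a few very large ones. For large files, reading with spark.read.text and adding a column from pyspark.sql.functions.input_file_name() is probably the better fit.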
> On 24. Sep 2018, at 13:53, Soheil Pourbafrani <[email protected]> wrote:
>
> Hi, My text data are in the form of text file. In the processing logic, I
> need to know each word is from which file. Actually, I need to tokenize the
> words and create the pair of <fileName, word>. The naive solution is to call
> sc.textFile for each file and having the fileName in a variable, create the
> pairs, but it's not efficient and I got the StackOverflow error as dataset
> grew.
>
> So my question is supposing all files are in a directory and I read then
> using sc.textFile("path/*"), how can I understand each data is for which file?
>
> Is it possible (and needed) to customize the textFile method?
