You can create your own data source that does exactly this.

Why is the file name important if the file content is the same?
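
If a full custom data source is more than you need, here is a minimal PySpark sketch of one alternative, under some assumptions: the files are small enough to be read whole, and splitting on whitespace is an acceptable tokenizer. It relies on sc.wholeTextFiles, which returns one (file path, file content) record per file. The helper names to_pairs and print_sample_pairs are hypothetical and only for illustration.

```python
def to_pairs(file_and_content):
    """Turn one (file_name, content) record into a list of (file_name, word) pairs."""
    file_name, content = file_and_content
    # Assumption: plain whitespace tokenization; swap in your own tokenizer.
    return [(file_name, word) for word in content.split()]


def print_sample_pairs(path="path/*", n=10):
    """Read every file matching `path` and print the first n (file_name, word) pairs."""
    # Imported here so that to_pairs can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    # wholeTextFiles yields one (file_path, file_content) record per file,
    # so the file name travels with the data instead of living in a driver variable.
    pairs = sc.wholeTextFiles(path).flatMap(to_pairs)
    for file_name, word in pairs.take(n):
        print(file_name, word)
```

Calling print_sample_pairs("path/*") runs the job. Because wholeTextFiles loads each file as a single record, it is a poor fit for very large files; in that case reading with spark.read.text and adding a column from pyspark.sql.functions.input_file_name() may be the better route.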

> On 24. Sep 2018, at 13:53, Soheil Pourbafrani wrote:
> 
> Hi, my data is stored in text files. In the processing logic, I need to 
> know which file each word comes from. Specifically, I need to tokenize the 
> words and create <fileName, word> pairs. The naive solution is to call 
> sc.textFile for each file, keep the fileName in a variable, and create the 
> pairs, but this is not efficient, and I got a StackOverflow error as the 
> dataset grew.
> 
> So my question is: supposing all files are in a directory and I read them 
> using sc.textFile("path/*"), how can I tell which file each record came from?
> 
> Is it possible (and needed) to customize the textFile method?
