> So my question is: supposing all files are in a directory and I read them
> using sc.textFile("path/*"), how can I tell which file each record came
> from?

Maybe the input_file_name() function will help you:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@input_file_name():org.apache.spark.sql.Column

On Mon, Sep 24, 2018 at 2:54 PM Soheil Pourbafrani <[email protected]> wrote:

> Hi, my text data are in the form of text files. In the processing logic, I
> need to know which file each word came from. Actually, I need to tokenize
> the words and create <fileName, word> pairs. The naive solution is to call
> sc.textFile for each file and, with the file name in a variable, create
> the pairs, but it's not efficient and I got a StackOverflowError as the
> dataset grew.
>
> So my question is: supposing all files are in a directory and I read them
> using sc.textFile("path/*"), how can I tell which file each record came
> from?
>
> Is it possible (and needed) to customize the textFile method?

--
Maxim Gekk
Technical Solutions Lead
Databricks Inc.
[email protected]
databricks.com
