> So my question is: supposing all files are in a directory and I read them
> using sc.textFile("path/*"), how can I tell which file each record came
> from?

Maybe the input_file_name() function helps you:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@input_file_name():org.apache.spark.sql.Column
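A rough sketch of how it could be used with the DataFrame API (the directory path and the whitespace-based tokenizer here are assumptions taken from your description, not tested against your data):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, input_file_name, split}

object FileNameWordPairs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("FileNameWordPairs").getOrCreate()

    // Read all text files in the directory as one DataFrame; each row is a line.
    val lines = spark.read.textFile("path/*").toDF("line")

    // input_file_name() returns the full path of the file each row was read from,
    // so the file name can travel with every word derived from that row.
    val pairs = lines
      .withColumn("fileName", input_file_name())
      .withColumn("word", explode(split(col("line"), "\\s+")))
      .select("fileName", "word")

    pairs.show(truncate = false)
    spark.stop()
  }
}
```

If you want to stay on the RDD side, sc.wholeTextFiles("path") returns (path, content) pairs directly, though it loads each file as a single record, so it fits many small files better than a few large ones.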

On Mon, Sep 24, 2018 at 2:54 PM Soheil Pourbafrani <[email protected]>
wrote:

> Hi, my text data is stored as plain text files. In the processing logic, I
> need to know which file each word came from. Actually, I need to tokenize
> the words and create <fileName, word> pairs. The naive solution is to call
> sc.textFile for each file, keep the fileName in a variable, and create the
> pairs, but that isn't efficient and I got a StackOverflowError as the
> dataset grew.
>
> So my question is: supposing all files are in a directory and I read them
> using sc.textFile("path/*"), how can I tell which file each record came
> from?
>
> Is it possible (and needed) to customize the textFile method?
>


-- 

Maxim Gekk

Technical Solutions Lead

Databricks Inc.

[email protected]

databricks.com

