Re: how to get file name of record being reading in spark
Can anybody suggest a different solution using inputFileName or input_file_name?

On Tue, May 31, 2016 at 11:43 PM, Vikash Kumar wrote:
> Thanks Ajay, but I already have the code below to generate the dataframes,
> so I wanted to achieve this by changing only the df. I thought
> inputFileName would work, but it's not working.
Re: how to get file name of record being reading in spark
Thanks Ajay, but I already have the code below to generate the dataframes, so I wanted to achieve this by changing only the df. I thought inputFileName would work, but it's not working.

    import org.apache.spark.sql.{DataFrame, SQLContext}

    // Build a comma-separated list of the input files whose names match the
    // configured naming convention.
    private def getPaths: String = {
      val regex = (conf.namingConvention + conf.extension).replace("?", ".?").replace("**", ".**?")
      val files = FileUtilities.getFiles(conf.filePath).filter(x => x.getName.matches(regex))
      println(s"${files.length} files matched:\n${files.map( x => "-- " + x.getName ).mkString("\n")}")
      files.map(_.getPath).mkString(",")
    }

    // Read all matched files into a single DataFrame through spark-csv.
    private def readTextFile(sqlContext: SQLContext): DataFrame = {
      System.out.println(s"Reading ${conf.filePath}")
      sqlContext.read
        .format("com.databricks.spark.csv")
        .option("delimiter", conf.delimiter.getOrElse(defaultDelimiter))
        .option("header", if (conf.hasHeader.getOrElse(defaultHasHeader)) "true" else "false")
        .option("quote", if (conf.textQualifier.getOrElse(defaultTextQualifier)) "\"" else null)
        .schema(conf.schema.toStruct)
        .load(getPaths)
    }

    println("Intaking text file(s)...")
    // This is the DataFrame I want to add the file name to.
    val df: DataFrame = readTextFile(sqlContext)

On Tue, May 31, 2016 at 11:26 PM, Ajay Chander wrote:
> Hi Vikash,
>
> These are my thoughts: read the input directory using wholeTextFiles(),
> which gives a paired RDD with the file path as key and the file content
> as value. Then you can apply a map function to read each line and append
> the key to the content.
>
> Thank you,
> Aj
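
One way to stay with dataframes without depending on inputFileName is to read each matched file separately, tag it with a literal column, and union the results. The following is only a minimal sketch under stated assumptions: it targets the Spark 1.6 DataFrame API with the spark-csv package on the classpath, and the names ReadWithFileName, readWithFileName, and the "file_name" column are hypothetical, not taken from the code in this thread. It also takes the paths as a Seq[String] rather than the comma-joined string that getPaths returns.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StructType

// Hypothetical helper, not part of the original code in this thread.
object ReadWithFileName {

  // Reads every path into its own DataFrame, appends the file's base name
  // as a constant "file_name" column, and unions the per-file DataFrames.
  def readWithFileName(
      sqlContext: SQLContext,
      paths: Seq[String],
      schema: StructType): DataFrame = {
    require(paths.nonEmpty, "at least one input path is required")

    val perFile: Seq[DataFrame] = paths.map { path =>
      // Keep only the last path segment, e.g. ABC_input_0528.txt.
      val fileName = path.substring(path.lastIndexOf('/') + 1)
      sqlContext.read
        .format("com.databricks.spark.csv")
        // The delimiter/header/quote options from readTextFile are omitted
        // here for brevity and would be added the same way.
        .schema(schema)
        .load(path)
        .withColumn("file_name", lit(fileName))
    }

    // unionAll is the Spark 1.6 name; every per-file DataFrame shares the
    // same schema, so the union is well defined.
    perFile.reduce(_ unionAll _)
  }
}
```

The trade-off is one read per file instead of a single load over all paths, which is usually acceptable for a modest number of files but may not suit directories with very many small files.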
Re: how to get file name of record being reading in spark
Hi Vikash,

These are my thoughts: read the input directory using wholeTextFiles(), which gives a paired RDD with the file path as key and the file content as value. Then you can apply a map function to read each line and append the key to the content.

Thank you,
Aj

On Tuesday, May 31, 2016, Vikash Kumar wrote:
> I have a requirement in which I need to read the input files from a
> directory and append the file name to each record in the output.
>
> e.g. I have a directory /input/files/ which has the following files:
> ABC_input_0528.txt
> ABC_input_0531.txt
>
> Suppose input file ABC_input_0528.txt contains
> 111,abc,234
> 222,xyz,456
>
> Suppose input file ABC_input_0531.txt contains
> 100,abc,299
> 200,xyz,499
>
> and I need to create one final output with the file name in each record,
> using dataframes. My output file should look like this:
> 111,abc,234,ABC_input_0528.txt
> 222,xyz,456,ABC_input_0528.txt
> 100,abc,299,ABC_input_0531.txt
> 200,xyz,499,ABC_input_0531.txt
>
> I am trying to use this inputFileName function but it is showing blank.
>
> https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/functions.html#inputFileName()
>
> Can anybody help me?
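
The wholeTextFiles() suggestion can be sketched roughly as follows. The per-file logic is written as a plain Scala helper so it can be tried without a cluster; the names AppendFileName, baseName, and appendFileName are hypothetical, and the Spark call is shown only as a comment that assumes a SparkContext named sc is in scope.

```scala
// Hypothetical helper: turns one (path, content) pair, as produced by
// SparkContext.wholeTextFiles, into records with the file name appended.
object AppendFileName {

  // Last path segment, e.g. "file:/input/files/a.txt" -> "a.txt".
  def baseName(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  // Splits the file content into lines, drops empty lines, and appends
  // ",<file name>" to each remaining line.
  def appendFileName(path: String, content: String): List[String] = {
    val name = baseName(path)
    content
      .split("\r?\n")
      .filter(_.nonEmpty)
      .map(line => s"$line,$name")
      .toList
  }
}

// With a SparkContext `sc` in scope (an assumption), the call would look
// roughly like this:
//
//   val records = sc.wholeTextFiles("/input/files/")
//     .flatMap { case (path, content) =>
//       AppendFileName.appendFileName(path, content)
//     }
```

Note that wholeTextFiles() loads each file as a single record, so this approach fits small files such as the ones in the example better than very large ones.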