Re: multiple hdfs folder & files input to PySpark

2015-05-15 Thread Oleg Ruchovets
Hello, I used the approach that you suggested:

    lines = sc.textFile("/input/lprs/2015_05_15/file4.csv, /input/lprs/2015_05_14/file3.csv, /input/lprs/2015_05_13/file2.csv, /input/lprs/2015_05_12/file1.csv")

but it doesn't work for me: py4j.protocol.Py4JJavaError: An error occurred
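One plausible cause, though the truncated traceback doesn't confirm it, is the whitespace after each comma: textFile does accept a comma-separated list of paths, but the string is handed to Hadoop's path parser as-is, so a space after a comma becomes part of the next path. A minimal sketch of the space-free form, reusing the file names quoted above:

    # Comma-separated paths work with textFile, but the string must not
    # contain spaces around the commas; whether that is the error here is
    # an assumption, since the traceback above is cut off.
    lines = sc.textFile(
        "/input/lprs/2015_05_15/file4.csv,"
        "/input/lprs/2015_05_14/file3.csv,"
        "/input/lprs/2015_05_13/file2.csv,"
        "/input/lprs/2015_05_12/file1.csv"
    )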

Re: multiple hdfs folder & files input to PySpark

2015-05-05 Thread Ai He
Hi Oleg, for 1, RDD#union will help. You can iterate over the folders and union the resulting RDDs along the way. For 2, it seems it won't work in a deterministic way, according to this discussion (http://stackoverflow.com/questions/24871044/in-spark-what-does-the-parameter-minpartitions-works-in-sparkcontex
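A minimal sketch of the union approach, assuming an existing SparkContext named sc; the folder names are illustrative, taken from the paths quoted elsewhere in this thread:

    from pyspark import SparkContext

    sc = SparkContext(appName="multi-folder-input")

    # Folder names are an assumption based on the paths in this thread.
    folders = [
        "/input/lprs/2015_05_12",
        "/input/lprs/2015_05_13",
        "/input/lprs/2015_05_14",
        "/input/lprs/2015_05_15",
    ]

    # Read each folder into its own RDD, then combine them.
    # SparkContext.union takes a list of RDDs in one call, which is
    # tidier than chaining pairwise RDD#union calls in a loop.
    rdds = [sc.textFile(folder) for folder in folders]
    lines = sc.union(rdds)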

multiple hdfs folder & files input to PySpark

2015-05-05 Thread Oleg Ruchovets
Hi, we are using PySpark 1.3 and the input is text files located on HDFS, in a set of folders each containing files like file1.txt, file2.txt, ... Question: 1) What is the way to provide multiple files as input to a PySpark job
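Besides the union approach suggested in the reply above, textFile also accepts Hadoop glob patterns, so a single call can match files across several folders. A minimal sketch, assuming the date-named layout quoted in the replies above (the pattern is illustrative):

    # A single wildcard path covers every matching folder and file;
    # the /input/lprs/2015_05_* layout is an assumption based on the
    # paths quoted earlier in this thread.
    lines = sc.textFile("/input/lprs/2015_05_*/file*.csv")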