Re: Why are there different parts in my CSV?
You can directly write to HBase with Spark. Here's an example of doing that: https://issues.apache.org/jira/browse/SPARK-944

Thanks
Best Regards
Re: Why are there different parts in my CSV?
http://stackoverflow.com/questions/23527941/how-to-write-to-csv-in-spark

Just read this... seems like it should be easily readable. Thanks!
Re: Why are there different parts in my CSV?
Simplest way would be to merge the output files at the end of your job, like:

hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

If you want to do it programmatically, then you can use the FileUtil.copyMerge API, like:

FileUtil.copyMerge(FileSystem of source (HDFS), /output-location, FileSystem of destination (HDFS), path to the merged file /merged-output, true (to delete the original dir), null)

Thanks
Best Regards
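For reference, the merge itself is nothing exotic: copyMerge just concatenates the part files in name order into one output file. A self-contained sketch of the same idea on the local filesystem (plain Java; PartMerger is a hypothetical class name, and the real FileUtil.copyMerge takes Hadoop FileSystem/Path/Configuration arguments instead):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class PartMerger {
    // Concatenate all part-* files in a directory, in name order, into a
    // single output file -- the same thing `hadoop fs -getmerge` and
    // FileUtil.copyMerge do for HDFS paths.
    public static void merge(Path partDir, Path outFile) throws IOException {
        List<Path> parts;
        try (Stream<Path> s = Files.list(partDir)) {
            parts = s.filter(p -> p.getFileName().toString().startsWith("part-"))
                     .sorted()
                     .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(outFile)) {
            for (Path p : parts) {
                Files.copy(p, out); // append this part's bytes to the output
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("batch-output");
        Files.writeString(dir.resolve("part-00000"), "hello,1\n");
        Files.writeString(dir.resolve("part-00001"), "world,2\n");
        Path merged = dir.resolve("merged.csv");
        merge(dir, merged);
        System.out.print(Files.readString(merged));
    }
}
```

Note the deleteSource flag in the real API: passing true removes the batch directory after merging, which is what keeps the output location from filling up with per-batch directories.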
Re: Why are there different parts in my CSV?
Hello Akhil, thank you for your continued help!

1) So, if I can write it programmatically after every batch, then technically I should be able to have just the csv files in one directory. However, can the /desired/output/file.txt be in HDFS? If it is only local, I am not sure it will help me for the use case I describe in 2). So can I do something like this: hadoop fs -getmerge /output/dir/on/hdfs /desired/dir/in/hdfs ?

2) Just to make sure I am going down the right path... my end use case is to use Hive or HBase to create a database from these csv files. Is there an easy way for Hive to read /user/test/ (many subdirectories, with one csv file in each) into a table?

Thank you!
Re: Why are there different parts in my CSV?
Thanks Akhil for the link. Is there a reason why a new directory is created for each batch? Is this a format that is easily readable by other applications such as Hive/Impala?
Re: Why are there different parts in my CSV?
Keep in mind that if you repartition to 1 partition, you are only using 1 task to write the output, and potentially only 1 task to compute some parent RDDs. You lose parallelism. The files-in-a-directory output scheme is standard for Hadoop, and for a reason. Therefore I would consider separating this concern and merging the files afterwards if you need to.
Re: Why are there different parts in my CSV?
Thanks Sean and Akhil! I will take out the repartition(1).

Please let me know if I understood this correctly. Spark Streaming writes data like this:

foo-1001.csv/part-x, part-x
foo-1002.csv/part-x, part-x

When I see this in Hue, the csv's appear to me as *directories*, but if I understand correctly, they will appear as csv *files* to other Hadoop ecosystem tools?

And, if I understand Tathagata's answer correctly, other Hadoop-based ecosystems, such as Hive, will be able to create a table based off the multiple foo-10x.csv directories?

Thank you, I really appreciate the help!
Re: Why are there different parts in my CSV?
Okay, got it, thanks for the help Sean!
Re: Why are there different parts in my CSV?
No, they appear as directories + files to everything. Lots of tools are used to taking an input that is a directory of part files, though. You can certainly point MR, Hive, etc. at a directory of these files.
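To make Sean's point concrete: a downstream consumer doesn't need one big file, it can treat the whole tree of batch directories as a single logical dataset, the way Hive or MapReduce do when pointed at the parent path. A plain-Java sketch (DatasetReader is a hypothetical name; the layout assumed is root/foo-1001.csv/part-00000 as described above):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class DatasetReader {
    // Walk the parent directory and read every part-* file underneath it,
    // in path order, as one combined list of records.
    public static List<String> readAll(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            List<Path> parts = files
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted()
                    .collect(Collectors.toList());
            List<String> lines = new ArrayList<>();
            for (Path p : parts) {
                lines.addAll(Files.readAllLines(p));
            }
            return lines;
        }
    }
}
```

This is essentially what pointing a Hive external table at the parent directory gives you for free, so merging is only worth the extra step if a consumer genuinely requires a single file.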
Re: Why are there different parts in my CSV?
Thanks Akhil for the suggestion, it is now only giving me one part- file. Is there any way I can just create a file rather than a directory? It doesn't seem like there is a saveAsTextFile option for JavaPairReceiverDStream.

Also, for the copy/merge API, how would I add that to my Spark job?

Thanks Akhil!

Best,
Su
Re: Why are there different parts in my CSV?
For a streaming application, for every batch it will create a new directory and put the data in it. If you don't want to have multiple part- files inside the directory, then you can do a repartition before the saveAs* call:

messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class, String.class, (Class) TextOutputFormat.class);

Thanks
Best Regards

On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com wrote:

Hello Everyone,

I am writing simple word counts to HDFS using:

messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class, String.class, (Class) TextOutputFormat.class);

1) However, every 2 seconds I get a new *directory* that is titled as a csv. So I'll have test.csv, which will be a directory that has two files inside of it called part-0 and part-1 (something like that). This obviously makes it very hard for me to read the data stored in the csv files. I am wondering if there is a better way to store the JavaPairReceiverDStream and JavaPairDStream?

2) I know there is a copy/merge Hadoop API for merging files... can this be done inside Java? I am not sure of the logic behind this API if I am using Spark Streaming, which is constantly making new files.

Thanks a lot for the help!
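If it helps to see why every batch gets its own "csv" directory: as far as I can tell, saveAsHadoopFiles builds each batch's output path as <prefix>-<batch time in ms>.<suffix>, so a 2-second batch interval produces a new directory every 2 seconds, and the part files live inside it. A sketch of that naming rule (an illustration of the convention, not Spark's source):

```java
public class BatchNaming {
    // Each batch is written to "<prefix>-<batchTimeMs>.<suffix>" -- a
    // *directory* holding part files, not a single csv file, which is why
    // "test.csv" shows up as a directory in Hue.
    public static String outputDirFor(String prefix, String suffix, long batchTimeMs) {
        return prefix + "-" + batchTimeMs + "." + suffix;
    }

    public static void main(String[] args) {
        System.out.println(outputDirFor("hdfs://user/ec2-user/wordcounts", "csv", 1423872000000L));
    }
}
```

So the ".csv" suffix only names the directory; the actual record data is in the part- files underneath it.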