Thanks Akhil for the link. Is there a reason why a new directory is created for each batch? Is this format easily readable by other applications such as Hive/Impala?
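To make question 2) below a bit more concrete, this is roughly what I was picturing inside the streaming job: write each batch, then immediately merge it into a single .csv file in one flat directory. This is just an untested sketch on my side, assuming the Spark 1.x Java API and the Hadoop 2.x FileUtil.copyMerge signature; the paths and the batch-<timestamp> naming are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.Time;
    import scala.Tuple2;

    // 'messages' is the JavaPairDStream<String, String> coming from the receiver
    messages.foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
        @Override
        public Void call(JavaPairRDD<String, String> rdd, Time time) throws Exception {
            // 1) write this batch as csv lines into a temporary per-batch directory (made-up path)
            String tmpDir = "/user/ec2-user/tmp/batch-" + time.milliseconds();
            rdd.map(new Function<Tuple2<String, String>, String>() {
                @Override
                public String call(Tuple2<String, String> kv) {
                    return kv._1() + "," + kv._2();
                }
            }).repartition(1).saveAsTextFile(tmpDir);

            // 2) collapse that directory into a single .csv file in one flat output directory,
            //    deleting the temporary directory afterwards (the 'true' flag)
            Configuration conf = new Configuration();
            FileSystem fs = new Path(tmpDir).getFileSystem(conf);
            FileUtil.copyMerge(fs, new Path(tmpDir), fs,
                    new Path("/user/ec2-user/csv/batch-" + time.milliseconds() + ".csv"),
                    true, conf, null);
            return null;
        }
    });

Does something like that look reasonable, or is there a cleaner way to end up with plain csv files in a single directory?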
On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> You can directly write to HBase with Spark. Here's an example for doing
> that: https://issues.apache.org/jira/browse/SPARK-944
>
> Thanks
> Best Regards
>
> On Sat, Feb 14, 2015 at 2:55 PM, Su She <suhsheka...@gmail.com> wrote:
>
>> Hello Akhil, thank you for your continued help!
>>
>> 1) So, if I can write it programmatically after every batch, then
>> technically I should be able to have just the csv files in one directory.
>> However, can the /desired/output/file.txt be in HDFS? If it is only local,
>> I am not sure it will help me for the use case I describe in 2).
>>
>> So can I do something like this: hadoop fs -getmerge /output/dir/on/hdfs
>> desired/dir/in/hdfs ?
>>
>> 2) Just to make sure I am going down the right path... my end use case is
>> to use Hive or HBase to create a database off these csv files. Is there an
>> easy way for Hive to read /user/test/many sub directories/with one csv file
>> in each into a table?
>>
>> Thank you!
>>
>> On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> Simplest way would be to merge the output files at the end of your job,
>>> like:
>>>
>>> hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>>>
>>> If you want to do it programmatically, then you can use the
>>> FileUtil.copyMerge API, like:
>>>
>>> FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
>>> FileSystem of destination (hdfs), path to the merged file /merged-output,
>>> true (to delete the original dir), null)
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhsheka...@gmail.com> wrote:
>>>
>>>> Thanks Akhil for the suggestion, it is now only giving me one
>>>> part-xxxx. Is there any way I can just create a file rather than a
>>>> directory? It doesn't seem like there is a plain saveAsTextFile option
>>>> for JavaPairReceiverDStream.
>>>>
>>>> Also, for the copy/merge API, how would I add that to my Spark job?
>>>>
>>>> Thanks Akhil!
>>>>
>>>> Best,
>>>>
>>>> Su
>>>>
>>>> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das <ak...@sigmoidanalytics.com>
>>>> wrote:
>>>>
>>>>> For a streaming application, every batch will create a new directory
>>>>> and put the data in it. If you don't want multiple part-xxxx files
>>>>> inside the directory, you can do a repartition before the saveAs* call:
>>>>>
>>>>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
>>>>> String.class, String.class, (Class) TextOutputFormat.class);
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Everyone,
>>>>>>
>>>>>> I am writing simple word counts to HDFS using
>>>>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
>>>>>> String.class, (Class) TextOutputFormat.class);
>>>>>>
>>>>>> 1) However, every 2 seconds I am getting a new *directory* that is
>>>>>> titled as a csv. So I'll have test.csv, which will be a directory that
>>>>>> has two files inside of it called part-00000 and part-00001 (something
>>>>>> like that). This obviously makes it very hard for me to read the data
>>>>>> stored in the csv files. I am wondering if there is a better way to
>>>>>> store the JavaPairReceiverDStream and JavaPairDStream?
>>>>>>
>>>>>> 2) I know there is a copy/merge Hadoop API for merging files... can
>>>>>> this be done inside Java? I am not sure how this API fits in if I am
>>>>>> using Spark Streaming, which is constantly creating new files.
>>>>>>
>>>>>> Thanks a lot for the help!
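
P.S. Akhil, regarding writing directly to HBase: is the idea for each batch roughly something like the snippet below? This is not taken from SPARK-944, just my guess at the usual TableOutputFormat route, and it is untested; the table name "wordcounts" and the column family/qualifier are made up, and it assumes hbase-site.xml is on the classpath and an HBase client around the 0.98 line (newer clients use Put.addColumn instead of Put.add).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    // given one batch's JavaPairRDD<String, String> of (word, count) pairs, e.g. inside foreachRDD
    static void writeBatchToHBase(JavaPairRDD<String, String> rdd) throws Exception {
        Configuration hbaseConf = HBaseConfiguration.create();         // picks up hbase-site.xml
        hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "wordcounts");   // made-up table name
        Job job = Job.getInstance(hbaseConf);
        job.setOutputFormatClass(TableOutputFormat.class);

        JavaPairRDD<ImmutableBytesWritable, Put> puts = rdd.mapToPair(
            new PairFunction<Tuple2<String, String>, ImmutableBytesWritable, Put>() {
                @Override
                public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<String, String> kv) {
                    Put put = new Put(Bytes.toBytes(kv._1()));              // row key = word
                    put.add(Bytes.toBytes("counts"), Bytes.toBytes("count"), // made-up family/qualifier
                            Bytes.toBytes(kv._2()));
                    return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
                }
            });

        puts.saveAsNewAPIHadoopDataset(job.getConfiguration());
    }

If the example in the JIRA does it differently, I am happy to follow that instead.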