Re: Why are there different parts in my CSV?

2015-02-14 Thread Akhil Das
You can directly write to HBase with Spark. Here's an example for doing
that: https://issues.apache.org/jira/browse/SPARK-944
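
For reference, here is a minimal sketch of that approach. Assumptions: Java 8
lambdas, an HBase 0.98-era client API, an already-created table named
"wordcounts" with column family "cf" (both names are hypothetical), and an
existing JavaPairRDD<String, Integer> of word counts; adjust to your schema.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    public class HBaseWriteSketch {
        // Writes (word, count) pairs into the hypothetical HBase table
        // "wordcounts", column family "cf", using the MapReduce OutputFormat.
        static void writeToHBase(JavaPairRDD<String, Integer> wordCounts) throws Exception {
            Configuration hbaseConf = HBaseConfiguration.create();
            hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "wordcounts");
            Job job = Job.getInstance(hbaseConf);
            job.setOutputFormatClass(TableOutputFormat.class);

            JavaPairRDD<ImmutableBytesWritable, Put> puts = wordCounts.mapToPair(t -> {
                // Row key = the word, with a single cell holding the count.
                Put put = new Put(Bytes.toBytes(t._1()));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"),
                        Bytes.toBytes(t._2().toString()));
                return new Tuple2<>(new ImmutableBytesWritable(), put);
            });
            puts.saveAsNewAPIHadoopDataset(job.getConfiguration());
        }
    }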

Thanks
Best Regards

On Sat, Feb 14, 2015 at 2:55 PM, Su She suhsheka...@gmail.com wrote:

 Hello Akhil, thank you for your continued help!

 1) So, if I can write it programmatically after every batch, then
 technically I should be able to have just the csv files in one directory.
 However, can the /desired/output/file.txt be in HDFS? If it is only local,
 I am not sure it will help me for the use case I describe in 2).

 So can I do something like this: hadoop fs -getmerge /output/dir/on/hdfs
 desired/dir/in/hdfs ?

 2) Just to make sure I am going down the right path... my end use case is to
 use Hive or HBase to create a database from these csv files. Is there an
 easy way for Hive to read /user/test/many sub directories/with one csv file
 in each into a table?

 Thank you!


 On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Simplest way would be to merge the output files at the end of your job
 like:

 hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

 If you want to do it programmatically, then you can use the
 FileUtil.copyMerge API, like:

 FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
 FileSystem of destination (hdfs), path to the merged file /merged-output,
 true (to delete the original dir), null)



 Thanks
 Best Regards

 On Sat, Feb 14, 2015 at 2:18 AM, Su She suhsheka...@gmail.com wrote:

 Thanks Akhil for the suggestion, it is now only giving me one part file.
 Is there any way I can just create a file rather than a directory? It
 doesn't seem like there is a saveAsTextFile option for
 JavaPairRecieverDstream.

 Also, for the copy/merge api, how would I add that to my spark job?

 Thanks Akhil!

 Best,

 Su

 On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 For a streaming application, for every batch it will create a new
 directory and put the data in it. If you don't want to have multiple
 part- files inside the directory, then you can do a repartition before the
 saveAs* call.

 messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
 String.class, (Class) TextOutputFormat.class);


 Thanks
 Best Regards

 On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com wrote:

 Hello Everyone,

 I am writing simple word counts to HDFS using
 messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
 String.class, (Class) TextOutputFormat.class);

 1) However, every 2 seconds I'm getting a new directory that is titled as
 a csv. So I'll have test.csv, which will be a directory that has two files
 inside of it called part-0 and part-1 (something like that). This
 obviously makes it very hard for me to read the data stored in the csv
 files. I am wondering if there is a better way to store the
 JavaPairRecieverDStream and JavaPairDStream?

 2) I know there is a copy/merge Hadoop API for merging files... can this
 be done inside Java? I am not sure of the logic behind this API if I am
 using Spark Streaming, which is constantly making new files.

 Thanks a lot for the help!








Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
http://stackoverflow.com/questions/23527941/how-to-write-to-csv-in-spark

Just read this...seems like it should be easily readable. Thanks!


On Sat, Feb 14, 2015 at 1:36 AM, Su She suhsheka...@gmail.com wrote:

 Thanks Akhil for the link. Is there a reason why there is a new directory
 created for each batch? Is this a format that is easily readable by other
 applications such as hive/impala?


 On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 You can directly write to HBase with Spark. Here's an example for doing
 that: https://issues.apache.org/jira/browse/SPARK-944

 Thanks
 Best Regards

 On Sat, Feb 14, 2015 at 2:55 PM, Su She suhsheka...@gmail.com wrote:

 Hello Akhil, thank you for your continued help!

 1) So, if I can write it programmatically after every batch, then
 technically I should be able to have just the csv files in one directory.
 However, can the /desired/output/file.txt be in HDFS? If it is only local,
 I am not sure it will help me for the use case I describe in 2).

 So can I do something like this: hadoop fs -getmerge /output/dir/on/hdfs
 desired/dir/in/hdfs ?

 2) Just to make sure I am going down the right path... my end use case is to
 use Hive or HBase to create a database from these csv files. Is there an
 easy way for Hive to read /user/test/many sub directories/with one csv file
 in each into a table?

 Thank you!


 On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Simplest way would be to merge the output files at the end of your job
 like:

 hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

 If you want to do it programmatically, then you can use the
 FileUtil.copyMerge API, like:

 FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
 FileSystem of destination (hdfs), path to the merged file /merged-output,
 true (to delete the original dir), null)



 Thanks
 Best Regards

 On Sat, Feb 14, 2015 at 2:18 AM, Su She suhsheka...@gmail.com wrote:

 Thanks Akhil for the suggestion, it is now only giving me one part file.
 Is there any way I can just create a file rather than a directory? It
 doesn't seem like there is a saveAsTextFile option for
 JavaPairRecieverDstream.

 Also, for the copy/merge api, how would I add that to my spark job?

 Thanks Akhil!

 Best,

 Su

 On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das 
 ak...@sigmoidanalytics.com wrote:

  For a streaming application, for every batch it will create a new
  directory and put the data in it. If you don't want to have multiple
  part- files inside the directory, then you can do a repartition before the
  saveAs* call.

  messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
  String.class, (Class) TextOutputFormat.class);


 Thanks
 Best Regards

 On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com
 wrote:

 Hello Everyone,

  I am writing simple word counts to HDFS using
  messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
  String.class, (Class) TextOutputFormat.class);

  1) However, every 2 seconds I'm getting a new directory that is titled as
  a csv. So I'll have test.csv, which will be a directory that has two files
  inside of it called part-0 and part-1 (something like that). This
  obviously makes it very hard for me to read the data stored in the csv
  files. I am wondering if there is a better way to store the
  JavaPairRecieverDStream and JavaPairDStream?

  2) I know there is a copy/merge Hadoop API for merging files... can this
  be done inside Java? I am not sure of the logic behind this API if I am
  using Spark Streaming, which is constantly making new files.

 Thanks a lot for the help!










Re: Why are there different parts in my CSV?

2015-02-14 Thread Akhil Das
Simplest way would be to merge the output files at the end of your job like:

hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

If you want to do it programmatically, then you can use the
FileUtil.copyMerge API, like:

FileUtil.copyMerge(FileSystem of source (hdfs), /output-location, FileSystem
of destination (hdfs), path to the merged file /merged-output, true (to
delete the original dir), null)
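
For anyone wanting the exact call, a small sketch of how that might look from
Java. Assumptions: Hadoop 2.x, where copyMerge takes a seventh argument (a
string appended after each file, here null); the paths are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeOutput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Merges all files under the source directory into a single file,
            // deleting the source directory afterwards (the "true" flag).
            FileUtil.copyMerge(fs, new Path("/output/dir/on/hdfs"),
                               fs, new Path("/merged-output/file.csv"),
                               true, conf, null);
        }
    }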



Thanks
Best Regards

On Sat, Feb 14, 2015 at 2:18 AM, Su She suhsheka...@gmail.com wrote:

 Thanks Akhil for the suggestion, it is now only giving me one part file.
 Is there any way I can just create a file rather than a directory? It
 doesn't seem like there is a saveAsTextFile option for
 JavaPairRecieverDstream.

 Also, for the copy/merge api, how would I add that to my spark job?

 Thanks Akhil!

 Best,

 Su

 On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 For a streaming application, for every batch it will create a new directory
 and put the data in it. If you don't want to have multiple part- files
 inside the directory, then you can do a repartition before the saveAs*
 call.

 messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
 String.class, (Class) TextOutputFormat.class);


 Thanks
 Best Regards

 On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com wrote:

 Hello Everyone,

 I am writing simple word counts to HDFS using
 messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
 String.class, (Class) TextOutputFormat.class);

 1) However, every 2 seconds I'm getting a new directory that is titled as
 a csv. So I'll have test.csv, which will be a directory that has two files
 inside of it called part-0 and part-1 (something like that). This
 obviously makes it very hard for me to read the data stored in the csv
 files. I am wondering if there is a better way to store the
 JavaPairRecieverDStream and JavaPairDStream?

 2) I know there is a copy/merge Hadoop API for merging files... can this
 be done inside Java? I am not sure of the logic behind this API if I am
 using Spark Streaming, which is constantly making new files.

 Thanks a lot for the help!






Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Hello Akhil, thank you for your continued help!

1) So, if I can write it programmatically after every batch, then
technically I should be able to have just the csv files in one directory.
However, can the /desired/output/file.txt be in HDFS? If it is only local,
I am not sure it will help me for the use case I describe in 2).

So can I do something like this: hadoop fs -getmerge /output/dir/on/hdfs
desired/dir/in/hdfs ?

2) Just to make sure I am going down the right path... my end use case is to
use Hive or HBase to create a database from these csv files. Is there an
easy way for Hive to read /user/test/many sub directories/with one csv file
in each into a table?

Thank you!


On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Simplest way would be to merge the output files at the end of your job
 like:

 hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

 If you want to do it programmatically, then you can use the
 FileUtil.copyMerge API, like:

 FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
 FileSystem of destination (hdfs), path to the merged file /merged-output,
 true (to delete the original dir), null)



 Thanks
 Best Regards

 On Sat, Feb 14, 2015 at 2:18 AM, Su She suhsheka...@gmail.com wrote:

 Thanks Akhil for the suggestion, it is now only giving me one part file.
 Is there any way I can just create a file rather than a directory? It
 doesn't seem like there is a saveAsTextFile option for
 JavaPairRecieverDstream.

 Also, for the copy/merge api, how would I add that to my spark job?

 Thanks Akhil!

 Best,

 Su

 On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 For a streaming application, for every batch it will create a new
 directory and put the data in it. If you don't want to have multiple
 part- files inside the directory, then you can do a repartition before the
 saveAs* call.

 messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
 String.class, (Class) TextOutputFormat.class);


 Thanks
 Best Regards

 On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com wrote:

 Hello Everyone,

 I am writing simple word counts to HDFS using
 messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
 String.class, (Class) TextOutputFormat.class);

 1) However, every 2 seconds I'm getting a new directory that is titled as
 a csv. So I'll have test.csv, which will be a directory that has two files
 inside of it called part-0 and part-1 (something like that). This
 obviously makes it very hard for me to read the data stored in the csv
 files. I am wondering if there is a better way to store the
 JavaPairRecieverDStream and JavaPairDStream?

 2) I know there is a copy/merge Hadoop API for merging files... can this
 be done inside Java? I am not sure of the logic behind this API if I am
 using Spark Streaming, which is constantly making new files.

 Thanks a lot for the help!







Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Thanks Akhil for the link. Is there a reason why there is a new directory
created for each batch? Is this a format that is easily readable by other
applications such as hive/impala?


On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 You can directly write to HBase with Spark. Here's an example for doing
 that: https://issues.apache.org/jira/browse/SPARK-944

 Thanks
 Best Regards

 On Sat, Feb 14, 2015 at 2:55 PM, Su She suhsheka...@gmail.com wrote:

 Hello Akhil, thank you for your continued help!

 1) So, if I can write it programmatically after every batch, then
 technically I should be able to have just the csv files in one directory.
 However, can the /desired/output/file.txt be in HDFS? If it is only local,
 I am not sure it will help me for the use case I describe in 2).

 So can I do something like this: hadoop fs -getmerge /output/dir/on/hdfs
 desired/dir/in/hdfs ?

 2) Just to make sure I am going down the right path... my end use case is to
 use Hive or HBase to create a database from these csv files. Is there an
 easy way for Hive to read /user/test/many sub directories/with one csv file
 in each into a table?

 Thank you!


 On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Simplest way would be to merge the output files at the end of your job
 like:

 hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

 If you want to do it programmatically, then you can use the
 FileUtil.copyMerge API, like:

 FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
 FileSystem of destination (hdfs), path to the merged file /merged-output,
 true (to delete the original dir), null)



 Thanks
 Best Regards

 On Sat, Feb 14, 2015 at 2:18 AM, Su She suhsheka...@gmail.com wrote:

 Thanks Akhil for the suggestion, it is now only giving me one part file.
 Is there any way I can just create a file rather than a directory? It
 doesn't seem like there is a saveAsTextFile option for
 JavaPairRecieverDstream.

 Also, for the copy/merge api, how would I add that to my spark job?

 Thanks Akhil!

 Best,

 Su

 On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das ak...@sigmoidanalytics.com
  wrote:

  For a streaming application, for every batch it will create a new
  directory and put the data in it. If you don't want to have multiple
  part- files inside the directory, then you can do a repartition before the
  saveAs* call.

  messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
  String.class, (Class) TextOutputFormat.class);


 Thanks
 Best Regards

 On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com
 wrote:

 Hello Everyone,

  I am writing simple word counts to HDFS using
  messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
  String.class, (Class) TextOutputFormat.class);

  1) However, every 2 seconds I'm getting a new directory that is titled as
  a csv. So I'll have test.csv, which will be a directory that has two files
  inside of it called part-0 and part-1 (something like that). This
  obviously makes it very hard for me to read the data stored in the csv
  files. I am wondering if there is a better way to store the
  JavaPairRecieverDStream and JavaPairDStream?

  2) I know there is a copy/merge Hadoop API for merging files... can this
  be done inside Java? I am not sure of the logic behind this API if I am
  using Spark Streaming, which is constantly making new files.

 Thanks a lot for the help!









Re: Why are there different parts in my CSV?

2015-02-14 Thread Sean Owen
Keep in mind that if you repartition to 1 partition, you are only
using 1 task to write the output, and potentially only 1 task to
compute some parent RDDs. You lose parallelism.  The
files-in-a-directory output scheme is standard for Hadoop and for a
reason.

Therefore I would consider separating this concern and merging the
files afterwards if you need to.
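
As a concrete illustration of why the directory layout is usually fine, a
later Spark job can read all the part files back simply by pointing textFile
at the output directory. A minimal sketch; the class name and path are
illustrative:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadPartFiles {
        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("read-parts"));
            // textFile() on a directory reads every part file inside it.
            JavaRDD<String> lines = sc.textFile("hdfs:///user/ec2-user/foo-1001.csv");
            System.out.println("lines: " + lines.count());
            sc.stop();
        }
    }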

On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
 Simplest way would be to merge the output files at the end of your job like:

 hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

 If you want to do it programmatically, then you can use the
 FileUtil.copyMerge API, like:

 FileUtil.copyMerge(FileSystem of source (hdfs), /output-location, FileSystem
 of destination (hdfs), path to the merged file /merged-output, true (to delete
 the original dir), null)



 Thanks
 Best Regards

 On Sat, Feb 14, 2015 at 2:18 AM, Su She suhsheka...@gmail.com wrote:

 Thanks Akhil for the suggestion, it is now only giving me one part file.
 Is there any way I can just create a file rather than a directory? It doesn't
 seem like there is a saveAsTextFile option for JavaPairRecieverDstream.

 Also, for the copy/merge api, how would I add that to my spark job?

 Thanks Akhil!

 Best,

 Su

 On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 For a streaming application, for every batch it will create a new directory
 and put the data in it. If you don't want to have multiple part- files inside
 the directory, then you can do a repartition before the saveAs*
 call.

 messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
 String.class, (Class) TextOutputFormat.class);


 Thanks
 Best Regards

 On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com wrote:

 Hello Everyone,

 I am writing simple word counts to HDFS using
 messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
 String.class, (Class) TextOutputFormat.class);

 1) However, every 2 seconds I'm getting a new directory that is titled as a
 csv. So I'll have test.csv, which will be a directory that has two files
 inside of it called part-0 and part-1 (something like that). This
 obviously makes it very hard for me to read the data stored in the csv
 files. I am wondering if there is a better way to store the
 JavaPairRecieverDStream and JavaPairDStream?

 2) I know there is a copy/merge Hadoop API for merging files... can this
 be done inside Java? I am not sure of the logic behind this API if I am using
 Spark Streaming, which is constantly making new files.

 Thanks a lot for the help!








Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Thanks Sean and Akhil! I will take out the repartition(1). Please let me
know if I understood this correctly: Spark Streaming writes data like this:

foo-1001.csv/part-x, part-x
foo-1002.csv/part-x, part-x

When I see this on Hue, the csv's appear to me as *directories*, but if I
understand correctly, they will appear as csv *files* to other Hadoop
ecosystem tools? And, if I understand Tathagata's answer correctly, other
Hadoop-based ecosystems, such as Hive, will be able to create a table based
on the multiple foo-10x.csv directories?

Thank you, I really appreciate the help!

On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen so...@cloudera.com wrote:

 Keep in mind that if you repartition to 1 partition, you are only
 using 1 task to write the output, and potentially only 1 task to
 compute some parent RDDs. You lose parallelism.  The
 files-in-a-directory output scheme is standard for Hadoop and for a
 reason.

 Therefore I would consider separating this concern and merging the
 files afterwards if you need to.

 On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:
  Simplest way would be to merge the output files at the end of your job,
  like:

  hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

  If you want to do it programmatically, then you can use the
  FileUtil.copyMerge API, like:

  FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
  FileSystem of destination (hdfs), path to the merged file /merged-output,
  true (to delete the original dir), null)
 
 
 
  Thanks
  Best Regards
 
  On Sat, Feb 14, 2015 at 2:18 AM, Su She suhsheka...@gmail.com wrote:
 
  Thanks Akhil for the suggestion, it is now only giving me one part file.
  Is there any way I can just create a file rather than a directory? It
  doesn't seem like there is a saveAsTextFile option for
  JavaPairRecieverDstream.
 
  Also, for the copy/merge api, how would I add that to my spark job?
 
  Thanks Akhil!
 
  Best,
 
  Su
 
  On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das ak...@sigmoidanalytics.com
 
  wrote:
 
  For a streaming application, for every batch it will create a new
  directory and put the data in it. If you don't want to have multiple
  part- files inside the directory, then you can do a repartition before the
  saveAs* call.

  messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
  String.class, (Class) TextOutputFormat.class);
 
 
  Thanks
  Best Regards
 
  On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com
 wrote:
 
  Hello Everyone,
 
  I am writing simple word counts to HDFS using
  messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
  String.class, (Class) TextOutputFormat.class);

  1) However, every 2 seconds I'm getting a new directory that is titled as
  a csv. So I'll have test.csv, which will be a directory that has two files
  inside of it called part-0 and part-1 (something like that). This
  obviously makes it very hard for me to read the data stored in the csv
  files. I am wondering if there is a better way to store the
  JavaPairRecieverDStream and JavaPairDStream?

  2) I know there is a copy/merge Hadoop API for merging files... can this
  be done inside Java? I am not sure of the logic behind this API if I am
  using Spark Streaming, which is constantly making new files.
 
  Thanks a lot for the help!
 
 
 
 



Re: Why are there different parts in my CSV?

2015-02-14 Thread Su She
Okay, got it, thanks for the help Sean!


On Sat, Feb 14, 2015 at 1:08 PM, Sean Owen so...@cloudera.com wrote:

 No, they appear as directories + files to everything. Lots of tools
 are used to taking an input that is a directory of part files though.
 You can certainly point MR, Hive, etc at a directory of these files.

 On Sat, Feb 14, 2015 at 9:05 PM, Su She suhsheka...@gmail.com wrote:
  Thanks Sean and Akhil! I will take out the repartition(1). Please let me
  know if I understood this correctly: Spark Streaming writes data like
  this:

  foo-1001.csv/part-x, part-x
  foo-1002.csv/part-x, part-x

  When I see this on Hue, the csv's appear to me as directories, but if I
  understand correctly, they will appear as csv files to other Hadoop
  ecosystem tools? And, if I understand Tathagata's answer correctly, other
  Hadoop-based ecosystems, such as Hive, will be able to create a table
  based on the multiple foo-10x.csv directories?
 
  Thank you, I really appreciate the help!
 
  On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen so...@cloudera.com wrote:
 
  Keep in mind that if you repartition to 1 partition, you are only
  using 1 task to write the output, and potentially only 1 task to
  compute some parent RDDs. You lose parallelism.  The
  files-in-a-directory output scheme is standard for Hadoop and for a
  reason.
 
  Therefore I would consider separating this concern and merging the
  files afterwards if you need to.
 
  On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:
   Simplest way would be to merge the output files at the end of your job,
   like:

   hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

   If you want to do it programmatically, then you can use the
   FileUtil.copyMerge API, like:

   FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
   FileSystem of destination (hdfs), path to the merged file /merged-output,
   true (to delete the original dir), null)
  
  
  
   Thanks
   Best Regards
  
   On Sat, Feb 14, 2015 at 2:18 AM, Su She suhsheka...@gmail.com
 wrote:
  
   Thanks Akhil for the suggestion, it is now only giving me one part file.
   Is there any way I can just create a file rather than a directory? It
   doesn't seem like there is a saveAsTextFile option for
   JavaPairRecieverDstream.
  
   Also, for the copy/merge api, how would I add that to my spark job?
  
   Thanks Akhil!
  
   Best,
  
   Su
  
   On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das
   ak...@sigmoidanalytics.com
   wrote:
  
   For a streaming application, for every batch it will create a new
   directory and put the data in it. If you don't want to have multiple
   part- files inside the directory, then you can do a repartition before the
   saveAs* call.

   messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
   String.class, (Class) TextOutputFormat.class);
  
  
   Thanks
   Best Regards
  
   On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com
   wrote:
  
   Hello Everyone,
  
   I am writing simple word counts to HDFS using
   messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
   String.class, (Class) TextOutputFormat.class);

   1) However, every 2 seconds I'm getting a new directory that is titled as
   a csv. So I'll have test.csv, which will be a directory that has two files
   inside of it called part-0 and part-1 (something like that). This
   obviously makes it very hard for me to read the data stored in the csv
   files. I am wondering if there is a better way to store the
   JavaPairRecieverDStream and JavaPairDStream?

   2) I know there is a copy/merge Hadoop API for merging files... can this
   be done inside Java? I am not sure of the logic behind this API if I am
   using Spark Streaming, which is constantly making new files.
  
   Thanks a lot for the help!
  
  
  
  
 
 



Re: Why are there different parts in my CSV?

2015-02-14 Thread Sean Owen
No, they appear as directories + files to everything. Lots of tools
are used to taking an input that is a directory of part files though.
You can certainly point MR, Hive, etc at a directory of these files.
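
And if you want a single RDD over all of the per-batch directories at once, a
Hadoop glob should work too. A sketch only; the class name and path prefix are
illustrative, and I'm assuming the matched directories each contain the usual
part files:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadAllBatches {
        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("read-batches"));
            // The glob matches every foo-*.csv batch directory; the part files
            // inside each matched directory are read as one RDD of lines.
            JavaRDD<String> all = sc.textFile("hdfs:///user/ec2-user/foo-*.csv");
            System.out.println("total lines: " + all.count());
            sc.stop();
        }
    }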

On Sat, Feb 14, 2015 at 9:05 PM, Su She suhsheka...@gmail.com wrote:
 Thanks Sean and Akhil! I will take out the repartition(1). Please let me
 know if I understood this correctly: Spark Streaming writes data like this:

 foo-1001.csv/part-x, part-x
 foo-1002.csv/part-x, part-x

 When I see this on Hue, the csv's appear to me as directories, but if I
 understand correctly, they will appear as csv files to other Hadoop
 ecosystem tools? And, if I understand Tathagata's answer correctly, other
 Hadoop-based ecosystems, such as Hive, will be able to create a table based
 on the multiple foo-10x.csv directories?

 Thank you, I really appreciate the help!

 On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen so...@cloudera.com wrote:

 Keep in mind that if you repartition to 1 partition, you are only
 using 1 task to write the output, and potentially only 1 task to
 compute some parent RDDs. You lose parallelism.  The
 files-in-a-directory output scheme is standard for Hadoop and for a
 reason.

 Therefore I would consider separating this concern and merging the
 files afterwards if you need to.

 On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:
  Simplest way would be to merge the output files at the end of your job,
  like:

  hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

  If you want to do it programmatically, then you can use the
  FileUtil.copyMerge API, like:

  FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
  FileSystem of destination (hdfs), path to the merged file /merged-output,
  true (to delete the original dir), null)
 
 
 
  Thanks
  Best Regards
 
  On Sat, Feb 14, 2015 at 2:18 AM, Su She suhsheka...@gmail.com wrote:
 
  Thanks Akhil for the suggestion, it is now only giving me one part file.
  Is there any way I can just create a file rather than a directory? It
  doesn't seem like there is a saveAsTextFile option for
  JavaPairRecieverDstream.
 
  Also, for the copy/merge api, how would I add that to my spark job?
 
  Thanks Akhil!
 
  Best,
 
  Su
 
  On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das
  ak...@sigmoidanalytics.com
  wrote:
 
  For a streaming application, for every batch it will create a new
  directory and put the data in it. If you don't want to have multiple
  part- files inside the directory, then you can do a repartition before the
  saveAs* call.

  messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
  String.class, (Class) TextOutputFormat.class);
 
 
  Thanks
  Best Regards
 
  On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com
  wrote:
 
  Hello Everyone,
 
   I am writing simple word counts to HDFS using
   messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
   String.class, (Class) TextOutputFormat.class);

   1) However, every 2 seconds I'm getting a new directory that is titled as
   a csv. So I'll have test.csv, which will be a directory that has two files
   inside of it called part-0 and part-1 (something like that). This
   obviously makes it very hard for me to read the data stored in the csv
   files. I am wondering if there is a better way to store the
   JavaPairRecieverDStream and JavaPairDStream?

   2) I know there is a copy/merge Hadoop API for merging files... can this
   be done inside Java? I am not sure of the logic behind this API if I am
   using Spark Streaming, which is constantly making new files.
 
  Thanks a lot for the help!
 
 
 
 






Re: Why are there different parts in my CSV?

2015-02-13 Thread Su She
Thanks Akhil for the suggestion, it is now only giving me one part file.
Is there any way I can just create a file rather than a directory? It
doesn't seem like there is a saveAsTextFile option for
JavaPairRecieverDstream.

Also, for the copy/merge api, how would I add that to my spark job?

Thanks Akhil!

Best,

Su

On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 For a streaming application, for every batch it will create a new directory
 and put the data in it. If you don't want to have multiple part- files inside
 the directory, then you can do a repartition before the saveAs*
 call.

 messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
 String.class, (Class) TextOutputFormat.class);


 Thanks
 Best Regards

 On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com wrote:

 Hello Everyone,

 I am writing simple word counts to HDFS using
 messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
 String.class, (Class) TextOutputFormat.class);

 1) However, every 2 seconds I'm getting a new directory that is titled as
 a csv. So I'll have test.csv, which will be a directory that has two files
 inside of it called part-0 and part-1 (something like that). This
 obviously makes it very hard for me to read the data stored in the csv
 files. I am wondering if there is a better way to store the
 JavaPairRecieverDStream and JavaPairDStream?

 2) I know there is a copy/merge Hadoop API for merging files... can this
 be done inside Java? I am not sure of the logic behind this API if I am using
 Spark Streaming, which is constantly making new files.

 Thanks a lot for the help!





Re: Why are there different parts in my CSV?

2015-02-12 Thread Akhil Das
For a streaming application, for every batch it will create a new directory
and put the data in it. If you don't want to have multiple part- files inside
the directory, then you can do a repartition before the saveAs*
call.

messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
String.class, (Class) TextOutputFormat.class);
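
For context, a rough sketch of how that line fits into a complete streaming
job. Assumptions: the Spark 1.x Java API (flatMap returning an Iterable),
Java 8 lambdas, a 2-second batch interval, and a socket source standing in
for whatever receiver you actually use; class name and host/port are
illustrative.

    import java.util.Arrays;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("streaming-word-count");
            // 2-second batches, matching the interval mentioned in the thread.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(2000));

            // Source is illustrative; swap in your Kafka/Flume/etc. receiver.
            JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

            JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")))
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

            // One part file per batch directory; drop repartition(1) to keep
            // the write parallel (see Sean's note above about parallelism).
            counts.repartition(1).saveAsHadoopFiles(
                "hdfs://user/ec2-user/", "csv",
                String.class, String.class, (Class) TextOutputFormat.class);

            jssc.start();
            jssc.awaitTermination();
        }
    }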


Thanks
Best Regards

On Fri, Feb 13, 2015 at 11:59 AM, Su She suhsheka...@gmail.com wrote:

 Hello Everyone,

 I am writing simple word counts to HDFS using
 messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
 String.class, (Class) TextOutputFormat.class);

 1) However, every 2 seconds I'm getting a new directory that is titled as
 a csv. So I'll have test.csv, which will be a directory that has two files
 inside of it called part-0 and part-1 (something like that). This
 obviously makes it very hard for me to read the data stored in the csv
 files. I am wondering if there is a better way to store the
 JavaPairRecieverDStream and JavaPairDStream?

 2) I know there is a copy/merge Hadoop API for merging files... can this be
 done inside Java? I am not sure of the logic behind this API if I am using
 Spark Streaming, which is constantly making new files.

 Thanks a lot for the help!