Thanks Akhil for the link. Is there a reason why a new directory is created for each batch? Is this format easily readable by other applications such as Hive/Impala?
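To make question 2) below a bit more concrete, this is roughly what I was picturing inside the streaming job: write each batch, then immediately merge it into a single .csv file in one flat directory. This is just an untested sketch on my side, assuming the Spark 1.x Java API and the Hadoop 2.x FileUtil.copyMerge signature; the paths and the batch-<timestamp> naming are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.Time;
    import scala.Tuple2;

    // 'messages' is the JavaPairDStream<String, String> coming from the receiver
    messages.foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
        @Override
        public Void call(JavaPairRDD<String, String> rdd, Time time) throws Exception {
            // 1) write this batch as csv lines into a temporary per-batch directory (made-up path)
            String tmpDir = "/user/ec2-user/tmp/batch-" + time.milliseconds();
            rdd.map(new Function<Tuple2<String, String>, String>() {
                @Override
                public String call(Tuple2<String, String> kv) {
                    return kv._1() + "," + kv._2();
                }
            }).repartition(1).saveAsTextFile(tmpDir);

            // 2) collapse that directory into a single .csv file in one flat output directory,
            //    deleting the temporary directory afterwards (the 'true' flag)
            Configuration conf = new Configuration();
            FileSystem fs = new Path(tmpDir).getFileSystem(conf);
            FileUtil.copyMerge(fs, new Path(tmpDir), fs,
                    new Path("/user/ec2-user/csv/batch-" + time.milliseconds() + ".csv"),
                    true, conf, null);
            return null;
        }
    });

Does something like that look reasonable, or is there a cleaner way to end up with plain csv files in a single directory?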
On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> You can directly write to HBase with Spark. Here's an example for doing
> that: https://issues.apache.org/jira/browse/SPARK-944
>
> Thanks
> Best Regards
>
> On Sat, Feb 14, 2015 at 2:55 PM, Su She <suhsheka...@gmail.com> wrote:
>
>> Hello Akhil, thank you for your continued help!
>>
>> 1) So, if I can write it programmatically after every batch, then
>> technically I should be able to have just the csv files in one directory.
>> However, can the /desired/output/file.txt be in HDFS? If it is only local,
>> I am not sure it will help me for the use case I describe in 2).
>>
>> So can I do something like this: hadoop fs -getmerge /output/dir/on/hdfs
>> desired/dir/in/hdfs ?
>>
>> 2) Just to make sure I am going down the right path... my end use case is
>> to use Hive or HBase to create a database off these csv files. Is there an
>> easy way for Hive to read /user/test/many sub directories/with one csv file
>> in each into a table?
>>
>> Thank you!
>>
>> On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> Simplest way would be to merge the output files at the end of your job,
>>> like:
>>>
>>> hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>>>
>>> If you want to do it programmatically, then you can use the
>>> FileUtil.copyMerge API, like:
>>>
>>> FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
>>> FileSystem of destination (hdfs), path to the merged file /merged-output,
>>> true (to delete the original dir), null)
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhsheka...@gmail.com> wrote:
>>>
>>>> Thanks Akhil for the suggestion, it is now only giving me one
>>>> part-xxxx. Is there any way I can just create a file rather than a
>>>> directory? It doesn't seem like there is a plain saveAsTextFile option
>>>> for JavaPairReceiverDStream.
>>>>
>>>> Also, for the copy/merge API, how would I add that to my Spark job?
>>>>
>>>> Thanks Akhil!
>>>>
>>>> Best,
>>>>
>>>> Su
>>>>
>>>> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das <ak...@sigmoidanalytics.com>
>>>> wrote:
>>>>
>>>>> For a streaming application, every batch will create a new directory
>>>>> and put the data in it. If you don't want multiple part-xxxx files
>>>>> inside the directory, you can do a repartition before the saveAs* call:
>>>>>
>>>>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
>>>>> String.class, String.class, (Class) TextOutputFormat.class);
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Everyone,
>>>>>>
>>>>>> I am writing simple word counts to HDFS using
>>>>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
>>>>>> String.class, (Class) TextOutputFormat.class);
>>>>>>
>>>>>> 1) However, every 2 seconds I am getting a new *directory* that is
>>>>>> titled as a csv. So I'll have test.csv, which will be a directory that
>>>>>> has two files inside of it called part-00000 and part-00001 (something
>>>>>> like that). This obviously makes it very hard for me to read the data
>>>>>> stored in the csv files. I am wondering if there is a better way to
>>>>>> store the JavaPairReceiverDStream and JavaPairDStream?
>>>>>>
>>>>>> 2) I know there is a copy/merge Hadoop API for merging files... can
>>>>>> this be done inside Java? I am not sure how this API fits in if I am
>>>>>> using Spark Streaming, which is constantly creating new files.
>>>>>>
>>>>>> Thanks a lot for the help!
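
P.S. Akhil, regarding writing directly to HBase: is the idea for each batch roughly something like the snippet below? This is not taken from SPARK-944, just my guess at the usual TableOutputFormat route, and it is untested; the table name "wordcounts" and the column family/qualifier are made up, and it assumes hbase-site.xml is on the classpath and an HBase client around the 0.98 line (newer clients use Put.addColumn instead of Put.add).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    // given one batch's JavaPairRDD<String, String> of (word, count) pairs, e.g. inside foreachRDD
    static void writeBatchToHBase(JavaPairRDD<String, String> rdd) throws Exception {
        Configuration hbaseConf = HBaseConfiguration.create();         // picks up hbase-site.xml
        hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "wordcounts");   // made-up table name
        Job job = Job.getInstance(hbaseConf);
        job.setOutputFormatClass(TableOutputFormat.class);

        JavaPairRDD<ImmutableBytesWritable, Put> puts = rdd.mapToPair(
            new PairFunction<Tuple2<String, String>, ImmutableBytesWritable, Put>() {
                @Override
                public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<String, String> kv) {
                    Put put = new Put(Bytes.toBytes(kv._1()));              // row key = word
                    put.add(Bytes.toBytes("counts"), Bytes.toBytes("count"), // made-up family/qualifier
                            Bytes.toBytes(kv._2()));
                    return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
                }
            });

        puts.saveAsNewAPIHadoopDataset(job.getConfiguration());
    }

If the example in the JIRA does it differently, I am happy to follow that instead.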