Thanks Gerard, I'll give that a try. It seems like this approach is going to create a very large number of files. I guess I could write a cron job to concatenate the files by hour, or maybe by day. I imagine this is a common problem. Do you know of something that does this already?
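For the record, that hourly concatenation job is easy to sketch. The snippet below is a local-filesystem sketch only, assuming Spark's usual `part-NNNNN` output naming; on HDFS the same effect comes from `hadoop fs -getmerge <dir> <out>` (or the HDFS Java API) driven by crontab or Quartz. The class and method names here are illustrative, not from any library.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;

public class MergeParts {
    // Local-filesystem sketch of the hourly "concatenate the part files" job.
    // On HDFS the equivalent is `hadoop fs -getmerge <dir> <out>` scheduled
    // from crontab or a Quartz job.
    public static void merge(Path dir, Path out) throws IOException {
        List<Path> parts;
        try (var stream = Files.list(dir)) {
            // Keep only Spark's part files; skip _SUCCESS markers and the
            // merged output itself. Sort so records stay in partition order.
            parts = stream
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
        try (OutputStream merged = Files.newOutputStream(
                out, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path p : parts) {
                Files.copy(p, merged); // append each part file in turn
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("streaming-parts");
        Files.writeString(dir.resolve("part-00000"), "first batch\n");
        Files.writeString(dir.resolve("part-00001"), "second batch\n");
        Path merged = dir.resolve("merged-hour-00.txt");
        merge(dir, merged);
        System.out.println(Files.readString(merged));
    }
}
```

After the merge succeeds, the cron job would delete the original part files so the directory count stays bounded.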
I am using the standalone cluster manager. I do not think it directly supports cron-job/crontab functionality. It should be easy to use the HDFS API and Linux crontab, or maybe https://quartz-scheduler.org/

Kind regards

Andy

From: Gerard Maas <gerard.m...@gmail.com>
Date: Sunday, November 8, 2015 at 2:13 AM
To: Andrew Davidson <a...@santacruzintegration.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: streaming: missing data. does saveAsTextFile() append or replace?

> Andy,
>
> Using rdd.saveAsTextFile(...) will overwrite the data if your target is
> the same file.
>
> If you want to save to HDFS, DStream offers dstream.saveAsTextFiles(prefix,
> suffix), where a new file will be written at each streaming interval.
> Note that this will result in a saved file for each streaming interval. If you
> want to increase the file size (usually a good idea in HDFS), you can use a
> window function over the dstream and save the 'windowed' dstream instead.
>
> kind regards, Gerard.
>
> On Sat, Nov 7, 2015 at 10:55 PM, Andy Davidson <a...@santacruzintegration.com>
> wrote:
>> Hi
>>
>> I just started a new Spark Streaming project. In this phase of the system all
>> we want to do is save the data we receive to HDFS. After running for a
>> couple of days it looks like I am missing a lot of data. I wonder if
>> saveAsTextFile("hdfs:///rawSteamingData") is overwriting the data I captured
>> in previous windows? I noticed that after running for a couple of days my
>> HDFS file system has 25 files. The names are something like "part-00006". I
>> used 'hadoop fs -dus' to check the total data captured. While the system was
>> running I would periodically call '-dus', and I was surprised that sometimes
>> the number of total bytes actually dropped.
>>
>> Is there a better way to write my data to disk?
>>
>> Any suggestions would be appreciated
>>
>> Andy
>>
>> public static void main(String[] args) {
>>     SparkConf conf = new SparkConf().setAppName(appName);
>>     JavaSparkContext jsc = new JavaSparkContext(conf);
>>     JavaStreamingContext ssc = new JavaStreamingContext(jsc, new Duration(5 * 1000));
>>
>>     [ deleted code ]
>>
>>     data.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>         private static final long serialVersionUID = -7957854392903581284L;
>>
>>         @Override
>>         public Void call(JavaRDD<String> jsonStr) throws Exception {
>>             jsonStr.saveAsTextFile("hdfs:///rawSteamingData"); // /rawSteamingData is a directory
>>             return null;
>>         }
>>     });
>>
>>     ssc.checkpoint(checkPointUri);
>>
>>     ssc.start();
>>     ssc.awaitTermination();
>> }
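The missing data is consistent with Gerard's explanation: the code above calls saveAsTextFile() on the same fixed directory for every batch, so each interval replaces the previous interval's part files. One minimal fix is to derive a fresh output directory from the batch time before saving. The sketch below is plain Java with no Spark dependency so the path logic stands alone; `BatchPaths` and `batchDir` are illustrative names, not part of any API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class BatchPaths {
    // Bucket by UTC date and time so batches sort chronologically in HDFS
    // and an hourly merge job can pick up whole directories.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd/HH-mm-ss").withZone(ZoneOffset.UTC);

    // Derive a unique output directory from the batch time so each
    // saveAsTextFile() call writes somewhere new instead of replacing
    // the previous batch's part files.
    public static String batchDir(String root, long batchTimeMs) {
        return root + "/" + FMT.format(Instant.ofEpochMilli(batchTimeMs));
    }

    public static void main(String[] args) {
        System.out.println(batchDir("hdfs:///rawSteamingData", System.currentTimeMillis()));
    }
}
```

Inside the foreachRDD above this would become jsonStr.saveAsTextFile(batchDir("hdfs:///rawSteamingData", System.currentTimeMillis())). Alternatively, Gerard's dstream.saveAsTextFiles(prefix, suffix) already does the equivalent for you, writing each interval to a name of the form prefix-<timeInMs>.suffix.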