Yes, I meant batch interval. Thanks for clarifying.

Cheers,

Michael


On Oct 7, 2014, at 11:14 PM, jayant [via Apache Spark User List] 
<ml-node+s1001560n15904...@n3.nabble.com> wrote:

> Hi Michael,
> 
> I think you mean batch interval rather than windowing. A larger batch
> interval can be helpful when you do not want to process very small batches.
> 
> HDFS sink in Flume has the concept of rolling files based on time, number of 
> events or size.
> https://flume.apache.org/FlumeUserGuide.html#hdfs-sink
> 
> The same could be applied to Spark if the use case demands it. The only
> major catch is that it would break the concept of window operations as they
> exist in Spark.
> 
> Thanks,
> Jayant
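
For reference, a hypothetical Flume agent snippet showing the three rolling knobs the HDFS sink exposes (the agent/sink names and path are made up; setting a knob to 0 disables that trigger):

```properties
# Roll a file every 5 minutes, or once it reaches ~128 MB, whichever
# comes first; rollCount = 0 disables event-count-based rolling.
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
agent.sinks.hdfsSink.hdfs.rollInterval = 300
agent.sinks.hdfsSink.hdfs.rollSize = 134217728
agent.sinks.hdfsSink.hdfs.rollCount = 0
```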
> 
> 
> 
> 
> On Tue, Oct 7, 2014 at 10:19 PM, Michael Allman <[hidden email]> wrote:
> Hi Andrew,
> 
> The use case I have in mind is batch data serialization to HDFS, where sizing
> files to a certain HDFS block size is desired. In my particular use case, I
> want to process 10GB batches of data at a time. I'm not sure this is a
> sensible use case for Spark Streaming, and I was trying to test it. However,
> I had trouble getting it working and in the end decided it was more trouble
> than it was worth. So I split my task into two: a streaming job on small,
> time-defined batches of data, and a traditional Spark job aggregating the
> smaller files into a larger whole. In retrospect, I think this is the right
> way to go even if a count-based window specification were possible, so I
> can't offer my use case as motivation for a count-based window size.
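
As a rough sketch of the sizing arithmetic in the aggregation job (assuming a 128 MB HDFS block size; in Spark the resulting count would typically feed something like `rdd.coalesce(n)` before writing):

```python
def target_partitions(total_bytes, block_size=128 * 1024 * 1024):
    """Number of output files needed so each is roughly one HDFS block."""
    return max(1, -(-total_bytes // block_size))  # ceiling division

# A 10 GB batch at 128 MB per block comes out to 80 output files.
print(target_partitions(10 * 1024 ** 3))
```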
> 
> Cheers,
> 
> Michael
> 
> On Oct 5, 2014, at 4:03 PM, Andrew Ash <[hidden email]> wrote:
> 
>> Hi Michael,
>> 
>> I couldn't find anything in Jira for it -- 
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22window%22%20AND%20component%20%3D%20Streaming
>> 
>> Could you or Adrian please file a Jira ticket explaining the functionality 
>> and maybe a proposed API?  This will help people interested in count-based 
>> windowing to understand the state of the feature in Spark Streaming.
>> 
>> Thanks!
>> Andrew
>> 
>> On Fri, Oct 3, 2014 at 4:09 PM, Michael Allman <[hidden email]> wrote:
>> Hi,
>> 
>> I also have a use for count-based windowing. I'd like to process data
>> batches by size as opposed to time. Is this feature on the development
>> roadmap? Is there a JIRA ticket for it?
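
To make the idea concrete, a minimal pure-Python sketch of what count-based batching would mean (no Spark API implied; a DStream equivalent would need support in Spark Streaming itself):

```python
from itertools import islice

def count_batches(records, batch_size):
    """Yield successive batches of batch_size records (the last may be
    smaller), i.e. windows defined by element count rather than time."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Seven records in batches of three: [0, 1, 2], [3, 4, 5], [6]
print(list(count_batches(range(7), 3)))
```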
>> 
>> Thank you,
>> 
>> Michael
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/window-every-n-elements-instead-of-time-based-tp2085p15701.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> 
>> 
> 
> 
> 
> 




