[jira] [Commented] (SPARK-18791) Stream-Stream Joins

2018-03-05 Thread Yuriy Bondaruk (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386583#comment-16386583 ]

Yuriy Bondaruk commented on SPARK-18791:


Shouldn't it be marked as resolved? Stream-stream joins are already supported 
in Spark 2.3: 
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins

> Stream-Stream Joins
> ---
>
> Key: SPARK-18791
> URL: https://issues.apache.org/jira/browse/SPARK-18791
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Michael Armbrust
> Assignee: Tathagata Das
> Priority: Major
>
> Stream-stream join is a much-requested but missing feature in Structured
> Streaming. While the join API exists in Datasets and DataFrames, it throws
> UnsupportedOperationException when applied between two streaming
> Datasets/DataFrames. To support this, we have to maintain the same semantics
> as other Structured Streaming operations: the result of the operation after
> consuming data from the two streams up to positions/offsets X and Y,
> respectively, must be the same as a single batch join operation on all the
> data up to positions X and Y, respectively. To achieve this, the execution
> has to buffer past data (i.e. streaming state) from each stream, so that
> future data can be matched against past data. Here are a few high-level
> requirements:
> - Buffer past rows as streaming state (using StateStore), and join new rows
> with the past rows.
> - Support state cleanup using the event time watermark when possible.
> - Support different types of joins (inner, left outer, and right outer are
> in highest demand for ETL/enrichment type use cases [kafka -> best-effort
> enrich -> write to S3]).
> - Support cascading join operations (i.e. joining more than 2 streams).
> - Support multiple output modes (Append mode is in highest demand for
> enabling ETL/enrichment type use cases).
> All the work to incrementally build this is going to be represented by this
> JIRA, with specific subtasks for each step. At this point, the rough
> direction is as follows:
> - Implement stream-stream inner join in Append Mode, supporting multiple
> cascaded joins.
> - Extend it to stream-stream left/right outer join in Append Mode.
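
The buffering and cleanup requirements in the description can be illustrated with a toy sketch in plain Python. This is not Spark's implementation and does not use StateStore. The class name, the max_gap time-range condition, the allowed_lateness parameter, and the sample rows are all assumptions made for illustration. The sketch also assumes that no row arrives with an event time below the current watermark. Under those assumptions it shows that joining incrementally against buffered past rows gives the same result as one batch join over all the data, while the watermark lets old rows be dropped from state.

```python
class StreamStreamInnerJoin:
    """Toy incremental inner join of two keyed, timestamped streams.

    Rows match when their keys are equal and their event times differ by
    at most max_gap (a hypothetical time-range join condition).
    """

    def __init__(self, max_gap, allowed_lateness):
        self.max_gap = max_gap
        self.allowed_lateness = allowed_lateness
        # Buffered past rows per side: key -> list of (event_time, value).
        self.state = {"left": {}, "right": {}}
        self.max_time = {"left": float("-inf"), "right": float("-inf")}

    def watermark(self):
        # Assumed policy: the slower stream's max event time minus the lateness.
        return min(self.max_time.values()) - self.allowed_lateness

    def buffered_rows(self):
        return sum(len(rows) for side in self.state.values() for rows in side.values())

    def process(self, side, key, event_time, value):
        """Join one new row against the other side's buffer, then buffer it."""
        other = "right" if side == "left" else "left"
        joined = []
        for other_time, other_value in self.state[other].get(key, []):
            if abs(event_time - other_time) <= self.max_gap:
                if side == "left":
                    joined.append((key, value, other_value))
                else:
                    joined.append((key, other_value, value))
        self.state[side].setdefault(key, []).append((event_time, value))
        self.max_time[side] = max(self.max_time[side], event_time)
        self._evict()
        return joined

    def _evict(self):
        # A row older than watermark - max_gap cannot match any future row,
        # because future rows are assumed to have event_time >= watermark.
        cutoff = self.watermark() - self.max_gap
        for rows_by_key in self.state.values():
            for key in list(rows_by_key):
                kept = [(t, v) for t, v in rows_by_key[key] if t >= cutoff]
                if kept:
                    rows_by_key[key] = kept
                else:
                    del rows_by_key[key]


def batch_join(left_rows, right_rows, max_gap):
    """Reference result: a single batch join over all rows."""
    return [
        (left_key, left_value, right_value)
        for left_key, left_time, left_value in left_rows
        for right_key, right_time, right_value in right_rows
        if left_key == right_key and abs(left_time - right_time) <= max_gap
    ]


# Hypothetical sample rows: (key, event_time, value).
left_rows = [("a", 1, "L1"), ("b", 2, "L2"), ("a", 10, "L3")]
right_rows = [("a", 2, "R1"), ("a", 9, "R2"), ("b", 20, "R3")]

# Interleave both streams in event-time order to simulate arrival.
arrivals = sorted(
    [("left",) + row for row in left_rows] + [("right",) + row for row in right_rows],
    key=lambda arrival: arrival[2],
)

joiner = StreamStreamInnerJoin(max_gap=2, allowed_lateness=1)
incremental = []
for side, key, event_time, value in arrivals:
    incremental.extend(joiner.process(side, key, event_time, value))

print(sorted(incremental) == sorted(batch_join(left_rows, right_rows, max_gap=2)))
print(joiner.buffered_rows(), "of", len(arrivals), "rows still buffered")
```

Without a time bound in the join condition, no row could ever be dropped safely, which is one way to read the "when possible" qualifier on the state cleanup requirement.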



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18791) Stream-Stream Joins

2017-06-06 Thread xianyao jiang (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038690#comment-16038690 ]

xianyao jiang commented on SPARK-18791:
---

We have a draft design for the stream-stream inner join and have completed a
demo based on it; it appears to work. We hope to get more advice and help from
the open source community to make the stream join implementation more widely
usable. If you have any questions or advice, please contact us and let us
know. Thanks.
Document link:
https://docs.google.com/document/d/1i528WI7KFica0Dg1LTQfdQMsW8ai3WDvHmUvkH1BKg4/edit?usp=sharing

> Stream-Stream Joins
> ---
>
> Key: SPARK-18791
> URL: https://issues.apache.org/jira/browse/SPARK-18791
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Michael Armbrust
>
> Just a placeholder for now.  Please comment with your requirements.






[jira] [Commented] (SPARK-18791) Stream-Stream Joins

2017-04-24 Thread Saul Shanabrook (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981383#comment-15981383 ]

Saul Shanabrook commented on SPARK-18791:
-

I am using Spark to process the results from genetic programming experiments. 
One dataframe (from a directory of Parquet files) has a row for each 
experiment, holding the configuration. Another dataframe has one row for each 
"generation" of each experiment. I want to join these together and write out a 
dataframe that has one row per experiment, where one column contains an array 
of all the generations for each experiment.
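
The output shape described above can be sketched in plain Python. This is a batch illustration only, not a streaming job and not Spark code. The field names (experiment_id, config, generation, best_fitness) and the sample rows are hypothetical, since the comment does not give a schema.

```python
from collections import defaultdict

# Hypothetical sample data; field names are illustrative only.
experiments = [
    {"experiment_id": 1, "config": {"population": 100}},
    {"experiment_id": 2, "config": {"population": 500}},
]
generations = [
    {"experiment_id": 1, "generation": 0, "best_fitness": 0.2},
    {"experiment_id": 2, "generation": 0, "best_fitness": 0.1},
    {"experiment_id": 1, "generation": 1, "best_fitness": 0.5},
]


def nest_generations(experiments, generations):
    """Return one row per experiment with its generations in an array column."""
    by_experiment = defaultdict(list)
    for row in generations:
        # Drop the join key from the nested rows; it lives on the parent row.
        nested = {name: value for name, value in row.items() if name != "experiment_id"}
        by_experiment[row["experiment_id"]].append(nested)
    return [
        dict(experiment, generations=by_experiment.get(experiment["experiment_id"], []))
        for experiment in experiments
    ]


result = nest_generations(experiments, generations)
for row in result:
    print(row["experiment_id"], len(row["generations"]))
```

In a streaming setting the generation rows keep arriving, so the array for an experiment is only complete once no more generations are expected for it.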




[jira] [Commented] (SPARK-18791) Stream-Stream Joins

2017-04-12 Thread xianyao jiang (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965657#comment-15965657 ]

xianyao jiang commented on SPARK-18791:
---

When will this feature be available? Is there a plan or timeline for it?
We want to use Structured Streaming, but we need the stream-stream join
function.
 
