Code-sharing between the two proposed Spark runners is a great question, and
I believe my answers will clarify why I suggested two runners instead of a
fork. Without getting into class-by-class details, the current Spark runner
uses the RDD (and DStream) API, while Structured Streaming (Spark 2) and
the
+1
I definitely think it is important to support Spark 1 and 2 simultaneously,
and I agree that side-by-side seems the best way to do it. I'll refrain
from commenting on the specific technical aspects of the two runners and
focus just on the split: I am also curious about the answer to Dan's
question.
Can they share any substantial code? If not, they will really be separate
runners.
If so, would it make more sense to fork into runners/spark and
runners/spark2?
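To make the trade-off concrete, here is a minimal, hypothetical sketch (none of these class or method names are Beam's actual ones) of what "substantial shared code" could look like: a version-agnostic translator contract plus shared pipeline-walking logic, with each runner plugging in its own Spark-API-specific translation.

```java
import java.util.HashMap;
import java.util.Map;

public class RunnerSplitSketch {

  // Shared module: a version-agnostic contract both runners could implement.
  interface TransformTranslator {
    String translate(String transformName);
  }

  // runners/spark: translation targeting the RDD/DStream API (Spark 1.x).
  static class RddTranslator implements TransformTranslator {
    public String translate(String transformName) {
      return "rdd." + transformName;      // would map onto JavaRDD operations
    }
  }

  // runners/spark2: translation targeting the Dataset API (Spark 2.x).
  static class DatasetTranslator implements TransformTranslator {
    public String translate(String transformName) {
      return "dataset." + transformName;  // would map onto Dataset operations
    }
  }

  // Shared driver logic: walks the pipeline once, independent of Spark version.
  static Map<String, String> translatePipeline(TransformTranslator t, String... transforms) {
    Map<String, String> plan = new HashMap<>();
    for (String name : transforms) {
      plan.put(name, t.translate(name));
    }
    return plan;
  }

  public static void main(String[] args) {
    Map<String, String> v1 = translatePipeline(new RddTranslator(), "ParDo", "GroupByKey");
    Map<String, String> v2 = translatePipeline(new DatasetTranslator(), "ParDo", "GroupByKey");
    System.out.println(v1.get("ParDo"));  // rdd.ParDo
    System.out.println(v2.get("ParDo"));  // dataset.ParDo
  }
}
```

Under this sketch, the pipeline traversal and any version-independent utilities are the shareable part; the per-transform translators against the RDD vs. Dataset APIs are the part that genuinely cannot be shared.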
On Thu, Aug 4, 2016 at 9:33 AM, Ismaël Mejía wrote:
+1
In particular for three reasons:
1. The new Dataset API in Spark 2 and the new semantics it allows for the
runner (and the fact that we cannot backport this to the Spark 1
runner).
2. The current performance regressions in Spark 2 (another reason to keep
the Spark 1 runner).
3. The differe
After discussions with JB, and understanding that a lot of companies
running Spark will probably run 1.6.x for a while, we thought it would be a
good idea to have (some) support for both branches.
The SparkRunnerV1 will mostly support Batch, but could also support
“KeyedState” workflows and Sessions.