Re: Spark

Matt Casters Sat, 19 Jan 2019 11:50:17 -0800

Thanks for the suggestion, but throwing another server into the mix wouldn't
help in my case.  I'm still betting that using SparkLauncher would solve a
lot.
Building a fat jar isn't that big of a deal.  However, all the Kettle libs
along with the ones from Beam clock in at 1.4GB at this point in time.
Creating packaged copies of that collection all over the place just feels
icky, if you know what I mean.
Making things easy, convenient and transparent for the end-user is not
always trivial, but it's something I have to go for since that's
the whole purpose of using a data integration tool ;-)
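
To make the launch step concrete, here is a minimal sketch of the spark-submit
invocation that a GUI could assemble on the user's behalf. It uses only the
JDK and only builds the command. The Spark home, master URL, main class and
jar path are hypothetical placeholders, not values from Kettle or this thread;
`--class`, `--master` and `--deploy-mode` are standard spark-submit options and
`--runner=SparkRunner` is the usual Beam pipeline option.

```java
import java.util.ArrayList;
import java.util.List;

final class SparkSubmitSketch {

    /** Builds the argument list for a spark-submit call; nothing is executed here. */
    public static List<String> buildCommand(String sparkHome, String master,
                                            String deployMode, String mainClass,
                                            String fatJar, List<String> appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add(sparkHome + "/bin/spark-submit");
        cmd.add("--class");
        cmd.add(mainClass);          // the class whose main() calls pipeline.run()
        cmd.add("--master");
        cmd.add(master);
        cmd.add("--deploy-mode");
        cmd.add(deployMode);         // "client" or "cluster"
        cmd.add(fatJar);             // application jar containing the pipeline
        cmd.addAll(appArgs);         // passed through to the main class
        return cmd;
    }

    public static void main(String[] args) {
        // All values below are assumptions for illustration only.
        List<String> cmd = buildCommand(
                "/opt/spark", "spark://spark-master:7077", "cluster",
                "org.example.MainBeam", "/tmp/kettle-beam-fat.jar",
                List.of("--runner=SparkRunner"));
        System.out.println(String.join(" ", cmd));
        // To actually launch it, something like:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

As far as I can tell, org.apache.spark.launcher.SparkLauncher exposes the same
settings through setAppResource(), setMainClass(), setMaster() and
setDeployMode(), so a sketch like this should map onto it fairly directly.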


I'm somewhat documenting my exploits over here:
https://github.com/mattcasters/kettle-beam/issues/23

Once I've solved these things we'll be able to visually design batch or
streaming transformations in Kettle, write unit tests for them and then,
from the same GUI, launch them on a Direct runner, in DataFlow, on Spark,
Flink, ... I think it will be a first in the open source data integration
world and it's all possible thanks to the Apache Beam team so on behalf of
myself and our community: thanks again.

Cheers,

Matt
---
Matt Casters <[email protected]>
Senior Solution Architect, Kettle Project Founder




On Fri, 18 Jan 2019 at 16:58, Alexey Romanenko <
[email protected]> wrote:

> Hi Matt,
>
> I just wanted to remind you that you can also use Apache Livy [1] to launch
> Spark jobs (or Beam pipelines that are built with support of SparkRunner)
> on Spark using just the REST API [2].
> And of course, you need to manually create a "fat" jar and put it
> somewhere where Spark can find it.
>
> [1] https://livy.incubator.apache.org/
> [2] https://livy.incubator.apache.org/docs/latest/rest-api.html
>
> On 18 Jan 2019, at 13:03, Juan Carlos Garcia <[email protected]> wrote:
>
> Hi Matt,
>
> With Flink you will be able to launch your pipeline just by invoking the main
> method of your main class; however, it will run as a standalone process and
> you will not have the advantage of distributed computation.
>
> On Fri, 18 Jan 2019, 09:37, Matt Casters <[email protected]>
> wrote:
>
>> Thanks for the reply JC, I really appreciate it.
>>
>> I really can't force our users to use antiquated stuff like scripts, let
>> alone command line things, but I'll simply use SparkLauncher and your
>> comment about the main class doing Pipeline.run() on the Master is
>> something I can work with... somewhat.
>> The execution results, metrics and all that are handled by the Master, I
>> guess.  Over time I'll figure out a way to report the metrics and results
>> from the master back to the client.  I've done similar things with
>> Map/Reduce in the past.
>>
>> Looking around I see that the same conditions apply for Flink.  Is this
>> because Spark and Flink lack the APIs to talk to a client about the state
>> of workloads unlike DataFlow and the Direct Runner?
>>
>> Thanks!
>>
>> Matt
>> ---
>> Matt Casters <[email protected]>
>> Senior Solution Architect, Kettle Project Founder
>>
>>
>>
>>
>> On Thu, 17 Jan 2019 at 15:30, Juan Carlos Garcia <
>> [email protected]> wrote:
>>
>>> Hi Matt, during the time we were using Spark with Beam, the solution was
>>> always to pack the jar and use the spark-submit command pointing to your
>>> main class which will do `pipeline.run`.
>>>
>>> The spark-submit command has a flag to decide how to run it
>>> (--deploy-mode): whether to launch the driver on the submitting machine
>>> (client) or on one of the machines in the cluster (cluster).
>>>
>>>
>>> JC
>>>
>>>
>>> On Thu, Jan 17, 2019 at 10:00 AM Matt Casters <[email protected]>
>>> wrote:
>>>
>>>> Dear Beam friends,
>>>>
>>>> Now that I've got cool data integration (Kettle-beam) scenarios running
>>>> on DataFlow with sample data sets in Google (Files, Pub/Sub, BigQuery,
>>>> Streaming, Windowing, ...) I thought it was time to also give Apache Spark
>>>> some attention.
>>>>
>>>> The thing I have some trouble with is figuring out what the
>>>> relationship is between the runner (SparkRunner), Pipeline.run() and
>>>> spark-submit (or SparkLauncher).
>>>>
>>>> The samples I'm seeing mostly involve packaging up a jar file and then
>>>> doing a spark-submit.  That obviously makes it unclear if Pipeline.run()
>>>> should be used at all and how Metrics should be obtained from a Spark job
>>>> during execution or after completion.
>>>>
>>>> I really like the way the GCP DataFlow implementation automatically
>>>> deploys jar file binaries and from what I can
>>>> determine org.apache.spark.launcher.SparkLauncher offers this functionality
>>>> so perhaps I'm either doing something wrong or I'm reading the docs wrong
>>>> or the wrong docs.
>>>> The thing is, if you try running your pipelines against a Spark master,
>>>> feedback is really minimal, putting you in a trial & error situation pretty
>>>> quickly.
>>>>
>>>> So thanks again in advance for any help!
>>>>
>>>> Cheers,
>>>>
>>>> Matt
>>>> ---
>>>> Matt Casters <[email protected]>
>>>> Senior Solution Architect, Kettle Project Founder
>>>>
>>>>
>>>
>>> --
>>>
>>> JC
>>>
>>>
>
