[jira] [Commented] (GOBBLIN-385) Add Spark execution mode for Gobblin
[ https://issues.apache.org/jira/browse/GOBBLIN-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424428#comment-16424428 ]

Hung Tran commented on GOBBLIN-385:
-----------------------------------

[^samza.diff]

[~vinothchandar], I have attached the Samza work.

Gobblin was plugged into Samza through the {{SystemProducer}} interface. The {{GobblinSystemProducer}} creates a {{SamzaTaskRunner}} for each source that is registered by Samza. The {{SamzaTaskRunner}} instantiates a {{JobContext}} that is configured to execute a {{SamzaSource}}. This {{SamzaSource}} consumes from a queue that is written to whenever Samza calls {{SystemProducer.send()}}.

Please see the unit tests for example configuration.

> Add Spark execution mode for Gobblin
> ------------------------------------
>
> Key: GOBBLIN-385
> URL: https://issues.apache.org/jira/browse/GOBBLIN-385
> Project: Apache Gobblin
> Issue Type: New Feature
> Components: gobblin-cluster
> Reporter: Vinoth Chandar
> Assignee: Hung Tran
> Priority: Major
> Attachments: samza.diff
>
>
> If there is interest, happy to contribute spark execution mode and eventually
> add support for ingesting data into [https://github.com/uber/hudi] format.
> Please provide some guidance

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
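The queue handoff described in the comment above can be sketched in plain Java. This is only an illustration of the pattern (a producer-side {{send()}} that enqueues, and a source that drains the queue), not the code in samza.diff. All class and method names below ({{QueueBackedSource}}, {{QueueHandoffProducer}}, {{register}}, {{drain}}) are hypothetical stand-ins and are not Samza or Gobblin APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for SamzaSource: the ingestion side reads from a queue.
class QueueBackedSource {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Producer side: append one record to the queue.
    void enqueue(String record) {
        queue.add(record);
    }

    // Ingestion side: remove and return everything queued so far, in arrival order.
    List<String> drain() {
        List<String> records = new ArrayList<>();
        queue.drainTo(records);
        return records;
    }
}

// Hypothetical stand-in for GobblinSystemProducer: one queue-backed source per registered name.
class QueueHandoffProducer {
    private final Map<String, QueueBackedSource> sources = new ConcurrentHashMap<>();

    // Mirrors "a runner is created for each source that is registered".
    void register(String sourceName) {
        sources.computeIfAbsent(sourceName, name -> new QueueBackedSource());
    }

    // Mirrors send(): the record is only enqueued here; ingestion happens elsewhere.
    void send(String sourceName, String record) {
        QueueBackedSource source = sources.get(sourceName);
        if (source == null) {
            throw new IllegalStateException("Source not registered: " + sourceName);
        }
        source.enqueue(record);
    }

    QueueBackedSource sourceFor(String sourceName) {
        return sources.get(sourceName);
    }
}

class QueueHandoffDemo {
    public static void main(String[] args) {
        QueueHandoffProducer producer = new QueueHandoffProducer();
        producer.register("orders");
        producer.send("orders", "record-1");
        producer.send("orders", "record-2");
        System.out.println(producer.sourceFor("orders").drain());
    }
}
```

The point of the queue is decoupling: the caller of {{send()}} never blocks on ingestion, and the source can pull records at its own pace. A real integration would also need a consumer thread and backpressure, which this sketch leaves out.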
[jira] [Commented] (GOBBLIN-385) Add Spark execution mode for Gobblin
[ https://issues.apache.org/jira/browse/GOBBLIN-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419203#comment-16419203 ]

Vinoth Chandar commented on GOBBLIN-385:
----------------------------------------

At a high level, it seems we need two new classes:

 - *CliSparkJobLauncher*: just a wrapper around ServiceBasedApplicationLauncher and SparkJobLauncher.
 - *SparkJobLauncher*: I think we should be able to extend MRJobLauncher and override only `runWorkUnits` (which I think is where the parallel work happens). This class can create the SparkContext (there can be only one per JVM) and simply reuse the existing input/output formats via SparkContext.newAPIHadoopRDD and PairRDD.saveAsNewAPIHadoopFile.

[http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaPairRDD.html#saveAsNewAPIHadoopFile-java.lang.String-java.lang.Class-java.lang.Class-java.lang.Class-]

[https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/api/java/JavaSparkContext.html#newAPIHadoopRDD(org.apache.hadoop.conf.Configuration,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class)]

The above seems very simple, but I am sure that once we actually try it we may see standard Spark issues such as NotSerializable exceptions, or some MR-specific code paths.

Do you have the Samza patch sitting in a diff somewhere? It could be useful to check out before I embark on trying this out for real.
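The "extend the launcher and override only the parallel step" idea from the comment above can be sketched without any Spark or Gobblin dependency. Everything here is an assumption made for illustration: {{DemoWorkUnit}}, {{BaseDemoLauncher}}, {{ParallelDemoLauncher}} and the signature of {{runWorkUnits}} are hypothetical stand-ins, and a Java parallel stream takes the place of a Spark job. A real SparkJobLauncher would instead build its RDD with newAPIHadoopRDD and write with saveAsNewAPIHadoopFile, which is not shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a Gobblin work unit.
final class DemoWorkUnit {
    final String id;

    DemoWorkUnit(String id) {
        this.id = id;
    }
}

// Hypothetical base launcher: launch() is fixed and delegates the parallel step to runWorkUnits().
class BaseDemoLauncher {
    final List<String> launch(List<DemoWorkUnit> workUnits) {
        List<String> results = runWorkUnits(workUnits);
        // Stand-in for the commit step that a subclass inherits unchanged.
        return results.stream().map(r -> "committed:" + r).collect(Collectors.toList());
    }

    // Default engine: process the work units one after another.
    protected List<String> runWorkUnits(List<DemoWorkUnit> workUnits) {
        return workUnits.stream().map(BaseDemoLauncher::process).collect(Collectors.toList());
    }

    static String process(DemoWorkUnit workUnit) {
        return "done:" + workUnit.id;
    }
}

// Only the parallel step is overridden; a parallel stream stands in for a Spark job.
class ParallelDemoLauncher extends BaseDemoLauncher {
    @Override
    protected List<String> runWorkUnits(List<DemoWorkUnit> workUnits) {
        return workUnits.parallelStream().map(BaseDemoLauncher::process).collect(Collectors.toList());
    }
}

class LauncherOverrideDemo {
    public static void main(String[] args) {
        List<DemoWorkUnit> workUnits = Arrays.asList(new DemoWorkUnit("wu-1"), new DemoWorkUnit("wu-2"));
        System.out.println(new ParallelDemoLauncher().launch(workUnits));
    }
}
```

Keeping {{launch()}} final in the sketch makes the design choice explicit: setup and commit logic stay shared, and a new execution engine only has to supply its own way of running the work units. The serialization problems mentioned in the comment would show up in exactly that overridden method, since whatever it ships to executors must be serializable.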
[jira] [Commented] (GOBBLIN-385) Add Spark execution mode for Gobblin
[ https://issues.apache.org/jira/browse/GOBBLIN-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413933#comment-16413933 ]

Vinoth Chandar commented on GOBBLIN-385:
----------------------------------------

gtk :) I have started scoping. Will get something to you by EoW
[jira] [Commented] (GOBBLIN-385) Add Spark execution mode for Gobblin
[ https://issues.apache.org/jira/browse/GOBBLIN-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411898#comment-16411898 ]

Abhishek Tiwari commented on GOBBLIN-385:
-----------------------------------------

This popped up twice again in conversations last week :)
[jira] [Commented] (GOBBLIN-385) Add Spark execution mode for Gobblin
[ https://issues.apache.org/jira/browse/GOBBLIN-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374524#comment-16374524 ]

Vinoth Chandar commented on GOBBLIN-385:
----------------------------------------

Apologies, I have been busy with other things lately. Will get to this in a week or so.
[jira] [Commented] (GOBBLIN-385) Add Spark execution mode for Gobblin
[ https://issues.apache.org/jira/browse/GOBBLIN-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336638#comment-16336638 ]

Vinoth Chandar commented on GOBBLIN-385:
----------------------------------------

Sg. Let me scope the change and respond back here with a plan to see if that makes sense.
[jira] [Commented] (GOBBLIN-385) Add Spark execution mode for Gobblin
[ https://issues.apache.org/jira/browse/GOBBLIN-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336619#comment-16336619 ]

Abhishek Tiwari commented on GOBBLIN-385:
-----------------------------------------

[~vinothchandar] we would love that! I remember users have asked about it in our monthly video meetups a few times, so there is interest. Let us know if you have any questions about getting started.
[jira] [Commented] (GOBBLIN-385) Add Spark execution mode for Gobblin
[ https://issues.apache.org/jira/browse/GOBBLIN-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336538#comment-16336538 ]

Vinoth Chandar commented on GOBBLIN-385:
----------------------------------------

ah [~hutran] :) we meet again, looks like