This is an automated email from the ASF dual-hosted git repository. aromanenko pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push: new a9dcf95453d [CdapIO] Add readme for CdapIO. Update readme for SparkReceiverIO. (#23959) a9dcf95453d is described below commit a9dcf95453d43621ef3ce63782f7cd6c47399bb6 Author: Vitaly Terentyev <vitaly.terent...@akvelon.com> AuthorDate: Mon Dec 5 14:45:31 2022 +0400 [CdapIO] Add readme for CdapIO. Update readme for SparkReceiverIO. (#23959) * Add README for CdapIO. Update README for SparkReceiverIO. * Set export javadoc true for cdap and sparkreceiver * Updates to CdapIO and SparkReceiverIO readmes * Add manual how to add support for Batch and Streaming Cdap plugins. * Add manual how to add support for Spark Receiver. * Fix whitespace * updates to CDAP and SparkReceiver readme files * Updated CDAP readme * Fix links in readme Co-authored-by: Alex Kosolapov <alex.kosola...@gmail.com> --- sdks/java/io/cdap/README.md | 145 ++++++++++++++++++++++++++++++ sdks/java/io/cdap/build.gradle | 1 - sdks/java/io/sparkreceiver/2/README.md | 40 +++++++-- sdks/java/io/sparkreceiver/2/build.gradle | 1 - 4 files changed, 180 insertions(+), 7 deletions(-) diff --git a/sdks/java/io/cdap/README.md b/sdks/java/io/cdap/README.md new file mode 100644 index 00000000000..5269204c1d1 --- /dev/null +++ b/sdks/java/io/cdap/README.md @@ -0,0 +1,145 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# CdapIO +CdapIO provides I/O transforms for [CDAP](https://cdap.io/) plugins. + +## What is CDAP? + +[CDAP](https://cdap.io/) is an application platform for building and managing data applications in hybrid and multi-cloud environments. +It enables developers, business analysts, and data scientists to use a visual rapid development environment and utilize common patterns, +data, and application abstractions to accelerate the development of data applications, addressing a broader range of real-time and batch use cases. + +[CDAP plugins](https://github.com/data-integrations) types: +- Batch source +- Batch sink +- Streaming source + +To learn more about CDAP plugins please see [io.cdap.cdap.api.annotation.Plugin](https://javadoc.io/static/io.cdap.cdap/cdap-api/6.7.2/io/cdap/cdap/api/annotation/Plugin.html) and [Data Integrations](https://github.com/data-integrations) plugins repository. + +## CDAP Batch plugins support in CDAP IO + +CdapIO supports CDAP Batch plugins based on Hadoop [InputFormat](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html) and [OutputFormat](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/OutputFormat.html). +CDAP batch plugins support is implemented using [HadoopFormatIO](https://beam.apache.org/documentation/io/built-in/hadoop/). + +CdapIO currently supports the following CDAP Batch plugins by referencing `CDAP plugin` class: +* [Hubspot Batch Source](https://github.com/data-integrations/hubspot/blob/develop/src/main/java/io/cdap/plugin/hubspot/source/batch/HubspotBatchSource.java) +* [Hubspot Batch Sink](https://github.com/data-integrations/hubspot/blob/develop/src/main/java/io/cdap/plugin/hubspot/sink/batch/HubspotBatchSink.java) +* [Salesforce Batch Source](https://github.com/data-integrations/salesforce/blob/develop/src/main/java/io/cdap/plugin/salesforce/plugin/source/batch/SalesforceBatchSource.java) +* [Salesforce Batch Sink](https://github.com/data-integrations/salesforce/blob/develop/src/main/java/io/cdap/plugin/salesforce/plugin/sink/batch/SalesforceBatchSink.java) +* [ServiceNow Batch Source](https://github.com/data-integrations/servicenow-plugins/blob/develop/src/main/java/io/cdap/plugin/servicenow/source/ServiceNowSource.java) +* [Zendesk Batch Source](https://github.com/data-integrations/zendesk/blob/develop/src/main/java/io/cdap/plugin/zendesk/source/batch/ZendeskBatchSource.java) + +It means that all these plugins can be used like this: +``CdapIO.withCdapPluginClass(HubspotBatchSource.class)`` + +### Requirements for Cdap Batch plugins + +CDAP Batch plugin should be based on `HadoopFormat` implementation. + +### How to add support for a new CDAP Batch plugin + +To add CdapIO support for a new CDAP Batch [Plugin](src/main/java/org/apache/beam/sdk/io/cdap/Plugin.java) perform the following steps: +1. Find CDAP plugin artifacts in the Maven Central repository. *Example:* [Hubspot plugin Maven repository](https://mvnrepository.com/artifact/io.cdap/hubspot-plugins/1.0.0). *Note:* To add a custom CDAP plugin, please follow [Sonatype publishing guidelines](https://central.sonatype.org/publish/). +2. Add the CDAP plugin Maven dependency to the `build.gradle` file. *Example:* ``implementation "io.cdap:hubspot-plugins:1.0.0"``. +3. Here are two ways of using CDAP batch plugin with CdapIO: + 1. Using `Plugin.createBatch()` method. Pass Cdap Plugin class and correct `InputFormat` (or `OutputFormat`) and `InputFormatProvider` (or `OutputFormatProvider`) classes to CdapIO. *Example:* + ``` + CdapIO.withCdapPlugin( + Plugin.createBatch( + EmployeeBatchSource.class, + EmployeeInputFormat.class, + EmployeeInputFormatProvider.class)); + ``` + 2. Using `MappingUtils`. + 1. Navigate to [MappingUtils](src/main/java/org/apache/beam/sdk/io/cdap/MappingUtils.java) class. + 2. Modify `getPluginClassByName()` method: + 3. Add the code for mapping Cdap Plugin class name and `Input/Output Format` and `FormatProvider` classes. + *Example:* + ``` + if (pluginClass.equals(EmployeeBatchSource.class)){ + return Plugin.createBatch(pluginClass, + EmployeeInputFormat.class, + EmployeeInputFormatProvider.class); + } + ``` + 4. After these steps you will be able to use Cdap Plugin by class name like this: ``CdapIO.withCdapPluginClass(EmployeeBatchSource.class)`` + +To learn more, please check out [complete examples](https://github.com/apache/beam/tree/master/examples/java/cdap/src/main/java/org/apache/beam/examples/complete/cdap). + +## CDAP Streaming plugins support in CDAP IO + +CdapIO supports CDAP Streaming plugins based on [Apache Spark Receiver](https://spark.apache.org/docs/2.4.0/streaming-custom-receivers.html). +CDAP streaming plugins support is implemented using [SparkReceiverIO](https://github.com/apache/beam/tree/master/sdks/java/io/sparkreceiver). + +### Requirements for Cdap Streaming plugins + +1. CDAP Streaming plugin should be based on `Spark Receiver`. +2. CDAP Streaming plugin should support work with offsets. + 1. Corresponding Spark Receiver should implement [HasOffset](https://github.com/apache/beam/blob/master/sdks/java/io/sparkreceiver/src/main/java/org/apache/beam/sdk/io/sparkreceiver/HasOffset.java) interface. + 2. Records should have the numeric field that represents record offset. *Example:* `RecordId` field for Salesforce and `vid` field for Hubspot plugins. + For more details please see [GetOffsetUtils](https://github.com/apache/beam/tree/master/examples/java/cdap/src/main/java/org/apache/beam/examples/complete/cdap/utils/GetOffsetUtils.java) class from examples. + +### How to add support for a new CDAP Streaming plugin + +To add CdapIO support for a new CDAP Streaming SparkReceiver [Plugin](src/main/java/org/apache/beam/sdk/io/cdap/Plugin.java), perform the following steps: +1. Find CDAP plugin artifacts in the Maven Central repository. *Example:* [Hubspot plugin Maven repository](https://mvnrepository.com/artifact/io.cdap/hubspot-plugins/1.0.0). *Note:* To add a custom CDAP plugin, please follow [Sonatype publishing guidelines](https://central.sonatype.org/publish/). +2. Add CDAP plugin Maven dependency to the `build.gradle` file. *Example:* ``implementation "io.cdap:hubspot-plugins:1.0.0"``. +3. Implement function that will define how to get `Long offset` from the record of the Cdap Plugin. +*Example:* see [GetOffsetUtils](https://github.com/apache/beam/tree/master/examples/java/cdap/src/main/java/org/apache/beam/examples/complete/cdap/utils/GetOffsetUtils.java) class from examples. +4. Here are two ways of using Cdap streaming Plugin with CdapIO: + 1. Using `Plugin.createStreaming()` method. Pass Cdap Plugin class, correct `getOffsetFn` (from step 3) and Spark `Receiver` class to CdapIO. *Example:* + ``` + CdapIO.withCdapPlugin( + Plugin.createStreaming( + HubspotStreamingSource.class, + offsetFnForHubspot, + HubspotReceiver.class))); + ``` + 2. Using `MappingUtils`. + 1. Navigate to [MappingUtils](src/main/java/org/apache/beam/sdk/io/cdap/MappingUtils.java) class. + 2. Modify `getPluginClassByName()` method: + 3. Add the code for mapping Cdap Plugin class name, `getOffsetFn` function and Spark `Receiver` class. + *Example:* + ``` + if (pluginClass.equals(HubspotStreamingSource.class)){ + return Plugin.createStreaming(pluginClass, + getOffsetFnForHubpot(), + HubspotReceiverClass.class); + } + ``` + 4. After these steps you will be able to use Cdap Plugin by class name like this: ``CdapIO.withCdapPluginClass(HubspotStreamingSource.class)`` + +To learn more, please check out [complete examples](https://github.com/apache/beam/tree/master/examples/java/cdap/src/main/java/org/apache/beam/examples/complete/cdap). + +## Dependencies + +To use CdapIO please add a dependency on `beam-sdks-java-io-cdap`. + +```maven +<dependency> + <groupId>org.apache.beam</groupId> + <artifactId>beam-sdks-java-io-cdap</artifactId> + <version>...</version> +</dependency> +``` + +## Documentation + +The documentation and usage examples are maintained in JavaDoc for [CdapIO.java](src/main/java/org/apache/beam/sdk/io/cdap/CdapIO.java). diff --git a/sdks/java/io/cdap/build.gradle b/sdks/java/io/cdap/build.gradle index 52edc91078a..2274325ceef 100644 --- a/sdks/java/io/cdap/build.gradle +++ b/sdks/java/io/cdap/build.gradle @@ -22,7 +22,6 @@ plugins { } applyJavaNature( - exportJavadoc: false, automaticModuleName: 'org.apache.beam.sdk.io.cdap', ) provideIntegrationTestingDependencies() diff --git a/sdks/java/io/sparkreceiver/2/README.md b/sdks/java/io/sparkreceiver/2/README.md index 6ce48efd58f..035456f30fd 100644 --- a/sdks/java/io/sparkreceiver/2/README.md +++ b/sdks/java/io/sparkreceiver/2/README.md @@ -16,12 +16,44 @@ specific language governing permissions and limitations under the License. --> +# SparkReceiverIO -SparkReceiverIO contains I/O transforms which allow you to read messages from Spark Receiver (org.apache.spark.streaming.receiver.Receiver). +SparkReceiverIO provides I/O transforms to read messages from an [Apache Spark Receiver](https://spark.apache.org/docs/2.4.0/streaming-custom-receivers.html) `org.apache.spark.streaming.receiver.Receiver` as an unbounded source. + +## Prerequistes + +SparkReceiverIO supports [Spark Receivers](https://spark.apache.org/docs/2.4.0/streaming-custom-receivers.html) (Spark version 2.4). +1. Corresponding Spark Receiver should implement [HasOffset](https://github.com/apache/beam/blob/master/sdks/java/io/sparkreceiver/src/main/java/org/apache/beam/sdk/io/sparkreceiver/HasOffset.java) interface. +2. Records should have the numeric field that represents record offset. *Example:* `RecordId` field for Salesforce and `vid` field for Hubspot Receivers. + For more details please see [GetOffsetUtils](https://github.com/apache/beam/tree/master/examples/java/cdap/src/main/java/org/apache/beam/examples/complete/cdap/utils/GetOffsetUtils.java) class from CDAP plugins examples. + +## Adding support for a new Spark Receiver + +To add SparkReceiverIO support for a new Spark `Receiver`, perform the following steps: +1. Add Spark Receiver to the Maven Central repository (see [Sonatype publishing guidelines](https://central.sonatype.org/publish/)). *Example:* [Hubspot CDAP plugin Maven repository](https://mvnrepository.com/artifact/io.cdap/hubspot-plugins/1.0.0). +2. Add Spark Receiver Maven dependency to the `build.gradle` file. *Example:* ``implementation "io.cdap:hubspot-plugins:1.0.0"``. +3. Implement function that will define how to get `Long offset` from the record of the Spark Receiver. + *Example:* see [GetOffsetUtils](https://github.com/apache/beam/tree/master/examples/java/cdap/src/main/java/org/apache/beam/examples/complete/cdap/utils/GetOffsetUtils.java) class from CDAP plugins examples. +4. Construct `ReceiverBuilder` object by passing class of record that you want to read (e.g. String) and your Spark `Receiver` class name (dependency from step 2). *Example:* + ``` + ReceiverBuilder<String, HubspotReceiver> receiverBuilder = + new ReceiverBuilder<>(HubspotReceiver.class).withConstructorArgs(); + ``` +5. Use your Spark `Receiver` with SparkReceiverIO: + 1. Pass correct `getOffsetFn` (from step 3) and correct `ReceiverBuilder` (from step 4). *Example:* + ``` + SparkReceiverIO.Read<V> reader = + SparkReceiverIO.<V>read() + .withGetOffsetFn(getOffsetFn) + .withSparkReceiverBuilder(receiverBuilder); + ``` + + +To learn more, please check out CDAP Streaming plugins [complete examples](https://github.com/apache/beam/tree/master/examples/java/cdap/src/main/java/org/apache/beam/examples/complete/cdap) where Spark Receivers are used. ## Dependencies -To use SparkReceiverIO you must first add a dependency on `beam-sdks-java-io-sparkreceiver`. +To use SparkReceiverIO, add a dependency on `beam-sdks-java-io-sparkreceiver`. ```maven <dependency> @@ -33,6 +65,4 @@ To use SparkReceiverIO you must first add a dependency on `beam-sdks-java-io-spa ## Documentation -The documentation is maintained in JavaDoc for SparkReceiverIO class. It includes -usage examples and primary concepts. -- [SparkReceiverIO.java](src/main/java/org/apache/beam/sdk/io/sparkreceiver/SparkReceiverIO.java) +The documentation and usage examples are maintained in JavaDoc for [SparkReceiverIO class](src/main/java/org/apache/beam/sdk/io/sparkreceiver/SparkReceiverIO.java). diff --git a/sdks/java/io/sparkreceiver/2/build.gradle b/sdks/java/io/sparkreceiver/2/build.gradle index 7607127adca..b039b9ce204 100644 --- a/sdks/java/io/sparkreceiver/2/build.gradle +++ b/sdks/java/io/sparkreceiver/2/build.gradle @@ -22,7 +22,6 @@ plugins { } applyJavaNature( - exportJavadoc: false, automaticModuleName: 'org.apache.beam.sdk.io.sparkreceiver', ) provideIntegrationTestingDependencies()