This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new cc662875cee [SPARK-42797][CONNECT][DOCS] Grammatical improvements for Spark Connect content
cc662875cee is described below

commit cc662875cee3cccc67c3ee2a30f0d44d5b618ac8
Author: Allan Folting <allan.folt...@databricks.com>
AuthorDate: Wed Mar 15 12:33:45 2023 +0900

    [SPARK-42797][CONNECT][DOCS] Grammatical improvements for Spark Connect content

    ### What changes were proposed in this pull request?
    Grammatical improvements to the Spark Connect content as a follow-up on https://github.com/apache/spark/pull/40324/

    ### Why are the changes needed?
    To improve readability of the pages.

    ### Does this PR introduce _any_ user-facing change?
    Yes, user-facing documentation is updated.

    ### How was this patch tested?
    Built the doc website locally and checked the updates.
    PRODUCTION=1 SKIP_RDOC=1 bundle exec jekyll build

    Closes #40428 from allanf-db/connect_overview_doc.

    Authored-by: Allan Folting <allan.folt...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
    (cherry picked from commit 88d5c752829722b0b42f2c91fd57fb3e8fa17339)
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 docs/index.md                  | 14 +++++++-------
 docs/spark-connect-overview.md | 28 ++++++++++++++--------------
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 4f24ad4edce..37b1311c306 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -21,7 +21,7 @@ license: |
 ---
 
 Apache Spark is a unified analytics engine for large-scale data processing.
-It provides high-level APIs in Java, Scala, Python and R,
+It provides high-level APIs in Java, Scala, Python, and R,
 and an optimized engine that supports general execution graphs.
 It also supports a rich set of higher-level tools including
 [Spark SQL](sql-programming-guide.html) for SQL and structured data processing,
 [pandas API on Spark](api/python/getting_started/quickstart_ps.html) for pandas workloads,
 [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing,
 and [Structured Streaming](structured-streaming-programming-guide.html) for incremental computation and stream processing.
@@ -39,17 +39,17 @@ source, visit [Building Spark](building-spark.html).
 Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS), and it should run on
 any platform that runs a supported version of Java. This should include JVMs on x86_64 and
 ARM64. It's easy to run locally on one machine --- all you need is to have `java` installed
 on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation.
 
-Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+.
+Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+, and R 3.5+.
 Python 3.7 support is deprecated as of Spark 3.4.0.
 Java 8 prior to version 8u362 support is deprecated as of Spark 3.4.0.
 When using the Scala API, it is necessary for applications to use the same version of Scala that Spark was compiled for.
 For example, when using Scala 2.13, use Spark compiled for 2.13, and compile code/applications for Scala 2.13 as well.
 
-For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is required additionally for Apache Arrow library. This prevents `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available` when Apache Arrow uses Netty internally.
+For Java 11, setting `-Dio.netty.tryReflectionSetAccessible=true` is required for the Apache Arrow library. This prevents the `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available` error when Apache Arrow uses Netty internally.
 # Running the Examples and Shell
 
-Spark comes with several sample programs. Python, Scala, Java and R examples are in the
+Spark comes with several sample programs. Python, Scala, Java, and R examples are in the
 `examples/src/main` directory.
 
 To run Spark interactively in a Python interpreter, use
@@ -77,14 +77,14 @@ great way to learn the framework.
 The `--master` option specifies the
 [master URL for a distributed cluster](submitting-applications.html#master-urls), or `local` to run
 locally with one thread, or `local[N]` to run locally with N threads. You should start by using
-`local` for testing. For a full list of options, run Spark shell with the `--help` option.
+`local` for testing. For a full list of options, run the Spark shell with the `--help` option.
 
-Spark also provides an [R API](sparkr.html) since 1.4 (only DataFrame APIs are included).
+Since version 1.4, Spark has provided an [R API](sparkr.html) (only the DataFrame APIs are included).
 To run Spark interactively in an R interpreter, use `bin/sparkR`:
 
     ./bin/sparkR --master "local[2]"
 
-Example applications are also provided in R. For example,
+Example applications are also provided in R. For example:
 
     ./bin/spark-submit examples/src/main/r/dataframe.R

diff --git a/docs/spark-connect-overview.md b/docs/spark-connect-overview.md
index e46fb9ad913..f942a884873 100644
--- a/docs/spark-connect-overview.md
+++ b/docs/spark-connect-overview.md
@@ -44,13 +44,13 @@ The Spark Connect client translates DataFrame operations into unresolved
 logical query plans which are encoded using protocol buffers. These are sent
 to the server using the gRPC framework.
 
-The Spark Connect endpoint embedded on the Spark Server, receives and
+The Spark Connect endpoint embedded on the Spark Server receives and
 translates unresolved logical plans into Spark's logical plan operators.
 This is similar to parsing a SQL query, where attributes and relations are
 parsed and an initial parse plan is built.
 From there, the standard Spark execution process kicks in, ensuring that Spark
 Connect leverages all of Spark's optimizations and enhancements. Results are
 streamed back to the
-client via gRPC as Apache Arrow-encoded row batches.
+client through gRPC as Apache Arrow-encoded row batches.
 
 <p style="text-align: center;">
   <img src="img/spark-connect-communication.png" title="Spark Connect communication" alt="Spark Connect communication" />
@@ -67,11 +67,11 @@ own dependencies on the client and don't need to worry about potential
 conflicts with the Spark driver.
 
 **Upgradability**: The Spark driver can now seamlessly be upgraded independently
-of applications, e.g. to benefit from performance improvements and security fixes.
+of applications, for example to benefit from performance improvements and security fixes.
 This means applications can be forward-compatible, as long as the server-side RPC
 definitions are designed to be backwards compatible.
 
-**Debuggability and Observability**: Spark Connect enables interactive debugging
+**Debuggability and observability**: Spark Connect enables interactive debugging
 during development directly from your favorite IDE. Similarly, applications can
 be monitored using the application's framework native metrics and logging
 libraries.
@@ -106,8 +106,8 @@ Spark Connect, like in this example:
 
 Note that we include a Spark Connect package (`spark-connect_2.12:3.4.0`), when starting
 Spark server. This is required to use Spark Connect. Make sure to use the same version
-of the package as the Spark version you downloaded above. In the example here, Spark 3.4.0
-with Scala 2.12.
+of the package as the Spark version you downloaded previously. In this example,
+Spark 3.4.0 with Scala 2.12.
 
 Now Spark server is running and ready to accept Spark Connect sessions from client
 applications. In the next section we will walk through how to use Spark Connect
@@ -116,7 +116,7 @@ when writing client applications.
 ## Use Spark Connect in client applications
 
 When creating a Spark session, you can specify that you want to use Spark Connect
-and there are a few ways to do that as outlined below.
+and there are a few ways to do that outlined as follows.
 
 If you do not use one of the mechanisms outlined here, your Spark session will
 work just like before, without leveraging Spark Connect, and your application code
@@ -125,12 +125,12 @@ will run on the Spark driver node.
 ### Set SPARK_REMOTE environment variable
 
 If you set the `SPARK_REMOTE` environment variable on the client machine where your
-Spark client application is running and create a new Spark Session as illustrated
-below, the session will be a Spark Connect session. With this approach, there is
-no code change needed to start using Spark Connect.
+Spark client application is running and create a new Spark Session as in the following
+example, the session will be a Spark Connect session. With this approach, there is no
+code change needed to start using Spark Connect.
 
 In a terminal window, set the `SPARK_REMOTE` environment variable to point to the
-local Spark server you started on your computer above:
+local Spark server you started previously on your computer:
 
 {% highlight bash %}
 export SPARK_REMOTE="sc://localhost"
@@ -164,8 +164,8 @@ spark = SparkSession.builder.getOrCreate()
 </div>
 
-Which will create a Spark Connect session from your application by reading the
-`SPARK_REMOTE` environment variable we set above.
+This will create a Spark Connect session from your application by reading the
+`SPARK_REMOTE` environment variable we set previously.
 
 ### Specify Spark Connect when creating Spark session
 
@@ -180,7 +180,7 @@ illustrated here.
 <div data-lang="python" markdown="1">
 To launch the PySpark shell with Spark Connect, simply include the `remote`
 parameter and specify the location of your Spark server. We are using `localhost`
-in this example to connect to the local Spark server we started above.
+in this example to connect to the local Spark server we started previously.
 
 {% highlight bash %}
 ./bin/pyspark --remote "sc://localhost"
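Both mechanisms in the patched pages (`SPARK_REMOTE` and the `--remote` flag) take the same `sc://` connection string. As a rough illustration of how such a string breaks down, here is a small self-contained sketch; `parse_spark_remote` is a hypothetical helper written for illustration (the real parsing happens inside the PySpark client), and the fallback of 15002 assumes Spark Connect's default gRPC port:

```python
def parse_spark_remote(url: str) -> dict:
    """Illustrative-only parser for Spark Connect connection strings.

    Assumes the shape sc://host[:port][/;param=value;...] and falls back
    to 15002, Spark Connect's default gRPC port, when no port is given.
    """
    prefix = "sc://"
    if not url.startswith(prefix):
        raise ValueError(f"expected a URL starting with {prefix!r}, got {url!r}")
    rest = url[len(prefix):]
    # Split the host[:port] part from the optional ;-separated parameters.
    host_part, _, param_part = rest.partition("/")
    host, _, port = host_part.partition(":")
    params = dict(
        p.split("=", 1)
        for p in param_part.strip(";").split(";")
        if "=" in p
    )
    return {"host": host, "port": int(port) if port else 15002, "params": params}


# The value used throughout the patched docs:
print(parse_spark_remote("sc://localhost"))
# -> {'host': 'localhost', 'port': 15002, 'params': {}}

# A fuller form with an explicit port and a parameter:
print(parse_spark_remote("sc://spark.example.com:15002/;use_ssl=true"))
# -> {'host': 'spark.example.com', 'port': 15002, 'params': {'use_ssl': 'true'}}
```

Either way the string is supplied, a session created via `SparkSession.builder.getOrCreate()` (with `SPARK_REMOTE` set) targets the server it names rather than a local driver.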