This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 0257b77528a [SPARK-45532][DOCS] Restore codetabs for the Protobuf Data Source Guide
0257b77528a is described below

commit 0257b77528a3a0d0ba08df5363470e4bc5928b06
Author: Kent Yao <y...@apache.org>
AuthorDate: Fri Oct 13 14:05:23 2023 +0800

    [SPARK-45532][DOCS] Restore codetabs for the Protobuf Data Source Guide

    ### What changes were proposed in this pull request?

    This PR restores the code tabs of the [Protobuf Data Source Guide](https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html#python), which https://github.com/apache/spark/pull/40614 removed while fixing Markdown syntax. In this PR, we introduce a hidden div to hold the Markdown code-block marker, so that both Liquid and Markdown render correctly.

    ### Why are the changes needed?

    Improve doc readability and consistency.

    ### Does this PR introduce _any_ user-facing change?

    Yes, documentation change.

    ### How was this patch tested?

    #### Doc build

    ![image](https://github.com/apache/spark/assets/8326978/8aefeee0-92b2-4048-a3f6-108e4c3f309d)

    #### Markdown editor and view

    ![image](https://github.com/apache/spark/assets/8326978/283b0820-390a-4540-8713-647c40f956ac)

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #43361 from yaooqinn/SPARK-45532.

    Authored-by: Kent Yao <y...@apache.org>
    Signed-off-by: Kent Yao <y...@apache.org>
---
 docs/sql-data-sources-protobuf.md | 243 +++++++++++++++++++++++---------------
 1 file changed, 150 insertions(+), 93 deletions(-)

diff --git a/docs/sql-data-sources-protobuf.md b/docs/sql-data-sources-protobuf.md
index f92a8f20b35..c8ee139e344 100644
--- a/docs/sql-data-sources-protobuf.md
+++ b/docs/sql-data-sources-protobuf.md
@@ -18,7 +18,10 @@ license: |
   limitations under the License.
 ---
 
-Since Spark 3.4.0 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing protobuf data.
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Since Spark 3.4.0 release, [Spark SQL](sql-programming-guide.html) provides built-in support for reading and writing protobuf data.
 
 ## Deploying
 The `spark-protobuf` module is external and not included in `spark-submit` or `spark-shell` by default.
@@ -46,45 +49,53 @@ Kafka key-value record will be augmented with some metadata, such as the ingesti
 Spark SQL schema is generated based on the protobuf descriptor file or protobuf class passed to `from_protobuf` and `to_protobuf`. The specified protobuf class or protobuf descriptor file must match the data; otherwise, the behavior is undefined: it may fail or return arbitrary results.
 
-### Python
+<div class="codetabs">
+
+<div data-lang="python" markdown="1">
+
+<div class="d-none">
+This div is only used to make markdown editor/viewer happy and does not display on web
+
 ```python
+</div>
+
+{% highlight python %}
+
 from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf
-# `from_protobuf` and `to_protobuf` provides two schema choices. Via Protobuf descriptor file,
 # or via shaded Java class.
 # given input .proto protobuf schema
-# syntax = "proto3"
 # message AppEvent {
-# string name = 1;
-# int64 id = 2;
-# string context = 3;
 # }
-
-df = spark\
-.readStream\
-.format("kafka")\
-.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
-.option("subscribe", "topic1")\
-.load()
+df = spark
+  .readStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribe", "topic1")
+  .load()
 
 # 1. Decode the Protobuf data of schema `AppEvent` into a struct;
 # 2. Filter by column `name`;
 # 3. Encode the column `event` in Protobuf format.
 # The Protobuf protoc command can be used to generate a protobuf descriptor file for a given .proto file.
-output = df\
-.select(from_protobuf("value", "AppEvent", descriptorFilePath).alias("event"))\
-.where('event.name == "alice"')\
-.select(to_protobuf("event", "AppEvent", descriptorFilePath).alias("event"))
+output = df
+  .select(from_protobuf("value", "AppEvent", descriptorFilePath).alias("event"))
+  .where('event.name == "alice"')
+  .select(to_protobuf("event", "AppEvent", descriptorFilePath).alias("event"))
 
 # Alternatively, you can decode and encode the SQL columns into protobuf format using protobuf
 # class name. The specified Protobuf class must match the data, otherwise the behavior is undefined:
 # it may fail or return arbitrary result. To avoid conflicts, the jar file containing the
 # 'com.google.protobuf.*' classes should be shaded. An example of shading can be found at
 # https://github.com/rangadi/shaded-protobuf-classes.
-
-output = df\
-.select(from_protobuf("value", "org.sparkproject.spark_protobuf.protobuf.AppEvent").alias("event"))\
-.where('event.name == "alice"')
+output = df
+  .select(from_protobuf("value", "org.sparkproject.spark_protobuf.protobuf.AppEvent").alias("event"))
+  .where('event.name == "alice"')
 
 output.printSchema()
 # root
@@ -94,52 +105,66 @@ output.printSchema()
 # |  |-- context: string (nullable = true)
 
 output = output
-.select(to_protobuf("event", "org.sparkproject.spark_protobuf.protobuf.AppEvent").alias("event"))
-
-query = output\
-.writeStream\
-.format("kafka")\
-.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
-.option("topic", "topic2")\
-.start()
+  .select(to_protobuf("event", "org.sparkproject.spark_protobuf.protobuf.AppEvent").alias("event"))
+
+query = output
+  .writeStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("topic", "topic2")
+  .start()
+
+{% endhighlight %}
+
+<div class="d-none">
 ```
+</div>
+
+</div>
+
+<div data-lang="scala" markdown="1">
+
+<div class="d-none">
+This div is only used to make markdown editor/viewer happy and does not display on web
 
-### Scala
 ```scala
+</div>
+
+{% highlight scala %}
 
 import org.apache.spark.sql.protobuf.functions._
 
-// `from_protobuf` and `to_protobuf` provides two schema choices. Via the protobuf descriptor file,
 // or via shaded Java class.
 // given input .proto protobuf schema
-// syntax = "proto3"
 // message AppEvent {
-// string name = 1;
-// int64 id = 2;
-// string context = 3;
 // }
 
 val df = spark
-.readStream
-.format("kafka")
-.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
-.option("subscribe", "topic1")
-.load()
+  .readStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribe", "topic1")
+  .load()
 
 // 1. Decode the Protobuf data of schema `AppEvent` into a struct;
 // 2. Filter by column `name`;
 // 3. Encode the column `event` in Protobuf format.
 // The Protobuf protoc command can be used to generate a protobuf descriptor file for a given .proto file.
 val output = df
-.select(from_protobuf($"value", "AppEvent", descriptorFilePath) as $"event")
-.where("event.name == \"alice\"")
-.select(to_protobuf($"user", "AppEvent", descriptorFilePath) as $"event")
+  .select(from_protobuf($"value", "AppEvent", descriptorFilePath) as $"event")
+  .where("event.name == \"alice\"")
+  .select(to_protobuf($"event", "AppEvent", descriptorFilePath) as $"event")
 
 val query = output
-.writeStream
-.format("kafka")
-.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
-.option("topic", "topic2")
-.start()
+  .writeStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("topic", "topic2")
+  .start()
 
 // Alternatively, you can decode and encode the SQL columns into protobuf format using protobuf
 // class name. The specified Protobuf class must match the data, otherwise the behavior is undefined:
@@ -147,8 +172,8 @@ val query = output
 // 'com.google.protobuf.*' classes should be shaded. An example of shading can be found at
 // https://github.com/rangadi/shaded-protobuf-classes.
 var output = df
-.select(from_protobuf($"value", "org.example.protos..AppEvent") as $"event")
-.where("event.name == \"alice\"")
+  .select(from_protobuf($"value", "org.example.protos.AppEvent") as $"event")
+  .where("event.name == \"alice\"")
 
 output.printSchema()
 // root
@@ -160,43 +185,56 @@ output.printSchema()
 
 output = output.select(to_protobuf($"event", "org.sparkproject.spark_protobuf.protobuf.AppEvent") as $"event")
 
 val query = output
-.writeStream
-.format("kafka")
-.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
-.option("topic", "topic2")
-.start()
+  .writeStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("topic", "topic2")
+  .start()
+
+{% endhighlight %}
+
+<div class="d-none">
 ```
+</div>
+</div>
+
+<div data-lang="java" markdown="1">
+
+<div class="d-none">
+This div is only used to make markdown editor/viewer happy and does not display on web
 
-### Java
 ```java
+</div>
+
+{% highlight java %}
 
 import static org.apache.spark.sql.functions.col;
 import static org.apache.spark.sql.protobuf.functions.*;
 
-// `from_protobuf` and `to_protobuf` provides two schema choices. Via the protobuf descriptor file,
 // or via shaded Java class.
 // given input .proto protobuf schema
-// syntax = "proto3"
 // message AppEvent {
-// string name = 1;
-// int64 id = 2;
-// string context = 3;
 // }
 
 Dataset<Row> df = spark
-.readStream()
-.format("kafka")
-.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
-.option("subscribe", "topic1")
-.load();
+  .readStream()
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribe", "topic1")
+  .load();
 
 // 1. Decode the Protobuf data of schema `AppEvent` into a struct;
 // 2. Filter by column `name`;
 // 3. Encode the column `event` in Protobuf format.
 // The Protobuf protoc command can be used to generate a protobuf descriptor file for a given .proto file.
 Dataset<Row> output = df
-.select(from_protobuf(col("value"), "AppEvent", descriptorFilePath).as("event"))
-.where("event.name == \"alice\"")
-.select(to_protobuf(col("event"), "AppEvent", descriptorFilePath).as("event"));
+  .select(from_protobuf(col("value"), "AppEvent", descriptorFilePath).as("event"))
+  .where("event.name == \"alice\"")
+  .select(to_protobuf(col("event"), "AppEvent", descriptorFilePath).as("event"));
 
 // Alternatively, you can decode and encode the SQL columns into protobuf format using protobuf
 // class name. The specified Protobuf class must match the data, otherwise the behavior is undefined:
@@ -204,10 +242,10 @@ Dataset<Row> output = df
 // 'com.google.protobuf.*' classes should be shaded. An example of shading can be found at
 // https://github.com/rangadi/shaded-protobuf-classes.
 Dataset<Row> output = df
-.select(
-    from_protobuf(col("value"),
-    "org.sparkproject.spark_protobuf.protobuf.AppEvent").as("event"))
-.where("event.name == \"alice\"")
+  .select(
+    from_protobuf(col("value"),
+    "org.sparkproject.spark_protobuf.protobuf.AppEvent").as("event"))
+  .where("event.name == \"alice\"");
 
 output.printSchema()
 // root
@@ -221,19 +259,28 @@ output = output.select(
    "org.sparkproject.spark_protobuf.protobuf.AppEvent").as("event"));
 
 StreamingQuery query = output
-.writeStream()
-.format("kafka")
-.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
-.option("topic", "topic2")
-.start();
+  .writeStream()
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("topic", "topic2")
+  .start();
+
+{% endhighlight %}
+
+<div class="d-none">
 ```
+</div>
+</div>
+
+</div>
 
 ## Supported types for Protobuf -> Spark SQL conversion
+
 Currently Spark supports reading [protobuf scalar types](https://developers.google.com/protocol-buffers/docs/proto3#scalar), [enum types](https://developers.google.com/protocol-buffers/docs/proto3#enum), [nested type](https://developers.google.com/protocol-buffers/docs/proto3#nested), and [maps type](https://developers.google.com/protocol-buffers/docs/proto3#maps) under messages of Protobuf.
 
 In addition to these types, `spark-protobuf` also introduces support for Protobuf `OneOf` fields, which allow you to handle messages that can have multiple possible sets of fields, but only one set can be present at a time. This is useful for situations where the data you are working with is not always in the same format, and you need to be able to handle messages with different sets of fields without encountering errors.
 
-<table class="table">
-  <tr><th><b>Protobuf type</b></th><th><b>Spark SQL type</b></th></tr>
+<table class="table table-striped">
+  <thead><tr><th><b>Protobuf type</b></th><th><b>Spark SQL type</b></th></tr></thead>
   <tr>
     <td>boolean</td>
     <td>BooleanType</td>
   </tr>
@@ -282,16 +329,12 @@ In addition to the these types, `spark-protobuf` also introduces support for Pro
   <tr>
     <td>OneOf</td>
     <td>Struct</td>
   </tr>
-  <tr>
-    <td>Any</td>
-    <td>StructType</td>
-  </tr>
 </table>
 
 It also supports reading the following Protobuf types: [Timestamp](https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#timestamp) and [Duration](https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#duration)
 
-<table class="table">
-  <tr><th><b>Protobuf logical type</b></th><th><b>Protobuf schema</b></th><th><b>Spark SQL type</b></th></tr>
+<table class="table table-striped">
+  <thead><tr><th><b>Protobuf logical type</b></th><th><b>Protobuf schema</b></th><th><b>Spark SQL type</b></th></tr></thead>
   <tr>
     <td>duration</td>
     <td>MessageType{seconds: Long, nanos: Int}</td>
@@ -305,10 +348,11 @@ It also supports reading the following Protobuf types [Timestamp](https://develo
 </table>
 
 ## Supported types for Spark SQL -> Protobuf conversion
+
 Spark supports the writing of all Spark SQL types into Protobuf. For most types, the mapping from Spark types to Protobuf types is straightforward (e.g. IntegerType gets converted to int).
 
-<table class="table">
-  <tr><th><b>Spark SQL type</b></th><th><b>Protobuf type</b></th></tr>
+<table class="table table-striped">
+  <thead><tr><th><b>Spark SQL type</b></th><th><b>Protobuf type</b></th></tr></thead>
   <tr>
     <td>BooleanType</td>
     <td>boolean</td>
   </tr>
@@ -356,15 +400,23 @@ Spark supports the writing of all Spark SQL types into Protobuf. For most types,
 </table>
 
 ## Handling circular references protobuf fields
+
 One common issue that can arise when working with Protobuf data is the presence of circular references. In Protobuf, a circular reference occurs when a field refers back to itself or to another field that refers back to the original field. This can cause issues when parsing the data, as it can result in infinite loops or other unexpected behavior.
 
-To address this issue, the latest version of spark-protobuf introduces a new feature: the ability to check for circular references through field types. This allows users use the `recursive.fields.max.depth` option to specify the maximum number of levels of recursion to allow when parsing the schema. By default, `spark-protobuf` will not permit recursive fields by setting `recursive.fields.max.depth` to -1. However, you can set this option to 0 to 10 if needed.
+To address this issue, the latest version of spark-protobuf introduces a new feature: the ability to check for circular references through field types. This allows users to use the `recursive.fields.max.depth` option to specify the maximum number of levels of recursion to allow when parsing the schema. By default, `spark-protobuf` will not permit recursive fields by setting `recursive.fields.max.depth` to -1. However, you can set this option to 0 to 10 if needed.
Setting `recursive.fields.max.depth` to 0 drops all recursive fields, setting it to 1 allows it to be recursed once, and setting it to 2 allows it to be recursed twice. A `recursive.fields.max.depth` value greater than 10 is not allowed, as it can lead to performance issues and even stack overflows.
 
 SQL schema for the protobuf message below will vary based on the value of `recursive.fields.max.depth`.
 
-```proto
-syntax = "proto3"
+<div data-lang="proto" markdown="1">
+<div class="d-none">
+This div is only used to make markdown editor/viewer happy and does not display on web
+
+```protobuf
+</div>
+
+{% highlight protobuf %}
+syntax = "proto3";
 message Person {
   string name = 1;
   Person bff = 2
@@ -376,4 +428,9 @@ message Person {
 0: struct<name: string, bff: null>
 1: struct<name: string, bff: struct<name: string, bff: null>>
 2: struct<name: string, bff: struct<name: string, bff: struct<name: string, bff: null>>>
 ...
-```
\ No newline at end of file
+
+{% endhighlight %}
+<div class="d-none">
+```
+</div>
+</div>
\ No newline at end of file

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
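As a footnote to the `recursive.fields.max.depth` discussion in the diff above: the depth-to-schema mapping for the recursive `Person` message can be sketched with a tiny illustrative helper. This is plain Python, not part of `spark-protobuf`; the function name is hypothetical and it only reproduces the schema strings listed in the guide.

```python
def person_schema(max_depth: int) -> str:
    """Sketch of the Spark SQL schema produced for the recursive `Person`
    message as `recursive.fields.max.depth` grows. At the cutoff, the
    recursive `bff` field is dropped (shown here as `null`)."""
    if max_depth <= 0:
        # Depth 0: no recursion allowed, the recursive field is dropped.
        return "struct<name: string, bff: null>"
    # Each extra level expands `bff` one more time.
    return f"struct<name: string, bff: {person_schema(max_depth - 1)}>"

for depth in range(3):
    print(depth, person_schema(depth))
# 0 struct<name: string, bff: null>
# 1 struct<name: string, bff: struct<name: string, bff: null>>
# 2 struct<name: string, bff: struct<name: string, bff: struct<name: string, bff: null>>>
```

This mirrors the `0:`/`1:`/`2:` schema listing shown in the doc's proto example.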