This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new c6698ccebf0 [SPARK-39001][SQL][DOCS] Document which options are unsupported in CSV and JSON functions
c6698ccebf0 is described below

commit c6698ccebf0ac6f5b70c6a4e673fcd49388943e6
Author: Hyukjin Kwon <gurwls...@apache.org>
AuthorDate: Mon Apr 25 20:25:56 2022 +0300

    [SPARK-39001][SQL][DOCS] Document which options are unsupported in CSV and JSON functions
    
    This PR proposes to document which options do not work and are explicitly unsupported in CSV and JSON functions.
    
    To prevent users from misunderstanding the options.
    
    Yes, it documents which options don't work in CSV/JSON expressions.
    
    I manually built the docs and checked the HTML output.
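
    For illustration only (editor's sketch, not part of this commit): a minimal Scala example of the behavior being documented, assuming Spark 3.3 with a SparkSession named `spark` in scope. The CSV built-in function from_csv ignores options such as `header`, which the CSV data source would honor.

        // Assumption: running in spark-shell or with a SparkSession named `spark`.
        import org.apache.spark.sql.functions.from_csv
        import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
        import spark.implicits._

        val schema = StructType(Seq(
          StructField("id", IntegerType),
          StructField("name", StringType)))

        // "header" is documented as ignored by CSV built-in functions, so the
        // first row here is parsed as data rather than skipped as a header.
        Seq("id,name", "1,spark").toDF("value")
          .select(from_csv($"value", schema, Map("header" -> "true")).as("parsed"))
          .show(truncate = false)

        // By contrast, the CSV data source honors the same option on read:
        //   spark.read.option("header", "true").csv(path)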
    
    Closes #36339 from HyukjinKwon/SPARK-39001.
    
    Authored-by: Hyukjin Kwon <gurwls...@apache.org>
    Signed-off-by: Max Gekk <max.g...@gmail.com>
    (cherry picked from commit 10a643c8af368cce131ef217f6ef610bf84f8b9c)
    Signed-off-by: Max Gekk <max.g...@gmail.com>
---
 docs/sql-data-sources-csv.md  | 18 +++++++++---------
 docs/sql-data-sources-json.md | 16 ++++++++--------
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/docs/sql-data-sources-csv.md b/docs/sql-data-sources-csv.md
index 1dfe8568f9a..1be1d7446e8 100644
--- a/docs/sql-data-sources-csv.md
+++ b/docs/sql-data-sources-csv.md
@@ -63,7 +63,7 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>encoding</code></td>
     <td>UTF-8</td>
-    <td>For reading, decodes the CSV files by the given encoding type. For writing, specifies encoding (charset) of saved CSV files</td>
+    <td>For reading, decodes the CSV files by the given encoding type. For writing, specifies encoding (charset) of saved CSV files. CSV built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
   <tr>
@@ -99,19 +99,19 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>header</code></td>
     <td>false</td>
-    <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is a RDD of Strings, this header option will remove all lines same with the header if exists.</td>
+    <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is a RDD of Strings, this header option will remove all lines same with the header if exists. CSV built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
   <tr>
     <td><code>inferSchema</code></td>
     <td>false</td>
-    <td>Infers the input schema automatically from data. It requires one extra pass over the data.</td>
+    <td>Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
     <td><code>enforceSchema</code></td>
     <td>true</td>
-    <td>If it is set to <code>true</code>, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to <code>false</code>, the schema will be validated against all headers in CSV files in the case when the <code>header</code> option is set to <code>true</code>. Field names in the schema and column names in CSV headers are checked by their positions taking into account <code>spark.sql.caseSensitive</code> [...]
+    <td>If it is set to <code>true</code>, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to <code>false</code>, the schema will be validated against all headers in CSV files in the case when the <code>header</code> option is set to <code>true</code>. Field names in the schema and column names in CSV headers are checked by their positions taking into account <code>spark.sql.caseSensitive</code> [...]
     <td>read</td>
   </tr>
   <tr>
@@ -186,7 +186,7 @@ Data source options of CSV can be set via:
    <td>Allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes. Note that Spark tries to parse only required columns in CSV under column pruning. Therefore, corrupt records can be different based on required set of fields. This behavior can be controlled by <code>spark.sql.csv.parser.columnPruning.enabled</code> (enabled by default).<br>
     <ul>
      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, an user can set a string type field named <code>columnNameOfCorruptRecord</code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with less/more tokens than schema is not a corrupted record to [...]
-      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records.</li>
+      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records. This mode is unsupported in the CSV built-in functions.</li>
      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
     </ul>
     </td>
@@ -201,7 +201,7 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>multiLine</code></td>
     <td>false</td>
-    <td>Parse one record, which may span multiple lines, per file.</td>
+    <td>Parse one record, which may span multiple lines, per file. CSV built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
@@ -213,7 +213,7 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>samplingRatio</code></td>
     <td>1.0</td>
-    <td>Defines fraction of rows used for schema inferring.</td>
+    <td>Defines fraction of rows used for schema inferring. CSV built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
@@ -231,7 +231,7 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>lineSep</code></td>
    <td><code>\r</code>, <code>\r\n</code> and <code>\n</code> (for reading), <code>\n</code> (for writing)</td>
-    <td>Defines the line separator that should be used for parsing/writing. Maximum length is 1 character.</td>
+    <td>Defines the line separator that should be used for parsing/writing. Maximum length is 1 character. CSV built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
   <tr>
@@ -251,7 +251,7 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>compression</code></td>
     <td>(none)</td>
-    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (<code>none</code>, <code>bzip2</code>, <code>gzip</code>, <code>lz4</code>, <code>snappy</code> and <code>deflate</code>).</td>
+    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (<code>none</code>, <code>bzip2</code>, <code>gzip</code>, <code>lz4</code>, <code>snappy</code> and <code>deflate</code>). CSV built-in functions ignore this option.</td>
     <td>write</td>
   </tr>
 </table>
diff --git a/docs/sql-data-sources-json.md b/docs/sql-data-sources-json.md
index b5f27aacf41..8128e779ace 100644
--- a/docs/sql-data-sources-json.md
+++ b/docs/sql-data-sources-json.md
@@ -127,13 +127,13 @@ Data source options of JSON can be set via:
   <tr>
     <td><code>primitivesAsString</code></td>
     <td><code>false</code></td>
-    <td>Infers all primitive values as a string type.</td>
+    <td>Infers all primitive values as a string type. JSON built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
     <td><code>prefersDecimal</code></td>
     <td><code>false</code></td>
-    <td>Infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles.</td>
+    <td>Infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles. JSON built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
@@ -172,7 +172,7 @@ Data source options of JSON can be set via:
     <td>Allows a mode for dealing with corrupt records during parsing.<br>
     <ul>
      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, an user can set a string type field named <code>columnNameOfCorruptRecord</code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorrupt [...]
-      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records.</li>
+      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records. This mode is unsupported in the JSON built-in functions.</li>
      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
     </ul>
     </td>
@@ -205,7 +205,7 @@ Data source options of JSON can be set via:
   <tr>
     <td><code>multiLine</code></td>
     <td><code>false</code></td>
-    <td>Parse one record, which may span multiple lines, per file.</td>
+    <td>Parse one record, which may span multiple lines, per file. JSON built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
@@ -217,13 +217,13 @@ Data source options of JSON can be set via:
   <tr>
     <td><code>encoding</code></td>
    <td>Detected automatically when <code>multiLine</code> is set to <code>true</code> (for reading), <code>UTF-8</code> (for writing)</td>
-    <td>For reading, allows to forcibly set one of standard basic or extended encoding for the JSON files. For example UTF-16BE, UTF-32LE. For writing, Specifies encoding (charset) of saved json files.</td>
+    <td>For reading, allows to forcibly set one of standard basic or extended encoding for the JSON files. For example UTF-16BE, UTF-32LE. For writing, Specifies encoding (charset) of saved json files. JSON built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
   <tr>
     <td><code>lineSep</code></td>
    <td><code>\r</code>, <code>\r\n</code>, <code>\n</code> (for reading), <code>\n</code> (for writing)</td>
-    <td>Defines the line separator that should be used for parsing.</td>
+    <td>Defines the line separator that should be used for parsing. JSON built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
   <tr>
@@ -235,7 +235,7 @@ Data source options of JSON can be set via:
   <tr>
     <td><code>dropFieldIfAllNull</code></td>
     <td><code>false</code></td>
-    <td>Whether to ignore column of all null values or empty array/struct during schema inference.</td>
+    <td>Whether to ignore column of all null values or empty array/struct during schema inference. JSON built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
@@ -259,7 +259,7 @@ Data source options of JSON can be set via:
   <tr>
     <td><code>compression</code></td>
     <td>(none)</td>
-    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).</td>
+    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate). JSON built-in functions ignore this option.</td>
     <td>write</td>
   </tr>
   <tr>

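For illustration only (editor's sketch, not from the commit): the documentation above marks DROPMALFORMED as unsupported in the JSON built-in functions, so passing it to from_json is expected to fail rather than silently drop malformed records; only PERMISSIVE and FAILFAST apply. A minimal Scala sketch, assuming Spark 3.3 and a SparkSession named `spark`:

    // Assumption: spark-shell or an existing SparkSession named `spark`.
    import org.apache.spark.sql.functions.from_json
    import spark.implicits._

    val jsonDf = Seq("""{"id": 1}""", """not json""").toDF("value")

    // PERMISSIVE (the default) keeps the malformed record as a null row.
    jsonDf.select(from_json($"value", "id INT", Map.empty[String, String]).as("parsed")).show()

    // DROPMALFORMED is documented above as unsupported in JSON built-in
    // functions, so this is expected to raise an error instead of dropping the row.
    jsonDf.select(from_json($"value", "id INT", Map("mode" -> "DROPMALFORMED")).as("parsed")).show()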
