Herman van Hövell created SPARK-42690:
-----------------------------------------

             Summary: Implement CSV/JSON parsing functions
                 Key: SPARK-42690
                 URL: https://issues.apache.org/jira/browse/SPARK-42690
             Project: Spark
          Issue Type: New Feature
          Components: Connect
    Affects Versions: 3.4.0
            Reporter: Herman van Hövell


Implement the following two methods in DataFrameReader:

{code:java}
/**
* Loads a `Dataset[String]` storing JSON objects (<a href="http://jsonlines.org/">JSON Lines
* text format or newline-delimited JSON</a>) and returns the result as a `DataFrame`.
*
* Unless the schema is specified using `schema` function, this function goes through the
* input once to determine the input schema.
*
* @param jsonDataset input Dataset with one JSON object per record
* @since 3.4.0
*/
def json(jsonDataset: Dataset[String]): DataFrame
/**
* Loads a `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`.
*
* If the schema is not specified using `schema` function and `inferSchema` option is enabled,
* this function goes through the input once to determine the input schema.
*
* If the schema is not specified using `schema` function and `inferSchema` option is disabled,
* it treats all columns as string types and reads only the first line to determine the
* names and the number of fields.
*
* If `enforceSchema` is set to `false`, only the CSV header in the first line is checked
* for conformance to the specified or inferred schema.
*
* @note if `header` option is set to `true` when calling this API, all lines identical to
* the header are removed, if any exist.
*
* @param csvDataset input Dataset with one CSV row per record
* @since 3.4.0
*/
def csv(csvDataset: Dataset[String]): DataFrame
{code}
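
To illustrate the CSV semantics documented above when `inferSchema` is disabled, here is a minimal plain-Scala sketch. It is a model of the described behaviour, not Spark's implementation: `CsvHeaderSketch`, `columnNames` and `dataRows` are hypothetical names, fields are assumed to be comma-separated without quoting or escaping, and the `_c0`, `_c1`, ... names used when `header` is off reflect my understanding of Spark's defaults.

```scala
// Plain-Scala model of the documented CSV behaviour; not Spark's implementation.
// Assumption: comma-separated fields, no quoting or escaping.
object CsvHeaderSketch {

  // Only the first line is read to determine the names and the number of fields.
  // Every column is treated as a string, so no further pass over the data is needed.
  def columnNames(lines: Seq[String], header: Boolean): Seq[String] = {
    val first = lines.headOption.map(_.split(",", -1).toSeq).getOrElse(Seq.empty)
    if (header) first                          // names taken from the header line
    else first.indices.map(i => s"_c$i")       // assumed default names: _c0, _c1, ...
  }

  // With header = true, every line identical to the header is dropped,
  // not only the first one (see the @note in the Scaladoc).
  def dataRows(lines: Seq[String], header: Boolean): Seq[Seq[String]] = {
    val kept = lines.headOption match {
      case Some(h) if header => lines.filterNot(_ == h)
      case _                 => lines
    }
    kept.map(_.split(",", -1).toSeq)
  }
}
```

The second function shows why the `@note` matters for a `Dataset[String]`: the input may be a concatenation of several files, so the header line can occur more than once.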
 

For this we need a new message. We cannot use Project because we don't know the schema upfront.

 
{code:java}
message Parse {
  // (Required) Input relation to Parse. The input is expected to have a single text column.
  Relation input = 1;
  // (Required) The expected format of the text.
  ParseFormat format = 2;
  enum ParseFormat {
    PARSE_FORMAT_UNSPECIFIED = 0;
    PARSE_FORMAT_CSV = 1;
    PARSE_FORMAT_JSON = 2;
  }
}
{code}
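
As a rough illustration of how a client might model this message before it is serialized, here is a small plain-Scala sketch. It does not use the generated protobuf classes: `ParseSketch`, `LocalText` and `validate` are hypothetical names introduced only for this example, and the validation rule is an assumption derived from the "(Required)" comments in the message.

```scala
// Plain-Scala mirror of the proposed Parse message; not the generated protobuf API.
object ParseSketch {

  // Mirrors the ParseFormat enum in the message.
  sealed trait ParseFormat
  case object Unspecified extends ParseFormat
  case object Csv extends ParseFormat
  case object Json extends ParseFormat

  sealed trait Relation

  // Hypothetical stand-in for any relation that yields a single text column.
  final case class LocalText(lines: Seq[String]) extends Relation

  // Mirrors `message Parse`: an input relation plus the expected text format.
  final case class Parse(input: Relation, format: ParseFormat) extends Relation

  // Assumption: since `format` is marked (Required), an unset format is rejected
  // before the plan is sent to the server.
  def validate(p: Parse): Either[String, Parse] =
    if (p.format == Unspecified) Left("Parse.format must be set to CSV or JSON")
    else Right(p)
}
```

Note that `Parse` carries no column list or schema, which matches the point above: the output columns are only known once the input has been parsed.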
 

 


