Spark Dataframe validating column names

Scott W Mon, 04 Jul 2016 22:04:10 -0700

Hello,

I'm processing events using DataFrames converted from a stream of JSON
events (Spark Streaming), which eventually get written out in Parquet
format. Different kinds of JSON events come in, so we use the schema
inference feature of Spark SQL.


The problem is that some of the JSON events contain spaces in their keys. I
want to log such events and filter/drop them from the DataFrame before
converting it to Parquet, because the space and ,;{}()\n\t= are considered
special characters in a Parquet schema (CatalystSchemaConverter), as listed
in [1] below, and thus are not allowed in column names.

How can I validate the column names of a DataFrame and drop such events
altogether without failing the Spark Streaming job?

[1] Spark's CatalystSchemaConverter

def checkFieldName(name: String): Unit = {
    // ,;{}()\n\t= and space are special characters in Parquet schema
    checkConversionRequirement(
      !name.matches(".*[ ,;{}()\n\t=].*"),
      s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
         |Please use alias to rename it.
       """.stripMargin.split("\n").mkString(" ").trim)
  }
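
To make the kind of validation I have in mind concrete, here is a minimal
sketch that applies the same character class as checkFieldName above but
returns a result instead of throwing. The names ColumnNameCheck, isValid and
partitionNames are hypothetical and not part of Spark; the sketch works on
plain strings, so it only assumes the names have already been pulled out of
the schema (for example via df.schema.fieldNames).

```scala
// Hypothetical helper, not part of Spark: reuses the character class that
// CatalystSchemaConverter.checkFieldName rejects, without throwing.
object ColumnNameCheck {
  // Space , ; { } ( ) newline tab = are special in a Parquet schema
  private val Invalid = "[ ,;{}()\n\t=]".r

  /** True if `name` contains none of the special characters. */
  def isValid(name: String): Boolean =
    Invalid.findFirstIn(name).isEmpty

  /** Splits names into (valid, invalid) so the invalid ones can be logged. */
  def partitionNames(names: Seq[String]): (Seq[String], Seq[String]) =
    names.partition(isValid)
}

// Assumed usage against a DataFrame `df` (top-level columns only):
//   val (ok, bad) = ColumnNameCheck.partitionNames(df.schema.fieldNames)
//   if (bad.nonEmpty) println(s"Invalid column names: ${bad.mkString(", ")}")
```

This only looks at top-level column names; keys nested inside struct fields
would need a recursive walk over the schema. It also only identifies the
offending names and does not by itself say which events introduced them.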
