Hello, I'm processing events using Dataframes converted from a stream of JSON events (Spark streaming) which eventually gets written out as as Parquet format. There are different JSON events coming in so we use schema inference feature of Spark SQL
The problem is some of the JSON events contains spaces in the keys which I want to log and filter/drop such events from the data frame before converting it to Parquet because ,;{}()\n\t= are considered special characters in Parquet schema (CatalystSchemaConverter) as listed in [1] below and thus should not be allowed in the column names. How can I do such validations in Dataframe on the column names and drop such an event altogether without erroring out the Spark Streaming job? [1] Spark's CatalystSchemaConverter def checkFieldName(name: String): Unit = { // ,;{}()\n\t= and space are special characters in Parquet schema checkConversionRequirement( !name.matches(".*[ ,;{}()\n\t=].*"), s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=". |Please use alias to rename it. """.stripMargin.split("\n").mkString(" ").trim) }