Hello guys,
I'm using Spark 2.2.0, and from time to time my job fails, printing the
following errors into the log:
scala.MatchError:
profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
The job itself looks like the following and contains a few shuffles and UDAFs:
val df = spark.read.avro(...).as[...]
  .groupBy(...)
  .agg(collect_list(...).as(...))
  .select(explode(...).as(...))
  .groupBy(...)
  .agg(sum(...).as(...))
  .groupBy(...)
  .agg(collectMetrics(...).as(...))
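For context, collectMetrics is a custom UDAF (a Spark 2.2 UserDefinedAggregateFunction). A minimal sketch of its shape, just to show where the per-row processing happens, looks roughly like this (the schemas, column names and buffer layout here are simplified placeholders, not the real implementation):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical shape only: schemas, names and buffer layout are assumptions,
// not the real collectMetrics implementation.
class CollectMetricsSketch extends UserDefinedAggregateFunction {

  override def inputSchema: StructType =
    StructType(StructField("key", StringType) :: StructField("value", LongType) :: Nil)

  override def bufferSchema: StructType =
    StructType(StructField("metrics", MapType(StringType, LongType)) :: Nil)

  override def dataType: DataType = MapType(StringType, LongType)

  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Long]

  // Runs once per input row on the executors; the key match shown in the next
  // snippet runs here, so a corrupted key string surfaces as a scala.MatchError
  // at this point.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val key     = input.getString(0)
    val value   = input.getLong(1)
    val metrics = buffer.getMap[String, Long](0).toMap
    buffer(0) = metrics + (key -> (metrics.getOrElse(key, 0L) + value))
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val left  = buffer1.getMap[String, Long](0).toMap
    val right = buffer2.getMap[String, Long](0).toMap
    buffer1(0) = right.foldLeft(left) { case (acc, (k, v)) =>
      acc + (k -> (acc.getOrElse(k, 0L) + v))
    }
  }

  override def evaluate(buffer: Row): Any = buffer.getMap[String, Long](0)
}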
The errors occur in the collectMetrics UDAF, in the following snippet:
key match {
  case "profiles.total" => updateMetrics(...)
  case "profiles.biz" => updateMetrics(...)
  case ProfileAttrsRegex(...) => updateMetrics(...)
}
I'm absolutely fine with the scala.MatchError itself, because there is no
catch-all case in the pattern-matching expression, but the strings
containing corrupted characters look very strange.
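One thing that might help narrow it down is a temporary catch-all case that renders the code points of the unexpected key before failing, to confirm whether the corruption is already present in the String that reaches the match. Just a rough sketch: the regex and the returned labels below are placeholders, and the real cases of course call updateMetrics.

// Placeholder regex and labels, for illustration only.
val ProfileAttrsRegex = """profiles\.(\d+)\.(\d+)""".r

// Renders each character as U+XXXX so corrupted bytes are visible in the executor log.
def describeKey(key: String): String =
  key.map(c => f"U+${c.toInt}%04X").mkString(" ")

def bucketFor(key: String): String = key match {
  case "profiles.total"        => "total"
  case "profiles.biz"          => "biz"
  case ProfileAttrsRegex(a, b) => s"$a.$b"
  case other =>
    // Temporary catch-all: fail with the rendered key instead of a bare scala.MatchError.
    throw new IllegalArgumentException(
      s"Unexpected key '$other' (code points: ${describeKey(other)})")
}

That would at least show the exact characters the UDAF sees, instead of the mangled rendering in the log.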
I've found the following JIRA issues, but it's hard to say whether they
are related to my case:
- https://issues.apache.org/jira/browse/SPARK-22092
- https://issues.apache.org/jira/browse/SPARK-23512
So I'm wondering: has anybody ever seen this kind of behaviour, and what
could the problem be?