[ https://issues.apache.org/jira/browse/SPARK-26381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724750#comment-16724750 ]
Hyukjin Kwon commented on SPARK-26381:
--------------------------------------

[~ryan.clancy], please provide the code to reproduce this.

> Pickle Serialization Error Causing Crash
> ----------------------------------------
>
>                 Key: SPARK-26381
>                 URL: https://issues.apache.org/jira/browse/SPARK-26381
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1, 2.4.0
>         Environment: Tested on two environments:
>  * Spark 2.4.0 - single machine only
>  * Spark 2.3.1 - YARN installation with 5 nodes and files on HDFS
> The error occurs in both environments.
>            Reporter: Ryan
>            Priority: Major
>
> There is a pickle serialization error when I try to use AllenNLP for NER
> within a Spark worker, and it causes a crash. When running on just the
> Spark driver or in a standalone program, everything works as expected.
>
> {code:java}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/data/disk12/yarn/local/usercache/raclancy/appcache/application_1543437939000_1040/container_1543437939000_1040_01_000002/pyspark.zip/pyspark/worker.py", line 217, in main
>     func, profiler, deserializer, serializer = read_command(pickleSer, infile)
>   File "/data/disk12/yarn/local/usercache/raclancy/appcache/application_1543437939000_1040/container_1543437939000_1040_01_000002/pyspark.zip/pyspark/worker.py", line 61, in read_command
>     command = serializer.loads(command.value)
>   File "/data/disk12/yarn/local/usercache/raclancy/appcache/application_1543437939000_1040/container_1543437939000_1040_01_000002/pyspark.zip/pyspark/serializers.py", line 559, in loads
>     return pickle.loads(obj, encoding=encoding)
> TypeError: __init__() missing 3 required positional arguments: 'non_padded_namespaces', 'padding_token', and 'oov_token'
>     at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
>     at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
>     at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
>     at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
>     at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>     at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>     at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>     at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>     at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
>     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>     at org.apache.spark.scheduler.Task.run(Task.scala:109)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     ... 1 more
> {code}
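
Pending a reproduction from the reporter, here is a minimal sketch of one way a TypeError of this shape can come out of pickle.loads without Spark or AllenNLP involved. It assumes, and this is only an assumption about the cause, that the object being unpickled is a defaultdict subclass whose __init__ takes required arguments. TokenIndex and PicklableTokenIndex are hypothetical names made up for illustration; only the three parameter names are taken from the traceback above.

```python
import pickle
from collections import defaultdict


class TokenIndex(defaultdict):
    """Hypothetical defaultdict subclass with required constructor arguments."""

    def __init__(self, non_padded_namespaces, padding_token, oov_token):
        # default_factory stays None, so defaultdict.__reduce__ records an
        # empty argument tuple and unpickling calls TokenIndex() with no args.
        super().__init__()
        self.non_padded_namespaces = non_padded_namespaces
        self.padding_token = padding_token
        self.oov_token = oov_token


class PicklableTokenIndex(TokenIndex):
    """Possible workaround: tell pickle which constructor arguments to replay."""

    def __reduce__(self):
        args = (self.non_padded_namespaces, self.padding_token, self.oov_token)
        # (callable, args, state, list items, dict items)
        return (type(self), args, None, None, iter(self.items()))


def try_round_trip(obj):
    """Return the TypeError message raised while unpickling, or None on success."""
    payload = pickle.dumps(obj)
    try:
        pickle.loads(payload)
    except TypeError as exc:
        return str(exc)
    return None


broken = TokenIndex(("*tags",), "@@PADDING@@", "@@UNKNOWN@@")
error = try_round_trip(broken)
print(error)

fixed = PicklableTokenIndex(("*tags",), "@@PADDING@@", "@@UNKNOWN@@")
fixed["the"] = 0
print(try_round_trip(fixed))
```

In this sketch pickling succeeds and only unpickling fails, so the error would surface wherever the bytes are loaded. That would be consistent with the failure appearing only on workers, where read_command unpickles the shipped function and whatever its closure captured. Whether this is what happens in the reported job still needs the reporter's code to confirm.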