Hi All,

I checked out and built master.  Note that Maven had a problem building
Kafka (in my case, at least); I was unable to fix this easily, so I moved on
since it seemed unlikely to affect the problem at hand.

Master improves matters (including the example Nicholas just demonstrated),
but unfortunately there still seems to be a bug related to using
dictionaries as values.  I've put some code below to illustrate it.

*# dictionary as value works fine*
> print sqlCtx.jsonRDD(sc.parallelize(['{"key0": {"key1": "value"}}'])).collect()
[Row(key0=Row(key1=u'value'))]

*# dictionary as value works fine, even when inner keys are varied*
> print sqlCtx.jsonRDD(sc.parallelize(['{"key0": {"key1": "value1"}}', '{"key0": {"key2": "value2"}}'])).collect()
[Row(key0=Row(key1=u'value1', key2=None)), Row(key0=Row(key1=None, key2=u'value2'))]

*# dictionary as value works fine when inner keys are missing and outer key is present*
> print sqlCtx.jsonRDD(sc.parallelize(['{"key0": {}}', '{"key0": {"key1": "value1"}}'])).collect()
[Row(key0=Row(key1=None)), Row(key0=Row(key1=u'value1'))]

*# dictionary as value FAILS when outer key is missing*
*> print sqlCtx.jsonRDD(sc.parallelize(['{}', '{"key0": {"key1": "value1"}}'])).collect()*
Py4JJavaError: An error occurred while calling o84.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in stage 7.0 failed 4 times, most recent failure: Lost task 14.3 in stage 7.0 (TID 242, engelland.research.intel-research.net): java.lang.NullPointerException...

*# dictionary as value FAILS when outer key is present with null value*
*> print sqlCtx.jsonRDD(sc.parallelize(['{"key0": null}', '{"key0": {"key1": "value1"}}'])).collect()*
Py4JJavaError: An error occurred while calling o98.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in stage 9.0 failed 4 times, most recent failure: Lost task 14.3 in stage 9.0 (TID 305, kunitz.research.intel-research.net): java.lang.NullPointerException...

*# nested lists work even when outer key is missing*
> print sqlCtx.jsonRDD(sc.parallelize(['{}', '{"key0": [["item0", "item1"], ["item2", "item3"]]}'])).collect()
[Row(key0=None), Row(key0=[[u'item0', u'item1'], [u'item2', u'item3']])]
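
In case it makes replication easier, here's a rough standalone script that covers one passing case and the failing case.  This is just a sketch: the file name, app name, and local master URL are placeholders I made up, and I'm assuming a stock pyspark built from master.

# repro_jsonrdd_missing_key.py -- hypothetical file name; run with bin/spark-submit
from pyspark import SparkContext
from pyspark.sql import SQLContext

# "local[2]" and the app name are illustrative placeholders
sc = SparkContext("local[2]", "jsonRDD-missing-key-repro")
sqlCtx = SQLContext(sc)

# passes: the nested dictionary is present in every record
print sqlCtx.jsonRDD(sc.parallelize(
    ['{"key0": {"key1": "value1"}}',
     '{"key0": {"key2": "value2"}}'])).collect()

# fails for me with a NullPointerException: the outer key is missing in one record
print sqlCtx.jsonRDD(sc.parallelize(
    ['{}',
     '{"key0": {"key1": "value1"}}'])).collect()

sc.stop()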

Is anyone able to replicate this behavior?

-Brad




On Tue, Aug 5, 2014 at 6:11 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> We try to keep master very stable, but this is where active development
> happens. YMMV, but a lot of people do run very close to master without
> incident (myself included).
>
> branch-1.0 has been cut for a while and we only merge bug fixes into it
> (this is more strict for non-alpha components like Spark core).  For Spark
> SQL, this branch is pretty far behind as the project is very young and we
> are fixing bugs / adding features very rapidly compared with Spark core.
>
> branch-1.1 was just cut and is being QAed for a release; at this point it's
> likely the same as master, but that will change as features start getting
> added to master in the coming weeks.
>
>
>
> On Tue, Aug 5, 2014 at 5:38 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> collect() works, too.
>>
>> >>> sqlContext.jsonRDD(sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], [4,5,6]]}'])).collect()
>> [Row(foo=[[1, 2, 3], [4, 5, 6]]), Row(foo=[[1, 2, 3], [4, 5, 6]])]
>>
>> Can’t answer your question about branch stability, though. Spark is a
>> very active project, so stuff is happening all the time.
>>
>> Nick
>>
>>
>>
>> On Tue, Aug 5, 2014 at 7:20 PM, Brad Miller <bmill...@eecs.berkeley.edu>
>> wrote:
>>
>>> Hi Nick,
>>>
>>> Can you check that the call to "collect()" works as well as
>>> "printSchema()"?  I actually experience that "printSchema()" works fine,
>>> but then it crashes on "collect()".
>>>
>>> In general, should I expect master (which seems to correspond to branch-1.1)
>>> to be any more or less stable than branch-1.0?  While it would be great to
>>> have this fixed, it would be good to know whether I should expect lots of
>>> other instability.
>>>
>>> best,
>>> -Brad
>>>
>>>
>>> On Tue, Aug 5, 2014 at 4:15 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> This looks to be fixed in master:
>>>>
>>>> >>> from pyspark.sql import SQLContext
>>>> >>> sqlContext = SQLContext(sc)
>>>> >>> sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], [4,5,6]]}'])
>>>> ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:315
>>>> >>> sqlContext.jsonRDD(sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], [4,5,6]]}']))
>>>> MapPartitionsRDD[14] at mapPartitions at SchemaRDD.scala:408
>>>> >>> sqlContext.jsonRDD(sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], [4,5,6]]}'])).printSchema()
>>>> root
>>>>  |-- foo: array (nullable = true)
>>>>  |    |-- element: array (containsNull = false)
>>>>  |    |    |-- element: integer (containsNull = false)
>>>>
>>>> >>>
>>>>
>>>> Nick
>>>>
>>>>
>>>>
>>>> On Tue, Aug 5, 2014 at 7:12 PM, Brad Miller <bmill...@eecs.berkeley.edu
>>>> > wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I've built and deployed the current head of branch-1.0, but it seems
>>>>> to have only partly fixed the bug.
>>>>>
>>>>> This code now runs as expected with the indicated output:
>>>>> > srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":[1,2,3]}', '{"foo":[4,5,6]}']))
>>>>> > srdd.printSchema()
>>>>> root
>>>>>  |-- foo: ArrayType[IntegerType]
>>>>> > srdd.collect()
>>>>> [{u'foo': [1, 2, 3]}, {u'foo': [4, 5, 6]}]
>>>>>
>>>>> This code still crashes:
>>>>> > srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], [4,5,6]]}']))
>>>>> > srdd.printSchema()
>>>>> root
>>>>>  |-- foo: ArrayType[ArrayType(IntegerType)]
>>>>> > srdd.collect()
>>>>> Py4JJavaError: An error occurred while calling o63.collect.
>>>>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>>>>> Task 3.0:29 failed 4 times, most recent failure: Exception failure in TID
>>>>> 67 on host kunitz.research.intel-research.net:
>>>>> net.razorvine.pickle.PickleException: couldn't introspect javabean:
>>>>> java.lang.IllegalArgumentException: wrong number of arguments
>>>>>
>>>>> I may be able to see if this is fixed in master, but since it's not
>>>>> fixed in 1.0.3 it seems unlikely to be fixed in master either. I previously
>>>>> tried master as well, but ran into a build problem that did not occur with
>>>>> the 1.0 branch.
>>>>>
>>>>> Can anybody else verify that the second example still crashes (and is
>>>>> meant to work)? If so, would it be best to update SPARK-2376 or file a new
>>>>> bug?
>>>>> https://issues.apache.org/jira/browse/SPARK-2376
>>>>>
>>>>> best,
>>>>> -Brad
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 5, 2014 at 12:10 PM, Brad Miller <
>>>>> bmill...@eecs.berkeley.edu> wrote:
>>>>>
>>>>>> Nick: Thanks for both the original JIRA bug report and the link.
>>>>>>
>>>>>> Michael: This is on the 1.0.1 release.  I'll update to master and
>>>>>> follow up if I have any problems.
>>>>>>
>>>>>> best,
>>>>>> -Brad
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 5, 2014 at 12:04 PM, Michael Armbrust <
>>>>>> mich...@databricks.com> wrote:
>>>>>>
>>>>>>> Is this on 1.0.1?  I'd suggest running this on master or the 1.1-RC
>>>>>>> which should be coming out this week.  Pyspark did not have good support
>>>>>>> for nested data previously.  If you still encounter issues using a more
>>>>>>> recent version, please file a JIRA.  Thanks!
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 5, 2014 at 11:55 AM, Brad Miller <
>>>>>>> bmill...@eecs.berkeley.edu> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I am interested in using jsonRDD and jsonFile to create a SchemaRDD
>>>>>>>> out of some JSON data I have, but I've run into some instability 
>>>>>>>> involving
>>>>>>>> the following java exception:
>>>>>>>>
>>>>>>>> An error occurred while calling o1326.collect.
>>>>>>>> : org.apache.spark.SparkException: Job aborted due to stage
>>>>>>>> failure: Task 181.0:29 failed 4 times, most recent failure: Exception
>>>>>>>> failure in TID 1664 on host neal.research.intel-research.net:
>>>>>>>> net.razorvine.pickle.PickleException: couldn't introspect javabean:
>>>>>>>> java.lang.IllegalArgumentException: wrong number of arguments
>>>>>>>>
>>>>>>>> I've pasted code which produces the error as well as the full
>>>>>>>> traceback below.  Note that I don't have any problem when I parse the 
>>>>>>>> JSON
>>>>>>>> myself and use inferSchema.
>>>>>>>>
>>>>>>>> Is anybody able to reproduce this bug?
>>>>>>>>
>>>>>>>> -Brad
>>>>>>>>
>>>>>>>> > srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar",
>>>>>>>> "baz":[1,2,3]}', '{"foo":"boom", "baz":[1,2,3]}']))
>>>>>>>> > srdd.printSchema()
>>>>>>>>
>>>>>>>> root
>>>>>>>>  |-- baz: ArrayType[IntegerType]
>>>>>>>>  |-- foo: StringType
>>>>>>>>
>>>>>>>> > srdd.collect()
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------------
>>>>>>>> Py4JJavaError                             Traceback (most recent
>>>>>>>> call last)
>>>>>>>> <ipython-input-89-ec7e8e8c68c4> in <module>()
>>>>>>>> ----> 1 srdd.collect()
>>>>>>>>
>>>>>>>> /home/spark/spark-1.0.1-bin-hadoop1/python/pyspark/rdd.py in
>>>>>>>> collect(self)
>>>>>>>>     581         """
>>>>>>>>     582         with _JavaStackTrace(self.context) as st:
>>>>>>>> --> 583           bytesInJava = self._jrdd.collect().iterator()
>>>>>>>>     584         return
>>>>>>>> list(self._collect_iterator_through_file(bytesInJava))
>>>>>>>>     585
>>>>>>>>
>>>>>>>> /usr/local/lib/python2.7/dist-packages/py4j/java_gateway.pyc in
>>>>>>>> __call__(self, *args)
>>>>>>>>     535         answer = self.gateway_client.send_command(command)
>>>>>>>>     536         return_value = get_return_value(answer,
>>>>>>>> self.gateway_client,
>>>>>>>> --> 537                 self.target_id, self.name)
>>>>>>>>     538
>>>>>>>>     539         for temp_arg in temp_args:
>>>>>>>>
>>>>>>>> /usr/local/lib/python2.7/dist-packages/py4j/protocol.pyc in
>>>>>>>> get_return_value(answer, gateway_client, target_id, name)
>>>>>>>>     298                 raise Py4JJavaError(
>>>>>>>>     299                     'An error occurred while calling
>>>>>>>> {0}{1}{2}.\n'.
>>>>>>>> --> 300                     format(target_id, '.', name), value)
>>>>>>>>     301             else:
>>>>>>>>     302                 raise Py4JError(
>>>>>>>>
>>>>>>>> Py4JJavaError: An error occurred while calling o1326.collect.
>>>>>>>> : org.apache.spark.SparkException: Job aborted due to stage
>>>>>>>> failure: Task 181.0:29 failed 4 times, most recent failure: Exception
>>>>>>>> failure in TID 1664 on host neal.research.intel-research.net:
>>>>>>>> net.razorvine.pickle.PickleException: couldn't introspect javabean:
>>>>>>>> java.lang.IllegalArgumentException: wrong number of arguments
>>>>>>>>         net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
>>>>>>>>         net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
>>>>>>>>         net.razorvine.pickle.Pickler.save(Pickler.java:125)
>>>>>>>>         net.razorvine.pickle.Pickler.put_map(Pickler.java:322)
>>>>>>>>         net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
>>>>>>>>         net.razorvine.pickle.Pickler.save(Pickler.java:125)
>>>>>>>>
>>>>>>>> net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
>>>>>>>>         net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
>>>>>>>>         net.razorvine.pickle.Pickler.save(Pickler.java:125)
>>>>>>>>         net.razorvine.pickle.Pickler.dump(Pickler.java:95)
>>>>>>>>         net.razorvine.pickle.Pickler.dumps(Pickler.java:80)
>>>>>>>>
>>>>>>>> org.apache.spark.sql.SchemaRDD$anonfun$javaToPython$1$anonfun$apply$3.apply(SchemaRDD.scala:385)
>>>>>>>>
>>>>>>>> org.apache.spark.sql.SchemaRDD$anonfun$javaToPython$1$anonfun$apply$3.apply(SchemaRDD.scala:385)
>>>>>>>>         scala.collection.Iterator$anon$11.next(Iterator.scala:328)
>>>>>>>>
>>>>>>>> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:294)
>>>>>>>>
>>>>>>>> org.apache.spark.api.python.PythonRDD$WriterThread$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
>>>>>>>>
>>>>>>>> org.apache.spark.api.python.PythonRDD$WriterThread$anonfun$run$1.apply(PythonRDD.scala:175)
>>>>>>>>
>>>>>>>> org.apache.spark.api.python.PythonRDD$WriterThread$anonfun$run$1.apply(PythonRDD.scala:175)
>>>>>>>>
>>>>>>>> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
>>>>>>>>
>>>>>>>> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
>>>>>>>> Driver stacktrace:
>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.org
>>>>>>>> $apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>>>>>>> at
>>>>>>>> org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>>>>>>> at
>>>>>>>> org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>>>>>>> at
>>>>>>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>>> at
>>>>>>>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>>>>>> at
>>>>>>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>>>>>>> at
>>>>>>>> org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>>>>> at
>>>>>>>> org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>>>>> at scala.Option.foreach(Option.scala:236)
>>>>>>>> at
>>>>>>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>>>>>>> at
>>>>>>>> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>>>>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>>>>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>>>>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>>>>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>>>>>>> at
>>>>>>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>>>>>>> at
>>>>>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>>>> at
>>>>>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>>>> at
>>>>>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>>>> at
>>>>>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
