Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
I've followed up in a thread more directly related to jsonRDD and jsonFile, but it seems like after building from the current master I'm still having some problems with nested dictionaries. http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-jsonRDD-and-jsonFile-in-pyspark-tp11461p115

Re: pyspark inferSchema

2014-08-05 Thread Yin Huai
Yes, 2376 has been fixed in master. Can you give it a try? Also, for inferSchema, because Python is dynamically typed, I agree with Davies that we should provide a way to scan a subset (or all) of the dataset to figure out the proper schema. We will take a look at it. Thanks, Yin On Tue, Aug 5, 2014 at 12
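
The idea of scanning a subset (or all) of the dataset to settle on a schema can be sketched in plain Python. This is not PySpark code: `infer_schema` and its `sample` parameter are hypothetical names used only for illustration, and the sketch simply shows why a scan limited to the first record can miss fields or types that appear later.

```python
# Illustrative sketch only: plain Python, not the PySpark API.
# `infer_schema` and `sample` are hypothetical names.
def infer_schema(records, sample=None):
    """Map each field name to the set of Python type names seen in the scanned records."""
    # Scan only the first `sample` records, or every record when sample is None.
    scanned = records if sample is None else records[:sample]
    schema = {}
    for record in scanned:
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


rows = [
    {"name": "a", "count": 1},
    {"name": "b", "count": 2.5, "tags": ["x"]},
]

first_row_only = infer_schema(rows, sample=1)  # sees only the first record
all_rows = infer_schema(rows)                  # sees every record
```

Here `first_row_only` has no entry for `tags` and records only `int` for `count`, while `all_rows` picks up both the extra field and the second type. A real implementation would also need a rule for reconciling conflicting types, which this sketch leaves out.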

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
Assuming updating to master fixes the bug I was experiencing with jsonRDD and jsonFile, then pushing "sample" to master will probably not be necessary. We believe that the issue linked below is the bug I experienced, and I've been told it is fixed in master. https://issues.apache.org/jira/browse/SPARK-2

Re: pyspark inferSchema

2014-08-05 Thread Davies Liu
The "sample" argument of inferSchema is still not in master; I will try to add it if it makes sense. On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller wrote: > Hi Davies, > > Thanks for the response and tips. Is the "sample" argument to inferSchema > available in the 1.0.1 release of pyspark? I'm no

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
Hi Davies, Thanks for the response and tips. Is the "sample" argument to inferSchema available in the 1.0.1 release of pyspark? I'm not sure (based on the documentation linked below) that it is. http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema It soun

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
Got it. Thanks! On Tue, Aug 5, 2014 at 11:53 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Notice the difference in the schema. Are you running the 1.0.1 release, >> or a more bleeding-edge version from the repository? > > Yep, my bad. I’m running off master at commit > 184048f80

Re: pyspark inferSchema

2014-08-05 Thread Davies Liu
On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas wrote: > I was just about to ask about this. > > Currently, there are two methods, sqlContext.jsonFile() and > sqlContext.jsonRDD(), that work on JSON text and infer a schema that covers > the whole data set. > > For example: > > from pyspark.sql i

Re: pyspark inferSchema

2014-08-05 Thread Nicholas Chammas
> Notice the difference in the schema. Are you running the 1.0.1 release, or a more bleeding-edge version from the repository? Yep, my bad. I’m running off master at commit 184048f80b6fa160c89d5bb47b937a0a89534a95. Nick

Re: pyspark inferSchema

2014-08-05 Thread Brad Miller
Hi Nick, Thanks for the great response. I actually already investigated jsonRDD and jsonFile, although I did not realize they provide more complete schema inference. I did however have other problems with jsonRDD and jsonFile, which I will now describe in a separate thread with an appropriate subj

Re: pyspark inferSchema

2014-08-05 Thread Nicholas Chammas
I was just about to ask about this. Currently, there are two methods, sqlContext.jsonFile() and sqlContext.jsonRDD(), that work on JSON text and infer a schema that covers the whole data set. For example: from pyspark.sql import SQLContext sqlContext = SQLContext(sc) >>> a = sqlContext.jsonRDD(s