You could open a JIRA to add a version of from_json that supports schema
inference, but unfortunately that would not be easy to implement.  In
particular, it would introduce an odd special case where only this specific
function blocks for a long time while we infer the schema (instead of
waiting for an action).  That blocking would be surprising in a call like
df.select(...).  If there is enough interest, though, we should still do it.

To give a little more detail, your version of the code is actually doing
two passes over the data: one to infer the schema and a second for whatever
processing you ask it to do.  We have to know the schema at each step of
DataFrame construction, so we'd have to run the inference pass even before
you call an action.
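
To make that concrete, here's a minimal sketch of where the two passes
happen (reusing the rdd from your snippet below; the field name is
hypothetical):

import org.apache.spark.sql._
val rdd = df2.rdd.map { case Row(j: String) => j }

// Pass 1: with no schema given, this line alone scans the data to infer
// the schema -- no action has been called yet.
val inferred = spark.read.json(rdd)

// Pass 2: the processing you actually asked for.
inferred.select("someField").show()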

Personally, I usually take a small sample of data and run schema inference
on that.  I then hardcode the resulting schema into my program.  This makes
the Spark jobs much faster and removes the possibility of the schema
changing under the covers.
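
For example, something like this (a rough sketch; the sampling fraction
and the field names are made up):

import org.apache.spark.sql.types._

// One-off: infer the schema from a small sample and print it, so it can
// be pasted into the program as a literal.
val sampled = spark.read.json(rdd.sample(false, 0.001))
println(sampled.schema.prettyJson)

// In the real job: pass the hardcoded schema so no inference pass runs.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("payload", StringType, nullable = true)))
val df = spark.read.schema(schema).json(rdd)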

Here's some code I use to build the static schema code automatically:
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1128172975083446/2840265927289860/latest.html>
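
If you'd rather not paste the struct by hand, one simple variant (a sketch
of the idea, not the notebook's exact code) is to serialize the inferred
schema to JSON once and rebuild it at startup:

import org.apache.spark.sql.types.{DataType, StructType}

// Capture once: save the inferred schema as a JSON string somewhere.
val schemaJson: String = sampled.schema.json

// At startup: rebuild the StructType from the saved string.
val restored = DataType.fromJson(schemaJson).asInstanceOf[StructType]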

Would that work for you? If not, why not?

On Wed, Nov 23, 2016 at 2:48 AM, kant kodali <kanth...@gmail.com> wrote:

> Hi Michael,
>
> Looks like all of the from_json functions will require me to pass a
> schema, and that can be a little tricky for us, but the code below doesn't
> require me to pass a schema at all.
>
> import org.apache.spark.sql._
> val rdd = df2.rdd.map { case Row(j: String) => j }
> spark.read.json(rdd).show()
>
>
> On Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> The first release candidate should be coming out this week. You can
>> subscribe to the dev list if you want to follow the release schedule.
>>
>> On Mon, Nov 21, 2016 at 9:34 PM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> Hi Michael,
>>>
>>> I only see Spark 2.0.2, which is what I am using currently. Any idea on
>>> when 2.1 will be released?
>>>
>>> Thanks,
>>> kant
>>>
>>> On Mon, Nov 21, 2016 at 5:12 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> In Spark 2.1 we've added a from_json
>>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2902>
>>>> function that I think will do what you want.
>>>>
>>>> On Fri, Nov 18, 2016 at 2:29 AM, kant kodali <kanth...@gmail.com>
>>>> wrote:
>>>>
>>>>> This seems to work
>>>>>
>>>>> import org.apache.spark.sql._
>>>>> val rdd = df2.rdd.map { case Row(j: String) => j }
>>>>> spark.read.json(rdd).show()
>>>>>
>>>>> However, I wonder if there is any inefficiency here, since I have to
>>>>> apply this function to billions of rows.
>>>>>
>>>>>
>>>>
>>>
>>
>
