If you already have the expected schema, and you know that all numbers will always be formatted as strings in the input JSON, you could probably derive this list automatically.
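For concreteness, a minimal sketch of deriving that list automatically, assuming the expected schema is available as a StructType (the name expectedSchema is illustrative, not from the thread):

import org.apache.spark.sql.types.{NumericType, StructType}

// Collect the names of all top-level fields whose declared type is numeric;
// nested structs would need a recursive walk, omitted here for brevity.
def numericColumns(expectedSchema: StructType): Seq[String] =
  expectedSchema.fields.collect {
    case f if f.dataType.isInstanceOf[NumericType] => f.name
  }.toSeq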
Wouldn't it be simpler to just regex replace the numbers to remove the
quotes? I think this is likely to be a slower and less robust solution. You
would have to make sure that you got all the corner cases right (i.e.,
escaping and whatnot).

On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> I see, so for the connector I need to pass in an array/list of numerical
> columns?
>
> Wouldn't it be simpler to just regex replace the numbers to remove the
> quotes?
>
> Regards
> Sam
>
> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Specifying the schema when parsing JSON will only let you pick between
>> similar datatypes (i.e., should this be a short, long, float, double,
>> etc.). It will not let you perform conversions like string <-> number.
>> This has to be done with explicit casts after the data has been loaded.
>>
>> I think you can make a solution that uses select or withColumn generic.
>> Just load the dataframe with a "parse schema" that treats numbers as
>> strings. Then construct a list of columns that should be numbers and
>> apply the necessary conversions.
>>
>> import org.apache.spark.sql.functions.col
>>
>> var df = spark.read.schema(parseSchema).json("...")
>> numericColumns.foreach { columnName =>
>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>> }
>>
>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com>
>> wrote:
>>
>>> Thanks Michael
>>>
>>> I've been spending the past few days researching this.
>>>
>>> The problem is that the generated JSON has double quotes on fields that
>>> are numbers, because the producing datastore doesn't want to lose
>>> precision.
>>>
>>> I can change the data type, true, but that would be specific to one job
>>> rather than a generic streaming job. I'm writing a structured streaming
>>> connector and I have the schema the generated dataframe should match.
>>>
>>> Unfortunately using withColumn won't help me here, since the solution
>>> needs to be generic.
>>>
>>> To summarise, assume I have the following JSON:
>>>
>>> [{
>>>   "customerid": "535137",
>>>   "foo": "bar"
>>> }]
>>>
>>> and I know the schema should be:
>>>
>>> StructType(Array(StructField("customerid", LongType, true),
>>>   StructField("foo", StringType, true)))
>>>
>>> What's the best way of solving this?
>>>
>>> My current approach is to iterate over the JSON, identify which fields
>>> are numbers and which aren't, then recreate the JSON.
>>>
>>> But to be honest that doesn't seem like the cleanest approach, so I'm
>>> happy for advice on this.
>>>
>>> Regards
>>> Sam
>>>
>>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>>> -dev
>>>>
>>>> You can use withColumn to change the type after the data has been
>>>> loaded
>>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>.
>>>>
>>>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Dirceu
>>>>
>>>> Thanks, you're right! That did work.
>>>>
>>>> But now I'm facing an even bigger problem: since I don't have access
>>>> to change the underlying data, I just want to apply a schema over
>>>> something that was written via sparkContext.newAPIHadoopRDD.
>>>>
>>>> Basically I am reading in an RDD[JsonObject] and would like to convert
>>>> it into a dataframe to which I pass the schema.
>>>>
>>>> What's the best way to do this?
>>>>
>>>> I doubt removing all the quotes in the JSON is the best solution, is
>>>> it?
>>>>
>>>> Regards
>>>> Sam
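One possible answer to the RDD[JsonObject] question above, sketched under the assumption that each JsonObject serializes back to valid JSON text via toString (as e.g. Gson's JsonObject does) and that this is Spark 2.x; jsonObjects and parseSchema are illustrative names, not from the thread:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Keep the quoted numbers as strings at parse time so nothing is nulled
// out; the explicit casts from the message above can then be applied.
val parseSchema = StructType(Array(
  StructField("customerid", StringType, true),
  StructField("foo", StringType, true)))

// Render each JsonObject back to JSON text and let the reader apply the
// schema; jsonObjects stands in for the RDD[JsonObject] from
// sparkContext.newAPIHadoopRDD.
val jsonStrings: RDD[String] = jsonObjects.map(_.toString)
val df = spark.read.schema(parseSchema).json(jsonStrings)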
>>>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>>>> dirceu.semigh...@gmail.com> wrote:
>>>>
>>>> Hi Sam
>>>> Remove the " from the number and it will work.
>>>>
>>>> On 4 Feb 2017 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com> wrote:
>>>>
>>>> Hi All
>>>>
>>>> I would like to specify a schema when reading from JSON, but when I
>>>> try to map a number to a Double it fails; I tried FloatType and
>>>> IntType with no joy!
>>>>
>>>> When inferring the schema, customerid is set to String, and I would
>>>> like to cast it as Double, so df1 below comes out corrupted while df2
>>>> shows the data.
>>>>
>>>> Also FYI, I need this to be generic, as I would like to apply it to
>>>> any JSON; I specified the below schema as an example of the issue I am
>>>> facing.
>>>>
>>>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField,
>>>>   DoubleType, FloatType, StructType, LongType, DecimalType}
>>>> val testSchema = StructType(Array(StructField("customerid", DoubleType)))
>>>> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>> df1.show(1)
>>>> df2.show(1)
>>>>
>>>> Any help would be appreciated; I am sure I am missing something
>>>> obvious, but for the life of me I can't tell what it is!
>>>>
>>>> Kind Regards
>>>> Sam
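Putting the thread's advice together, a minimal end-to-end sketch of the parse-as-string-then-cast approach, assuming Spark 2.x with spark and sc in scope (as in spark-shell); the schemas simply mirror the example JSON above:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Parse schema: read the quoted number as a plain string so it survives
// parsing (reading it directly as a number is what corrupted df1 above).
val parseSchema = StructType(Array(
  StructField("customerid", StringType, true),
  StructField("foo", StringType, true)))

var df = spark.read.schema(parseSchema)
  .json(sc.parallelize(Array("""{"customerid": "535137", "foo": "bar"}""")))

// Cast each column that should be numeric to its real type.
Seq("customerid").foreach { name =>
  df = df.withColumn(name, col(name).cast(LongType))
}

df.printSchema()  // customerid is now LongType
df.show(1)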