If I am not missing anything here, "So I know which columns are numeric and which aren't because I have a StructType and all the internal StructFields will tell me which ones have a DataType which is numeric and which aren't" already gets you to a list of the fields that should be numeric.

Essentially, you will create a list of numeric fields from your "should-be" struct type. Then you will load your raw data using the built-in JSON reader. At this point, your data has the wrong schema. Now you need to correct it. How? Loop over the list of numeric fields (or do it directly on the struct type) and try to match the types. Wherever you find a mismatch, add a withColumn clause to cast to the correct data type (from your "should-be" struct).
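A rough sketch of that loop (Scala, untested; it assumes every field in the "should-be" schema is present in the loaded dataframe and only the types differ):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

def conformToSchema(raw: DataFrame, target: StructType): DataFrame =
  target.fields.foldLeft(raw) { (df, field) =>
    // Only cast when the loaded type differs from the "should-be" type.
    if (df.schema(field.name).dataType == field.dataType) df
    else df.withColumn(field.name, col(field.name).cast(field.dataType))
  }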
HTH?

Best
Ayan

On Mon, Feb 6, 2017 at 8:00 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Yup, sorry, I should have explained myself better.
>
> So I know which columns are numeric and which aren't because I have a
> StructType, and all the internal StructFields will tell me which ones
> have a DataType which is numeric and which aren't.
>
> So assuming I have a JSON string which has double quotes on numbers when
> it shouldn't, and I have the correct schema in a struct type,
> how can I iterate over them to programmatically create the new dataframe
> in the correct format?
>
> Do I iterate over the columns in the StructType? Or iterate over the
> columns in the dataframe and try to match them with the StructType?
>
> I hope I cleared things up. What I wouldn't do for a drawing board right
> now!
>
> On Mon, Feb 6, 2017 at 8:56 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Umm... I think the premise is that you need to "know" beforehand which
>> columns are numeric... Unless you know it, how would you apply the
>> schema?
>>
>> On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hussam.ela...@gmail.com>
>> wrote:
>>
>>> Thanks ayan, but I meant how to derive the list automatically.
>>>
>>> In your example you are specifying the numeric columns, and I would
>>> like it to be applied to any schema, if that makes sense.
>>>
>>> On Mon, 6 Feb 2017 at 08:49, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Simple (pyspark) example:
>>>>
>>>> >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>>>> >>> df.printSchema()
>>>> root
>>>>  |-- customerid: string (nullable = true)
>>>>  |-- foo: string (nullable = true)
>>>>
>>>> >>> numeric_field_list = ['customerid']
>>>>
>>>> >>> for k in numeric_field_list:
>>>> ...     df = df.withColumn(k, df[k].cast("long"))
>>>> ...
>>>> >>> df.printSchema()
>>>> root
>>>>  |-- customerid: long (nullable = true)
>>>>  |-- foo: string (nullable = true)
>>>>
>>>> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hussam.ela...@gmail.com>
>>>> wrote:
>>>>
>>>> Ok, thanks Michael!
>>>>
>>>> Can I get an idea on where to start? Assuming I have the end schema
>>>> and the current dataframe, how can I loop through it and create a new
>>>> dataframe using withColumn?
>>>>
>>>> Am I iterating through the dataframe or the schema?
>>>>
>>>> I'm assuming it's easier to iterate through the columns in the old df:
>>>> for each column, cast it correctly and generate a new df?
>>>>
>>>> Would you recommend that?
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mich...@databricks.com>
>>>> wrote:
>>>>
>>>> If you already have the expected schema, and you know that all numbers
>>>> will always be formatted as strings in the input JSON, you could
>>>> probably derive this list automatically.
>>>>
>>>> Wouldn't it be simpler to just regex replace the numbers to remove the
>>>> quotes?
>>>>
>>>> I think this is likely to be a slower and less robust solution. You
>>>> would have to make sure that you got all the corner cases right (i.e.
>>>> escaping and whatnot).
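A sketch of the automatic derivation Michael mentions (Scala; it assumes only top-level fields need conversion, so nested structs would need recursion):

import org.apache.spark.sql.types.{NumericType, StructType}

// Collect the name of every top-level field whose expected type is numeric.
def numericColumns(expected: StructType): Seq[String] =
  expected.fields.collect { case f if f.dataType.isInstanceOf[NumericType] => f.name }.toSeq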
>>>> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hussam.ela...@gmail.com>
>>>> wrote:
>>>>
>>>> I see, so for the connector I need to pass in an array/list of
>>>> numerical columns?
>>>>
>>>> Wouldn't it be simpler to just regex replace the numbers to remove the
>>>> quotes?
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>> Specifying the schema when parsing JSON will only let you pick between
>>>> similar datatypes (i.e. should this be a short, long, float, double,
>>>> etc.). It will not let you perform conversions like string <-> number.
>>>> That has to be done with explicit casts after the data has been loaded.
>>>>
>>>> I think you can make a solution that uses select or withColumn
>>>> generic. Just load the dataframe with a "parse schema" that treats
>>>> numbers as strings. Then construct a list of columns that should be
>>>> numbers and apply the necessary conversions.
>>>>
>>>> import org.apache.spark.sql.functions.col
>>>> var df = spark.read.schema(parseSchema).json("...")
>>>> numericColumns.foreach { columnName =>
>>>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>>>> }
>>>>
>>>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Michael
>>>>
>>>> I've been spending the past few days researching this.
>>>>
>>>> The problem is that the generated JSON has double quotes on fields
>>>> that are numbers, because the producing datastore doesn't want to lose
>>>> precision.
>>>>
>>>> I can change the data type, true, but that would be specific to one
>>>> job rather than a generic streaming job. I'm writing a structured
>>>> streaming connector, and I have the schema the generated dataframe
>>>> should match.
>>>>
>>>> Unfortunately, using withColumn won't help me here, since the solution
>>>> needs to be generic.
>>>>
>>>> To summarise, assume I have the following JSON:
>>>>
>>>> [{
>>>>   "customerid": "535137",
>>>>   "foo": "bar"
>>>> }]
>>>>
>>>> and I know the schema should be:
>>>> StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))
>>>>
>>>> What's the best way of solving this?
>>>>
>>>> My current approach is to iterate over the JSON, identify which fields
>>>> are numbers and which aren't, then recreate the JSON. But to be honest,
>>>> that doesn't seem like the cleanest approach, so I'm happy for advice
>>>> on this.
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mich...@databricks.com>
>>>> wrote:
>>>>
>>>> -dev
>>>>
>>>> You can use withColumn to change the type after the data has been
>>>> loaded
>>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>.
>>>>
>>>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Dirceu
>>>>
>>>> Thanks, you're right! That did work.
>>>>
>>>> But now I'm facing an even bigger problem: since I don't have access
>>>> to change the underlying data, I just want to apply a schema over
>>>> something that was written via sparkContext.newAPIHadoopRDD.
>>>>
>>>> Basically, I am reading in an RDD[JsonObject] and would like to
>>>> convert it into a dataframe to which I pass the schema.
>>>>
>>>> What's the best way to do this? I doubt removing all the quotes in the
>>>> JSON is the best solution, is it?
>>>>
>>>> Regards
>>>> Sam
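A sketch tying Michael's "parse schema" idea to the RDD[JsonObject] case (Scala, untested; jsonObjects and expectedSchema are stand-in names, and it assumes the JsonObject values serialize back to valid JSON via toString, as Gson's do; the thread does not confirm which JsonObject class is involved):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{NumericType, StringType, StructField, StructType}

// Derive a "parse schema" from the expected schema: numeric fields are read
// as strings so the quoted numbers parse cleanly; casts are applied later.
def parseSchemaFor(expected: StructType): StructType =
  StructType(expected.fields.map {
    case StructField(name, _: NumericType, nullable, meta) =>
      StructField(name, StringType, nullable, meta)
    case other => other
  })

// Render each JsonObject back to a JSON string and let Spark parse it
// (DataFrameReader.json accepts an RDD[String] in Spark 2.x).
val jsonStrings: RDD[String] = jsonObjects.map(_.toString)
val raw = spark.read.schema(parseSchemaFor(expectedSchema)).json(jsonStrings)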
>>>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>>>> dirceu.semigh...@gmail.com> wrote:
>>>>
>>>> Hi Sam
>>>> Remove the " from the number and it will work.
>>>>
>>>> On 4 Feb 2017 at 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi All
>>>>
>>>> I would like to specify a schema when reading from a JSON, but when
>>>> trying to map a number to a Double it fails; I tried FloatType and
>>>> IntType with no joy!
>>>>
>>>> When inferring the schema, customerid is set to String, and I would
>>>> like to cast it as Double, so df1 is corrupted while df2 displays
>>>> correctly.
>>>>
>>>> Also, FYI, I need this to be generic, as I would like to apply it to
>>>> any JSON; I specified the below schema as an example of the issue I am
>>>> facing.
>>>>
>>>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField,
>>>> DoubleType, FloatType, StructType, LongType, DecimalType}
>>>> val testSchema = StructType(Array(StructField("customerid", DoubleType)))
>>>> val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>> val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>> df1.show(1)
>>>> df2.show(1)
>>>>
>>>> Any help would be appreciated. I am sure I am missing something
>>>> obvious, but for the life of me I can't tell what it is!
>>>>
>>>> Kind Regards
>>>> Sam

--
Best Regards,
Ayan Guha
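For completeness, the repro above works once customerid is read as a string and then cast explicitly. A minimal sketch combining the suggestions in this thread:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Parse the quoted number as a string first, then cast to the intended type.
val stringSchema = StructType(Array(StructField("customerid", StringType)))
val df3 = spark.read.schema(stringSchema)
  .json(sc.parallelize(Array("""{"customerid":"535137"}""")))
  .withColumn("customerid", col("customerid").cast("double"))
// df3.printSchema() now reports customerid as double.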