Just FYI, Spark 1.6 was released on emr-4.3.0 a couple of days ago:
https://aws.amazon.com/blogs/aws/emr-4-3-0-new-updated-applications-command-line-export/

On Thu, Jan 28, 2016 at 7:30 PM Andrew Zurn <awz...@gmail.com> wrote:
> Hey Daniel,
>
> Thanks for the response.
>
> After playing around for a bit, it looks like it's probably something
> similar to the first situation you mentioned, with the Parquet format
> causing issues. Both a programmatically created dataset and a dataset
> pulled off the internet (rather than out of S3 and put into HDFS/Hive)
> behaved with DataFrames as one would expect (printed out everything,
> grouped properly, etc.).
>
> It looks like there is more than likely an outstanding bug that causes
> issues with data coming from S3 that is converted to the Parquet format
> (I found an article highlighting that it was around in 1.4, and I guess
> it wouldn't be out of the realm of possibility for it to still exist).
> Link to the article:
> https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
>
> Hopefully a little more stability will come with the upcoming Spark 1.6
> release on EMR (I think that is happening sometime soon).
>
> Thanks again for the advice on where to dig further. Much appreciated.
>
> Andrew
>
> On Tue, Jan 26, 2016 at 9:18 AM, Daniel Darabos <
> daniel.dara...@lynxanalytics.com> wrote:
>
>> Have you tried setting spark.emr.dropCharacters to a lower value? (It
>> defaults to 8.)
>>
>> :) Just joking, sorry! Fantastic bug.
>>
>> What data source do you have for this DataFrame? I could imagine, for
>> example, that it's a Parquet file, and on EMR you are running with the
>> wrong version of the Parquet library, and it messes up strings. It
>> should be easy enough to try a different data format. You could also
>> try what happens if you just create the DataFrame programmatically,
>> e.g. sc.parallelize(Seq("asdfasdfasdf")).toDF.
>>
>> To understand better at which point the characters are lost, you could
>> try grouping by a string attribute. I see "education" ends up either as
>> "" (empty string) or "y" in the printed output. But are the characters
>> already lost when you try grouping by the attribute?
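[A minimal sketch of the programmatic-DataFrame test Daniel suggests, assuming a Spark 1.x shell where sc and sqlContext are already in scope; the column name "education" is only illustrative:]

```scala
// Build a DataFrame from in-memory data, bypassing S3/Parquet entirely.
// If strings survive intact here, the data source/format is the suspect.
import sqlContext.implicits._

val df = sc.parallelize(Seq("asdfasdfasdf", "primary", "tertiary"))
  .toDF("education")

df.show()  // do the full strings print, or are the first 8 chars gone?
```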
>> Will there be a single "" category, or will you have separate
>> categories for "primary" and "tertiary"?
>>
>> I think the correct output through the RDD suggests that the issue
>> happens at the very end. So it will probably happen with different data
>> sources too, and grouping will create separate groups for "primary" and
>> "tertiary" even though they are printed as the same string at the end.
>> You should also check the data from "take(10)" to rule out any issues
>> with printing. You could try the same "groupBy" trick after "take(10)".
>> Or you could print the lengths of the strings.
>>
>> Good luck!
>>
>> On Tue, Jan 26, 2016 at 3:53 AM, awzurn <awz...@gmail.com> wrote:
>>
>>> Sorry for the bump, but wondering if anyone else has seen this before.
>>> We're hoping to either resolve this soon, or move on with further
>>> steps to turn this into an issue.
>>>
>>> Thanks in advance,
>>>
>>> Andrew Zurn
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-Spark-SQL-Drops-First-8-Characters-of-String-on-Amazon-EMR-tp26022p26065.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
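[The diagnostics above (grouping, take(10), string lengths) can be sketched roughly as follows, assuming a Spark 1.x shell with a DataFrame df that has a string column named "education"; the names are taken from the thread, not from the actual job:]

```scala
import org.apache.spark.sql.functions.length

// Are "primary" and "tertiary" still distinct groups, or one "" group?
df.groupBy("education").count().show()

// Are the lengths already short in the data, or only in the printed output?
df.select(length(df("education"))).show()

// Same check on the driver side, after take(10), to rule out show()/print:
df.take(10).foreach(row => println(row.getString(0).length))
```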