Just FYI, Spark 1.6 was released on emr-4.3.0 a couple of days ago:
https://aws.amazon.com/blogs/aws/emr-4-3-0-new-updated-applications-command-line-export/

On Thu, Jan 28, 2016 at 7:30 PM Andrew Zurn <awz...@gmail.com> wrote:
> Hey Daniel,
>
> Thanks for the response.
>
> After playing around for a bit, it looks like it's probably something
> similar to the first situation you mentioned, with the Parquet format
> causing issues. Both a programmatically created dataset and a dataset
> pulled off the internet (rather than out of S3 and put into HDFS/Hive)
> behaved with DataFrames as one would expect (printed out everything,
> grouped properly, etc.).
>
> It looks like there is more than likely an outstanding bug that causes
> issues with data coming from S3 that is converted to the Parquet format
> (I found an article highlighting that it was around in 1.4, and I guess
> it wouldn't be out of the realm of possibility for it to still exist).
> Link to the article:
> https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
>
> Hopefully a little more stability will come with the upcoming Spark 1.6
> release on EMR (I think that is happening sometime soon).
>
> Thanks again for the advice on where to dig further. Much appreciated.
>
> Andrew
>
> On Tue, Jan 26, 2016 at 9:18 AM, Daniel Darabos <
> daniel.dara...@lynxanalytics.com> wrote:
>
>> Have you tried setting spark.emr.dropCharacters to a lower value? (It
>> defaults to 8.)
>>
>> :) Just joking, sorry! Fantastic bug.
>>
>> What data source do you have for this DataFrame? I could imagine, for
>> example, that it's a Parquet file, and on EMR you are running with the
>> wrong version of the Parquet library, and it messes up strings. It
>> should be easy enough to try a different data format. You could also
>> try what happens if you just create the DataFrame programmatically,
>> e.g. sc.parallelize(Seq("asdfasdfasdf")).toDF.
>>
>> To understand better at which point the characters are lost, you could
>> try grouping by a string attribute. I see "education" ends up either as
>> "" (empty string) or "y" in the printed output. But are the characters
>> already lost when you try grouping by the attribute?
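[A minimal sketch of the programmatic-DataFrame test Daniel suggests, assuming a Spark 1.x shell where sc and sqlContext are already in scope; the column name "education" is only illustrative:]

```scala
// Build a DataFrame from in-memory data, bypassing S3/Parquet entirely.
// If strings survive intact here, the data source/format is the suspect.
import sqlContext.implicits._

val df = sc.parallelize(Seq("asdfasdfasdf", "primary", "tertiary"))
  .toDF("education")

df.show()  // do the full strings print, or are the first 8 chars gone?
```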
>> Will there be a single "" category, or will you have separate
>> categories for "primary" and "tertiary"?
>>
>> I think the correct output through the RDD suggests that the issue
>> happens at the very end. So it will probably happen with different data
>> sources too, and grouping will create separate groups for "primary" and
>> "tertiary" even though they are printed as the same string at the end.
>> You should also check the data from "take(10)" to rule out any issues
>> with printing. You could try the same "groupBy" trick after "take(10)".
>> Or you could print the lengths of the strings.
>>
>> Good luck!
>>
>> On Tue, Jan 26, 2016 at 3:53 AM, awzurn <awz...@gmail.com> wrote:
>>
>>> Sorry for the bump, but wondering if anyone else has seen this before.
>>> We're hoping to either resolve this soon, or move on with further
>>> steps to turn this into an issue.
>>>
>>> Thanks in advance,
>>>
>>> Andrew Zurn
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-Spark-SQL-Drops-First-8-Characters-of-String-on-Amazon-EMR-tp26022p26065.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
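[The diagnostics above (grouping, take(10), string lengths) can be sketched roughly as follows, assuming a Spark 1.x shell with a DataFrame df that has a string column named "education"; the names are taken from the thread, not from the actual job:]

```scala
import org.apache.spark.sql.functions.length

// Are "primary" and "tertiary" still distinct groups, or one "" group?
df.groupBy("education").count().show()

// Are the lengths already short in the data, or only in the printed output?
df.select(length(df("education"))).show()

// Same check on the driver side, after take(10), to rule out show()/print:
df.take(10).foreach(row => println(row.getString(0).length))
```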