Re: Spark 2.0 regression when querying very wide data frames
I don't think that's the issue. It sounds very much like this:
https://issues.apache.org/jira/browse/SPARK-16664

Morten

> On 20 Aug 2016, at 21:24, ponkin [via Apache Spark User List] wrote:
>
> Did you try to load a wide CSV or Parquet file, for example? Maybe the
> problem is in spark-cassandra-connector, not Spark itself? Are you using
> spark-cassandra-connector (https://github.com/datastax/spark-cassandra-connector)?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-regression-when-querying-very-wide-data-frames-tp27567p27580.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark 2.0 regression when querying very wide data frames
I generated a CSV file with 300 columns, and it seems to work fine with Spark DataFrames (Spark 2.0). If you are using spark-cassandra-connector, I think you need to post your issue to its community: https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
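For anyone who wants to repeat this kind of check, here is a minimal sketch of generating a wide CSV file with plain Python. It is not the exact file used above: the function name, the column names (column1, column2, ...) and the integer cell values are made up for illustration.

```python
import csv


def write_wide_csv(path, n_columns=300, n_rows=1):
    """Write a CSV with one header row and n_rows data rows of n_columns each.

    Column names and cell values are illustrative only.
    """
    header = [f"column{i}" for i in range(1, n_columns + 1)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for r in range(n_rows):
            # Every cell is a non-null integer, so a query result that
            # consists only of nulls is easy to recognize as wrong.
            writer.writerow([r * n_columns + c for c in range(n_columns)])
    return header
```

In Spark 2.0 such a file can be loaded with the built-in CSV reader (for example spark.read.csv(path, header=True) in PySpark) and registered as a temporary view before running the queries from this thread against it.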
Re: Spark 2.0 regression when querying very wide data frames
Did you try to load a wide CSV or Parquet file, for example? Maybe the problem is in spark-cassandra-connector, not Spark itself? Are you using spark-cassandra-connector (https://github.com/datastax/spark-cassandra-connector)?
Re: Spark 2.0 regression when querying very wide data frames
Cassandra.

Morten

> On 20 Aug 2016, at 13:53, ponkin [via Apache Spark User List] wrote:
>
> Hi,
> What kind of datasource do you have? CSV, Avro, Parquet?
Re: Spark 2.0 regression when querying very wide data frames
Hi, What kind of datasource do you have? CSV, Avro, Parquet?
Re: Spark 2.0 regression when querying very wide data frames
Yes, have a look through JIRA in cases like this.
https://issues.apache.org/jira/browse/SPARK-16664

On Sat, Aug 20, 2016 at 1:57 AM, mhornbech wrote:
> I did some extra digging. Running the query "select column1 from myTable" I
> can reproduce the problem on a frame with a single row - it occurs exactly
> when the frame has more than 200 columns, which smells a bit like a
> hardcoded limit.
>
> Interestingly the problem disappears when replacing the query with "select
> column1 from myTable limit N" where N is arbitrary. However it appears again
> when running "select * from myTable limit N" with sufficiently many columns
> (haven't determined the exact threshold here).
Re: Spark 2.0 regression when querying very wide data frames
I did some extra digging. With the query "select column1 from myTable" I can reproduce the problem on a frame with a single row. It occurs exactly when the frame has more than 200 columns, which smells a bit like a hardcoded limit.

Interestingly, the problem disappears when I replace the query with "select column1 from myTable limit N", where N is arbitrary. However, it appears again when running "select * from myTable limit N" with sufficiently many columns (I haven't determined the exact threshold here).
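A threshold like the undetermined one for "select * ... limit N" could be narrowed down by bisecting on the column count. The sketch below shows only the search itself and assumes the failure is monotonic, i.e. once a frame is wide enough to break, every wider frame breaks too. The function name and the is_broken callback are hypothetical and not part of Spark.

```python
def find_threshold(is_broken, lo, hi):
    """Return the smallest column count in (lo, hi] for which is_broken is True.

    Assumes is_broken(lo) is False, is_broken(hi) is True, and that the
    failure is monotonic in the column count.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(mid):
            hi = mid  # breaks at mid: the threshold is mid or lower
        else:
            lo = mid  # still works at mid: the threshold is higher
    return hi
```

In practice is_broken(n) would build a single-row frame with n columns, run the query under test, and report whether the values came back null. With the behaviour described above for "select column1 from myTable" (works up to 200 columns, fails beyond), a search between 1 and 1500 columns would need about 11 probes.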
Spark 2.0 regression when querying very wide data frames
Hi

We currently have some workloads in Spark 1.6.2 with queries operating on a data frame with 1500+ columns (17000 rows). This has never been quite stable, and some queries, such as "select *", would yield empty result sets, but queries restricted to specific columns have mostly worked. Needless to say, 1500+ columns isn't "desirable", but that's what the client's data looks like, and our preference has been to load it and normalize it through Spark.

We have been waiting to see how this would work with Spark 2.0, and unfortunately the problem has gotten worse. Almost all queries on this large data frame that worked before now return data frames with only null values.

Is this a known issue with Spark? If yes, does anyone know why it has been left untouched / made worse in Spark 2.0? If data frames with many columns are a limitation that goes deep into Spark, I would prefer hard errors rather than queries that run with meaningless results. The problem is easy to reproduce, but I am not familiar enough with debugging the Spark source code to find the root cause.

Hope some of you can enlighten me :-)
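Until the root cause is known, the "hard error" could be approximated on the application side with a guard like the sketch below. This is only a workaround idea, not a Spark feature: the function name is made up, and it assumes the rows have already been fetched to the driver as tuple-like objects (for example the result of take() on a small sample).

```python
def assert_not_all_null(rows):
    """Raise ValueError if the rows contain values and every value is None.

    rows: an iterable of tuple-like rows already collected to the driver.
    Returns the rows as a list so the call can be chained.
    """
    rows = list(rows)
    values = [value for row in rows for value in row]
    if values and all(value is None for value in values):
        raise ValueError(
            f"all {len(values)} values in {len(rows)} row(s) are null; "
            "the query result is probably meaningless"
        )
    return rows
```

A result that legitimately contains only nulls would also trip this guard, so it only makes sense on columns that are known to hold data.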