[jira] [Created] (SPARK-15285) Generated SpecificSafeProjection.apply method grows beyond 64 KB
Konstantin Shaposhnikov created SPARK-15285:
-----------------------------------------------

             Summary: Generated SpecificSafeProjection.apply method grows beyond 64 KB
                 Key: SPARK-15285
                 URL: https://issues.apache.org/jira/browse/SPARK-15285
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1, 2.0.0
            Reporter: Konstantin Shaposhnikov


The following code snippet results in

{noformat}
org.codehaus.janino.JaninoRuntimeException: Code of method "(Ljava/lang/Object;)Ljava/lang/Object;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection" grows beyond 64 KB
	at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
{noformat}

{code}
case class S100(
  s1:String="1", s2:String="2", s3:String="3", s4:String="4", s5:String="5",
  s6:String="6", s7:String="7", s8:String="8", s9:String="9", s10:String="10",
  s11:String="11", s12:String="12", s13:String="13", s14:String="14", s15:String="15",
  s16:String="16", s17:String="17", s18:String="18", s19:String="19", s20:String="20",
  s21:String="21", s22:String="22", s23:String="23", s24:String="24", s25:String="25",
  s26:String="26", s27:String="27", s28:String="28", s29:String="29", s30:String="30",
  s31:String="31", s32:String="32", s33:String="33", s34:String="34", s35:String="35",
  s36:String="36", s37:String="37", s38:String="38", s39:String="39", s40:String="40",
  s41:String="41", s42:String="42", s43:String="43", s44:String="44", s45:String="45",
  s46:String="46", s47:String="47", s48:String="48", s49:String="49", s50:String="50",
  s51:String="51", s52:String="52", s53:String="53", s54:String="54", s55:String="55",
  s56:String="56", s57:String="57", s58:String="58", s59:String="59", s60:String="60",
  s61:String="61", s62:String="62", s63:String="63", s64:String="64", s65:String="65",
  s66:String="66", s67:String="67", s68:String="68", s69:String="69", s70:String="70",
  s71:String="71", s72:String="72", s73:String="73", s74:String="74", s75:String="75",
  s76:String="76", s77:String="77", s78:String="78", s79:String="79", s80:String="80",
  s81:String="81", s82:String="82", s83:String="83", s84:String="84", s85:String="85",
  s86:String="86", s87:String="87", s88:String="88", s89:String="89", s90:String="90",
  s91:String="91", s92:String="92", s93:String="93", s94:String="94", s95:String="95",
  s96:String="96", s97:String="97", s98:String="98", s99:String="99", s100:String="100")

case class S(
  s1: S100=S100(), s2: S100=S100(), s3: S100=S100(), s4: S100=S100(), s5: S100=S100(),
  s6: S100=S100(), s7: S100=S100(), s8: S100=S100(), s9: S100=S100(), s10: S100=S100())

val ds = Seq(S(),S(),S()).toDS
ds.show()
{code}

I could reproduce this with Spark built from 1.6 branch and with https://home.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-2.0.0-SNAPSHOT-2016_05_11_01_03-8beae59-bin/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
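[Editor's note] For context on the failure mode: the JVM caps a single method's bytecode at 64 KB (a 65535-byte code length), and the generated SpecificSafeProjection.apply emits converter statements for all ~1000 string fields of the nested schema above in one method body. The sketch below is not Spark's actual code generator; it only illustrates the standard mitigation of splitting a long run of generated statements into small helper methods. All names in it are made up.

{code}
// Illustrative sketch only: chunk generated statements into helper methods so
// that no single generated method body can exceed the JVM's 64 KB bytecode cap.
def splitIntoHelperMethods(statements: Seq[String], perMethod: Int = 100): Seq[String] =
  statements.grouped(perMethod).zipWithIndex.map { case (group, i) =>
    s"private void applyPart$i(Object input) {\n  ${group.mkString("\n  ")}\n}"
  }.toSeq

// The top-level apply() then only calls applyPart0(input), applyPart1(input), ...
// so its own body stays small regardless of how many fields the schema has.
{code}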
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093511#comment-15093511 ]

Konstantin Shaposhnikov commented on SPARK-2984:
------------------------------------------------

I am seeing the same error message with Spark 1.6 and HDFS. This happens after an earlier job failure (ClassCastException).

> FileNotFoundException on _temporary directory
> ---------------------------------------------
>
>                 Key: SPARK-2984
>                 URL: https://issues.apache.org/jira/browse/SPARK-2984
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Josh Rosen
>            Priority: Critical
>             Fix For: 1.3.0
>
> We've seen several stacktraces and threads on the user mailing list where people are having issues with a {{FileNotFoundException}} stemming from an HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}. I think the error condition might manifest in this circumstance:
> 1) task T starts on an executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but those files no longer exist! exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 140774430 ms.0
> java.io.FileNotFoundException: File hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 does not exist.
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> 	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> 	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> 	at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> 	at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> 	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> 	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> 	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> 	at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> 	at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> 	at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> 	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> 	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> 	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> 	at scala.util.Try$.apply(Try.scala:161)
> 	at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> 	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
> {noformat}
> I am running a Spark Streaming job that uses saveAsTextFiles to save results into hdfs files. However, it has an exception after 20 batches
> result-140631234/_temporary/0/task_201407251119__m_03 does not exist.
> {noformat}
> and
> {noformat}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files.
> 	at
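[Editor's note] Given [~aash]'s race analysis above, one commonly suggested mitigation is to disable speculative execution so that only one task attempt ever commits and cleans up {{_temporary}}. A sketch, under the assumption that the job does not depend on speculative retries (the app name is illustrative; {{spark.speculation}} is a standard Spark config key):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// With speculation off there is no second attempt T' to race the first
// attempt's commit and cleanup of the _temporary directory.
val conf = new SparkConf()
  .setAppName("no-speculation-example") // illustrative name
  .set("spark.speculation", "false")    // this is the default; set it explicitly
val sc = new SparkContext(conf)
{code}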
[jira] [Commented] (SPARK-10066) Can't create HiveContext with spark-shell or spark-sql on snapshot
[ https://issues.apache.org/jira/browse/SPARK-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966346#comment-14966346 ]

Konstantin Shaposhnikov commented on SPARK-10066:
-------------------------------------------------

I have the same problem when creating HiveContext programmatically (from a Scala app) on Windows.

> Can't create HiveContext with spark-shell or spark-sql on snapshot
> ------------------------------------------------------------------
>
>                 Key: SPARK-10066
>                 URL: https://issues.apache.org/jira/browse/SPARK-10066
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, SQL
>    Affects Versions: 1.5.0
>         Environment: Centos 6.6
>            Reporter: Robert Beauchemin
>            Priority: Minor
>
> Built the 1.5.0-preview-20150812 with the following:
> ./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Psparkr -DskipTests
> Starting spark-shell or spark-sql returns the following error:
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx--
> at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
> [elided]
> at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
>
> It's trying to create a new HiveContext. Running pySpark or sparkR works and creates a HiveContext successfully. SqlContext can be created successfully with any shell.
> I've tried changing permissions on that HDFS directory (even as far as making it world-writable) without success. Tried changing SPARK_USER and also running spark-shell as different users without success.
> This works on the same machine on 1.4.1 and on earlier pre-release versions of Spark 1.5.0 (same make-distribution parms) successfully. Just trying the snapshot...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10066) Can't create HiveContext with spark-shell or spark-sql on snapshot
[ https://issues.apache.org/jira/browse/SPARK-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966346#comment-14966346 ]

Konstantin Shaposhnikov edited comment on SPARK-10066 at 10/21/15 7:19 AM:
---------------------------------------------------------------------------

I have the same problem when creating HiveContext programmatically (from a Scala app) on Windows using the latest Spark from the 1.5 branch.

was (Author: k.shaposhni...@gmail.com):
I have the same problem when creating HiveContext programmatically (from a Scala app) on Windows.

> Can't create HiveContext with spark-shell or spark-sql on snapshot
> ------------------------------------------------------------------
>
>                 Key: SPARK-10066
>                 URL: https://issues.apache.org/jira/browse/SPARK-10066
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, SQL
>    Affects Versions: 1.5.0
>         Environment: Centos 6.6
>            Reporter: Robert Beauchemin
>            Priority: Minor
>
> Built the 1.5.0-preview-20150812 with the following:
> ./make-distribution.sh -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Psparkr -DskipTests
> Starting spark-shell or spark-sql returns the following error:
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx--
> at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
> [elided]
> at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
>
> It's trying to create a new HiveContext. Running pySpark or sparkR works and creates a HiveContext successfully. SqlContext can be created successfully with any shell.
> I've tried changing permissions on that HDFS directory (even as far as making it world-writable) without success. Tried changing SPARK_USER and also running spark-shell as different users without success.
> This works on the same machine on 1.4.1 and on earlier pre-release versions of Spark 1.5.0 (same make-distribution parms) successfully. Just trying the snapshot...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
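[Editor's note] A hedged sketch of the pre-flight check usually tried for this error; note that the reporter above says permission changes did not help in their environment, so treat this as a diagnostic first step rather than a confirmed fix. It makes sure {{/tmp/hive}} exists and is world-writable before the HiveContext is created:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

// SessionState.createRootHDFSDir fails when the Hive root scratch dir is not
// writable, so create it and open its permissions before new HiveContext(sc).
val fs = FileSystem.get(new Configuration())
val scratchDir = new Path("/tmp/hive")
if (!fs.exists(scratchDir)) fs.mkdirs(scratchDir)
fs.setPermission(scratchDir, new FsPermission(Integer.parseInt("777", 8).toShort))
{code}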
[jira] [Commented] (SPARK-8824) Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681418#comment-14681418 ] Konstantin Shaposhnikov commented on SPARK-8824: Ok, thank you for the update. Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS --- Key: SPARK-8824 URL: https://issues.apache.org/jira/browse/SPARK-8824 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8824) Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681146#comment-14681146 ] Konstantin Shaposhnikov commented on SPARK-8824: parquet-mr 1.7.0+ depends on parquet-format 2.3.0-incubating that includes support for TIMESTAMP_MILLIS: https://github.com/apache/parquet-format/blob/apache-parquet-format-2.3.0-incubating/src/thrift/parquet.thrift#L104 TIMESTAMP_MICROS is not there indeed. Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS --- Key: SPARK-8824 URL: https://issues.apache.org/jira/browse/SPARK-8824 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
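[Editor's note] For context, per the parquet.thrift definition linked above, TIMESTAMP_MILLIS annotates a plain INT64 holding milliseconds since the Unix epoch, and TIMESTAMP_MICROS would hold microseconds. A minimal sketch of the reader-side conversion; the helper names are mine, not Spark's:

{code}
import java.sql.Timestamp

// TIMESTAMP_MILLIS: INT64 = milliseconds since 1970-01-01T00:00:00 UTC.
def timestampFromMillis(millis: Long): Timestamp = new Timestamp(millis)

// TIMESTAMP_MICROS: INT64 = microseconds since the epoch (sketch for
// non-negative values; negative epochs need extra care).
def timestampFromMicros(micros: Long): Timestamp = {
  val ts = new Timestamp(micros / 1000L)          // truncate to milliseconds
  ts.setNanos((micros % 1000000L).toInt * 1000)   // restore microsecond precision
  ts
}
{code}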
[jira] [Commented] (SPARK-8824) Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653060#comment-14653060 ] Konstantin Shaposhnikov commented on SPARK-8824: Can TIMESTAMP_MILLIS support (for INT64 type) be implemented first? Would a pull request for this be accepted for Spark 1.5? Thank you. Support Parquet logical types TIMESTAMP_MILLIS and TIMESTAMP_MICROS --- Key: SPARK-8824 URL: https://issues.apache.org/jira/browse/SPARK-8824 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8819) Spark doesn't compile with maven 3.3.x
[ https://issues.apache.org/jira/browse/SPARK-8819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648680#comment-14648680 ]

Konstantin Shaposhnikov commented on SPARK-8819:
------------------------------------------------

[MSHADE-148] has been fixed in maven-shade-plugin 2.4.1. It would be good to update pom.xml to use it and remove the {{create.dependency.reduced.pom}} workaround.

Spark doesn't compile with maven 3.3.x
--------------------------------------

                Key: SPARK-8819
                URL: https://issues.apache.org/jira/browse/SPARK-8819
            Project: Spark
         Issue Type: Bug
         Components: Build
   Affects Versions: 1.3.2, 1.4.0, 1.5.0
           Reporter: Andrew Or
           Assignee: Andrew Or
           Priority: Blocker
            Fix For: 1.3.2, 1.4.1, 1.5.0

Simple reproduction: Install maven 3.3.3 and run build/mvn clean package -DskipTests

This works just fine for maven 3.2.1 but not for 3.3.x. The result is an infinite loop caused by MSHADE-148:

{code}
[INFO] Replacing /Users/andrew/Documents/dev/spark/andrew-spark/bagel/target/spark-bagel_2.10-1.5.0-SNAPSHOT.jar with /Users/andrew/Documents/dev/spark/andrew-spark/bagel/target/spark-bagel_2.10-1.5.0-SNAPSHOT-shaded.jar
[INFO] Dependency-reduced POM written at: /Users/andrew/Documents/dev/spark/andrew-spark/bagel/dependency-reduced-pom.xml
[INFO] Dependency-reduced POM written at: /Users/andrew/Documents/dev/spark/andrew-spark/bagel/dependency-reduced-pom.xml
...
{code}

This is ultimately caused by SPARK-7558 (master 9eb222c13991c2b4a22db485710dc2e27ccf06dd) but is recently revealed through SPARK-8781 (master 82cf3315e690f4ac15b50edea6a3d673aa5be4c0).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8781) Published POMs are no longer effective POMs
[ https://issues.apache.org/jira/browse/SPARK-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611600#comment-14611600 ]

Konstantin Shaposhnikov commented on SPARK-8781:
------------------------------------------------

I believe this will affect both released and SNAPSHOT artefacts.

Basically, as part of SPARK-3812 the build was changed to deploy effective POMs into the maven repository. E.g. in https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/1.4.0/spark-core_2.11-1.4.0.pom you won't find {{$\{scala.binary.version}}}; it was resolved to 2.11 by maven during the build. This is required for the Scala 2.11 build to make sure that jars that are built with Scala 2.11 reference Scala 2.11 jars (e.g. spark-core_2.11 should depend on spark-launcher_2.11, not on spark-launcher_2.10). By default {{$\{scala.binary.version}}} will be resolved to 2.10 because the scala-2.10 maven profile is active by default.

Publishing of effective POMs is implemented using maven-shade-plugin. To be honest I am not sure how exactly it works. However, when I removed the following line from the parent POM

{{<createDependencyReducedPom>false</createDependencyReducedPom>}}

the build started to deploy effective POMs again.

I hope my explanation helps.

Published POMs are no longer effective POMs
-------------------------------------------

                Key: SPARK-8781
                URL: https://issues.apache.org/jira/browse/SPARK-8781
            Project: Spark
         Issue Type: Bug
         Components: Build
   Affects Versions: 1.3.2, 1.4.1, 1.5.0
           Reporter: Konstantin Shaposhnikov

POMs published to the maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom:

{noformat}
...
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-launcher_${scala.binary.version}</artifactId>
  <version>${project.version}</version>
</dependency>
...
{noformat}

while it should be

{noformat}
...
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-launcher_2.11</artifactId>
  <version>${project.version}</version>
</dependency>
...
{noformat}

The following commits are most likely the cause of it:
- for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129
- for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78
- for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724

On branch-1.4 reverting the commit fixed the issue.

See SPARK-3812 for additional details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8781) Published POMs are no longer effective POMs
[ https://issues.apache.org/jira/browse/SPARK-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611606#comment-14611606 ]

Konstantin Shaposhnikov commented on SPARK-8781:
------------------------------------------------

The original commit that added effective POM publishing: https://github.com/apache/spark/commit/6e09c98b5d7ad92cf01a3b415008f48782f2f1a3

Published POMs are no longer effective POMs
-------------------------------------------

                Key: SPARK-8781
                URL: https://issues.apache.org/jira/browse/SPARK-8781
            Project: Spark
         Issue Type: Bug
         Components: Build
   Affects Versions: 1.3.2, 1.4.1, 1.5.0
           Reporter: Konstantin Shaposhnikov

POMs published to the maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom:

{noformat}
...
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-launcher_${scala.binary.version}</artifactId>
  <version>${project.version}</version>
</dependency>
...
{noformat}

while it should be

{noformat}
...
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-launcher_2.11</artifactId>
  <version>${project.version}</version>
</dependency>
...
{noformat}

The following commits are most likely the cause of it:
- for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129
- for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78
- for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724

On branch-1.4 reverting the commit fixed the issue.

See SPARK-3812 for additional details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8781) Published POMs are no longer effective POMs
Konstantin Shaposhnikov created SPARK-8781:
----------------------------------------------

             Summary: Published POMs are no longer effective POMs
                 Key: SPARK-8781
                 URL: https://issues.apache.org/jira/browse/SPARK-8781
             Project: Spark
          Issue Type: Bug
          Components: Build
    Affects Versions: 1.3.2, 1.4.1, 1.5.0
            Reporter: Konstantin Shaposhnikov


POMs published to the maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom:

{noformat}
...
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-launcher_${scala.binary.version}</artifactId>
  <version>${project.version}</version>
</dependency>
...
{noformat}

while it should be

{noformat}
...
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-launcher_2.11</artifactId>
  <version>${project.version}</version>
</dependency>
...
{noformat}

The following commits are most likely the cause of it:
- for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129
- for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78
- for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724

On branch-1.4 reverting the commit fixed the issue.

See SPARK-3812 for additional details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8585) Support LATERAL VIEW in Spark SQL parser
Konstantin Shaposhnikov created SPARK-8585:
----------------------------------------------

             Summary: Support LATERAL VIEW in Spark SQL parser
                 Key: SPARK-8585
                 URL: https://issues.apache.org/jira/browse/SPARK-8585
             Project: Spark
          Issue Type: Improvement
            Reporter: Konstantin Shaposhnikov


It would be good to support the LATERAL VIEW SQL syntax without the need to create a HiveContext.

Docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
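[Editor's note] For illustration, this is the kind of query the issue asks the plain SQL parser to accept; it runs only on a HiveContext today. The table and column names come from the pageAds example in the linked Hive documentation, not from Spark:

{code}
// Works with a HiveContext today; the request is to accept the same
// syntax in the plain SQLContext parser.
val expanded = hiveContext.sql(
  """SELECT pageid, adid
    |FROM pageAds
    |LATERAL VIEW explode(adid_list) adTable AS adid""".stripMargin)
{code}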
[jira] [Commented] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers
[ https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576536#comment-14576536 ] Konstantin Shaposhnikov commented on SPARK-8122: Yes, a strong reference is required for all Logger instances that are configured in enableLogForwarding. http://docs.oracle.com/javase/7/docs/api/java/util/logging/Logger.html#getLogger(java.lang.String) ParquetRelation.enableLogForwarding() may fail to configure loggers --- Key: SPARK-8122 URL: https://issues.apache.org/jira/browse/SPARK-8122 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Konstantin Shaposhnikov Priority: Minor _enableLogForwarding()_ doesn't hold to the created loggers that can be garbage collected and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: _It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._ All created logger references need to be kept, e.g. in static variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers
[ https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574419#comment-14574419 ]

Konstantin Shaposhnikov commented on SPARK-8122:
------------------------------------------------

Parquet itself suffers from this issue too, but it is almost impossible to hit there because the static block in Log is most likely called very shortly before some Logger instance is strongly referenced from a static LOG field (Log -> Logger -> parent Logger). It is very unlikely that a GC happens between these two events.

But when there is a longer interval between the moment a Logger is configured in {{enableLogForwarding()}} and the moment it is actually used to log something, there is a much better chance to see this. In one of my applications I used similar code to redirect parquet logging to slf4j and saw once that the redirect wasn't set up properly due to GC.

To be honest I wish parquet just used slf4j and didn't mess with logging setup ;)

ParquetRelation.enableLogForwarding() may fail to configure loggers
-------------------------------------------------------------------

                Key: SPARK-8122
                URL: https://issues.apache.org/jira/browse/SPARK-8122
            Project: Spark
         Issue Type: Sub-task
         Components: SQL
   Affects Versions: 1.5.0
           Reporter: Konstantin Shaposhnikov
           Priority: Minor

_enableLogForwarding()_ doesn't hold on to the created loggers, which can be garbage collected, and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs:

_It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._

All created logger references need to be kept, e.g. in static variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
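[Editor's note] A minimal sketch of the fix this comment argues for: keep a strong, static reference to every java.util.logging Logger that gets configured, since JUL itself only holds loggers weakly. The object and method names below are illustrative, not Spark's actual code:

{code}
import java.util.logging.{Level, Logger}

object ParquetLogRedirect {
  // An object field is a static strong reference, so this Logger (and the
  // configuration applied to it) cannot be garbage collected.
  private val parquetRootLogger: Logger = Logger.getLogger("org.apache.parquet")

  def enableLogForwarding(): Unit = {
    parquetRootLogger.setUseParentHandlers(false) // stop forwarding to the root handler
    parquetRootLogger.setLevel(Level.WARNING)     // drop the noisy INFO output
  }
}
{code}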
[jira] [Created] (SPARK-8122) A few problems in ParquetRelation.enableLogForwarding()
Konstantin Shaposhnikov created SPARK-8122:
----------------------------------------------

             Summary: A few problems in ParquetRelation.enableLogForwarding()
                 Key: SPARK-8122
                 URL: https://issues.apache.org/jira/browse/SPARK-8122
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.0
            Reporter: Konstantin Shaposhnikov


_enableLogForwarding()_ should be updated after the parquet 1.7.0 update, because the name of the logger has been changed to `org.apache.parquet`. From parquet-mr Log:

{code}
// add a default handler in case there is none
Logger logger = Logger.getLogger(Log.class.getPackage().getName());
{code}

Another problem with _enableLogForwarding()_ is that it doesn't hold on to the created loggers, which can be garbage collected, and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs:

_It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers
[ https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shaposhnikov updated SPARK-8122: --- Description: _enableLogForwarding()_ doesn't hold to the created loggers that can be garbage collected and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: _It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._ All created logger references need to be kept, e.g. in static variables. was: Another problem with _enableLogForwarding()_ is that it doesn't hold to the created loggers that can be garbage collected and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: _It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._ ParquetRelation.enableLogForwarding() may fail to configure loggers --- Key: SPARK-8122 URL: https://issues.apache.org/jira/browse/SPARK-8122 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Konstantin Shaposhnikov Priority: Minor _enableLogForwarding()_ doesn't hold to the created loggers that can be garbage collected and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: _It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._ All created logger references need to be kept, e.g. in static variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573981#comment-14573981 ]

Konstantin Shaposhnikov commented on SPARK-8118:
------------------------------------------------

The name of the logger has been changed to _org.apache.parquet_. From parquet-mr Log:

{code}
// add a default handler in case there is none
Logger logger = Logger.getLogger(Log.class.getPackage().getName());
{code}

Turn off noisy log output produced by Parquet 1.7.0
---------------------------------------------------

                Key: SPARK-8118
                URL: https://issues.apache.org/jira/browse/SPARK-8118
            Project: Spark
         Issue Type: Sub-task
         Components: SQL
   Affects Versions: 1.4.1, 1.5.0
           Reporter: Cheng Lian
           Assignee: Cheng Lian
           Priority: Minor

Parquet 1.7.0 renames the package to org.apache.parquet, so {{ParquetRelation.enableLogForwarding}} needs to be adjusted accordingly to avoid noisy log output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
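[Editor's note] A small sketch of the point being made: parquet-mr's Log registers its default handler on the logger named after its own package, so the logger Spark must reconfigure after the rename can be derived from the Log class itself rather than hard-coding "parquet":

{code}
import java.util.logging.Logger
import org.apache.parquet.Log

// After the 1.7.0 package rename this evaluates to "org.apache.parquet",
// the logger that parquet-mr's static initializer actually configures.
val parquetLoggerName: String = classOf[Log].getPackage.getName
val parquetLogger: Logger = Logger.getLogger(parquetLoggerName)
{code}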
[jira] [Commented] (SPARK-8122) A few problems in ParquetRelation.enableLogForwarding()
[ https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573973#comment-14573973 ] Konstantin Shaposhnikov commented on SPARK-8122: I believe that currently `ParquetRelation.enableLogForwarding` doesn't do anything as it configures the wrong logger (parquet instead of org.apache.parquet). I haven't tested it though. A few problems in ParquetRelation.enableLogForwarding() --- Key: SPARK-8122 URL: https://issues.apache.org/jira/browse/SPARK-8122 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Konstantin Shaposhnikov _enableLogForwarding()_ should be updated after parquet 1.7.0 update , because name of the logger has been changed to `org.apache.parquet`. From parquet-mr Log: {code} // add a default handler in case there is none Logger logger = Logger.getLogger(Log.class.getPackage().getName()); {code} Another problem with _enableLogForwarding()_ is that it doesn't hold to the created loggers that can be garbage collected and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: _It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers
[ https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shaposhnikov updated SPARK-8122: --- Priority: Minor (was: Major) ParquetRelation.enableLogForwarding() may fail to configure loggers --- Key: SPARK-8122 URL: https://issues.apache.org/jira/browse/SPARK-8122 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Konstantin Shaposhnikov Priority: Minor _enableLogForwarding()_ should be updated after parquet 1.7.0 update , because name of the logger has been changed to `org.apache.parquet`. From parquet-mr Log: {code} // add a default handler in case there is none Logger logger = Logger.getLogger(Log.class.getPackage().getName()); {code} Another problem with _enableLogForwarding()_ is that it doesn't hold to the created loggers that can be garbage collected and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: _It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers
[ https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shaposhnikov updated SPARK-8122: --- Summary: ParquetRelation.enableLogForwarding() may fail to configure loggers (was: A few problems in ParquetRelation.enableLogForwarding()) ParquetRelation.enableLogForwarding() may fail to configure loggers --- Key: SPARK-8122 URL: https://issues.apache.org/jira/browse/SPARK-8122 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Konstantin Shaposhnikov _enableLogForwarding()_ should be updated after parquet 1.7.0 update , because name of the logger has been changed to `org.apache.parquet`. From parquet-mr Log: {code} // add a default handler in case there is none Logger logger = Logger.getLogger(Log.class.getPackage().getName()); {code} Another problem with _enableLogForwarding()_ is that it doesn't hold to the created loggers that can be garbage collected and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: _It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8122) ParquetRelation.enableLogForwarding() may fail to configure loggers
[ https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shaposhnikov updated SPARK-8122: --- Description: Another problem with _enableLogForwarding()_ is that it doesn't hold to the created loggers that can be garbage collected and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: _It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._ was: _enableLogForwarding()_ should be updated after parquet 1.7.0 update , because name of the logger has been changed to `org.apache.parquet`. From parquet-mr Log: {code} // add a default handler in case there is none Logger logger = Logger.getLogger(Log.class.getPackage().getName()); {code} Another problem with _enableLogForwarding()_ is that it doesn't hold to the created loggers that can be garbage collected and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: _It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._ ParquetRelation.enableLogForwarding() may fail to configure loggers --- Key: SPARK-8122 URL: https://issues.apache.org/jira/browse/SPARK-8122 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Konstantin Shaposhnikov Priority: Minor Another problem with _enableLogForwarding()_ is that it doesn't hold to the created loggers that can be garbage collected and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: _It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8122) A few problems in ParquetRelation.enableLogForwarding()
[ https://issues.apache.org/jira/browse/SPARK-8122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573978#comment-14573978 ]

Konstantin Shaposhnikov commented on SPARK-8122:
------------------------------------------------

SPARK-8118 is for the first problem described in this issue. The second problem (the loggers can be garbage collected) is another issue and should be fixed separately. I will update the JIRA.

A few problems in ParquetRelation.enableLogForwarding()
-------------------------------------------------------

                Key: SPARK-8122
                URL: https://issues.apache.org/jira/browse/SPARK-8122
            Project: Spark
         Issue Type: Sub-task
         Components: SQL
   Affects Versions: 1.5.0
           Reporter: Konstantin Shaposhnikov

_enableLogForwarding()_ should be updated after the parquet 1.7.0 update, because the name of the logger has been changed to `org.apache.parquet`. From parquet-mr Log:

{code}
// add a default handler in case there is none
Logger logger = Logger.getLogger(Log.class.getPackage().getName());
{code}

Another problem with _enableLogForwarding()_ is that it doesn't hold on to the created loggers, which can be garbage collected, and all configuration changes will be gone. From https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs:

_It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept._



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564339#comment-14564339 ] Konstantin Shaposhnikov commented on SPARK-7042: I've created a pull request with akka 2.3.11 update. You can merge it if update of akka to version 2.3.11 looks reasonable. Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with the more recent version of Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update version of akka used by spark (master and 1.3 branch) I would also suggest to upgrade to the latest version of akka 2.3.9 (or 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560688#comment-14560688 ] Konstantin Shaposhnikov commented on SPARK-7042: Yes, I've just tested it locally - 2.11 Spark build works with akka 2.3.11 Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with the more recent version of Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update version of akka used by spark (master and 1.3 branch) I would also suggest to upgrade to the latest version of akka 2.3.9 (or 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560180#comment-14560180 ] Konstantin Shaposhnikov commented on SPARK-7042: It looks like akka-zeromq_2.11 is only available for versions 2.3.7+, though the rest of the akka libraries are available for 2.3.4. I wonder if akka version can just be updated to the latest 2.3.11? Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with the more recent version of Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update version of akka used by spark (master and 1.3 branch) I would also suggest to upgrade to the latest version of akka 2.3.9 (or 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560181#comment-14560181 ] Konstantin Shaposhnikov commented on SPARK-7042: It looks like akka-zeromq_2.11 is only available for versions 2.3.7+, though the rest of the akka libraries are available for 2.3.4. I wonder if akka version can just be updated to the latest 2.3.11? Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with the more recent version of Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update version of akka used by spark (master and 1.3 branch) I would also suggest to upgrade to the latest version of akka 2.3.9 (or 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shaposhnikov updated SPARK-7042: --- Comment: was deleted (was: It looks like akka-zeromq_2.11 is only available for versions 2.3.7+, though the rest of the akka libraries are available for 2.3.4. I wonder if akka version can just be updated to the latest 2.3.11?) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with the more recent version of Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update version of akka used by spark (master and 1.3 branch) I would also suggest to upgrade to the latest version of akka 2.3.9 (or 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560166#comment-14560166 ] Konstantin Shaposhnikov commented on SPARK-7042: That is not true: http://search.maven.org/#browse%7C-1552622333 (http://search.maven.org/#artifactdetails%7Ccom.typesafe.akka%7Cakka-actor_2.11%7C2.3.4%7Cjar) What exactly was broken in scala 2.11 build? Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with the more recent version of Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update version of akka used by spark (master and 1.3 branch) I would also suggest to upgrade to the latest version of akka 2.3.9 (or 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560283#comment-14560283 ]

Konstantin Shaposhnikov commented on SPARK-7042:
------------------------------------------------

It looks like the Spark-specific akka-zeromq version (2.3.4-spark) has been modified to work with Scala 2.11. In fact the standard build of akka-zeromq_2.11 (that is available for versions 2.3.7+) depends on the zeromq scala bindings created by the Spark project (org.spark-project.zeromq:zeromq-scala-binding_2.11:0.0.7-spark).

Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
------------------------------------------------------------------------------------------

                Key: SPARK-7042
                URL: https://issues.apache.org/jira/browse/SPARK-7042
            Project: Spark
         Issue Type: Improvement
         Components: Spark Core
   Affects Versions: 1.3.1
           Reporter: Konstantin Shaposhnikov
           Assignee: Konstantin Shaposhnikov
           Priority: Minor
            Fix For: 1.5.0

When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error:

{noformat}
2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
{noformat}

It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549).

The following steps can resolve the issue:
- re-build the custom akka library that is used by Spark with the more recent version of Scala compiler (e.g. 2.11.6)
- deploy a new version (e.g. 2.3.4.1-spark) to a maven repo
- update version of akka used by spark (master and 1.3 branch)

I would also suggest to upgrade to the latest version of akka 2.3.9 (or 2.3.10 that should be released soon).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545375#comment-14545375 ] Konstantin Shaposhnikov commented on SPARK-7042: There is nothing wrong with the standard Akka 2.11 build. In fact we have a custom build of Spark now that uses standard Akka 2.3.9 from maven central repository without any problems. The error appears only with the custom build of akka (because it was compiled with buggy version of Scala) that comes with spark by default. I agree that number of users affected by this problem is probably quite small (only 1? ;) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Priority: Minor When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with the more recent version of Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update version of akka used by spark (master and 1.3 branch) I would also suggest to upgrade to the latest version of akka 2.3.9 (or 2.3.10 that should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
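[Editor's note] To make the failure above concrete: Java serialization compares serialVersionUID values on both ends of the stream, and SI-8549 made scalac 2.11.0 silently drop the @SerialVersionUID annotation, so the affected akka build shipped a computed structural UID (-213377755528332889) where peers expected the declared 1. A self-contained illustration with my own class, not akka's:

{code}
import java.io.ObjectStreamClass

// With a correct compiler the declared UID is what goes on the wire; a
// compiler hit by SI-8549 ignores the annotation and computes a structural
// hash instead, so otherwise identical classes fail the compatibility check.
@SerialVersionUID(1L)
class IdentifyLike(val messageId: Any) extends Serializable

val uid = ObjectStreamClass.lookup(classOf[IdentifyLike]).getSerialVersionUID
// uid == 1L when the annotation is honored.
{code}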
[jira] [Commented] (SPARK-7249) Updated Hadoop dependencies due to inconsistency in the versions
[ https://issues.apache.org/jira/browse/SPARK-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529771#comment-14529771 ] Konstantin Shaposhnikov commented on SPARK-7249:

SPARK-7042 is somewhat related to this issue. If my understanding is correct, the standard akka build can be used with Hadoop 2.x because both of them use protobuf 2.5.

> Updated Hadoop dependencies due to inconsistency in the versions
> -
>
> Key: SPARK-7249
> URL: https://issues.apache.org/jira/browse/SPARK-7249
> Project: Spark
> Issue Type: Dependency upgrade
> Components: Build
> Affects Versions: 1.3.1
> Environment: Ubuntu 14.04. Apache Mesos in cluster mode with HDFS from cloudera 2.5.0-cdh5.3.3.
> Reporter: Favio Vázquez
> Priority: Blocker
>
> Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards-compatibility reasons.
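If both stacks really do agree on protobuf 2.5, a downstream build can make the alignment explicit. A sketch in sbt; the override is an assumption for illustration, not a change proposed in either issue:

{code}
// build.sbt -- illustrative only: force the single protobuf version that
// Hadoop 2.x and standard Akka 2.3.x are both expected to share, so a
// stray transitive 2.4.x artifact cannot win dependency resolution.
dependencyOverrides += "com.google.protobuf" % "protobuf-java" % "2.5.0"
{code}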
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507243#comment-14507243 ] Konstantin Shaposhnikov commented on SPARK-7042:

Is my understanding correct that the custom version of akka is required only for Hadoop 1? In that case it might be a good idea to use the standard akka jars when building with the hadoop-2 profile enabled.

> Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
> --
>
> Key: SPARK-7042
> URL: https://issues.apache.org/jira/browse/SPARK-7042
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.3.1
> Reporter: Konstantin Shaposhnikov
>
> When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error:
> {noformat}
> 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
> {noformat}
> It looks like the akka-actor_2.11 2.3.4-spark used by Spark was built with Scala compiler 2.11.0, which ignores @SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549).
> The following steps can resolve the issue:
> - re-build the custom akka library used by Spark with a more recent version of the Scala compiler (e.g. 2.11.6)
> - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo
> - update the version of akka used by Spark (master and the 1.3 branch)
> I would also suggest upgrading to the latest version of akka, 2.3.9 (or 2.3.10, which should be released soon).
[jira] [Created] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
Konstantin Shaposhnikov created SPARK-7042: --

Summary: Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
Key: SPARK-7042
URL: https://issues.apache.org/jira/browse/SPARK-7042
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov

When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error:
{noformat}
2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
{noformat}
It looks like the akka-actor_2.11 2.3.4-spark used by Spark was built with Scala compiler 2.11.0, which ignores @SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549).
The following steps can resolve the issue:
- re-build the custom akka library used by Spark with a more recent version of the Scala compiler (e.g. 2.11.6)
- deploy a new version (e.g. 2.3.4.1-spark) to a maven repo
- update the version of akka used by Spark (master and the 1.3 branch)
I would also suggest upgrading to the latest version of akka, 2.3.9 (or 2.3.10, which should be released soon).
[jira] [Commented] (SPARK-6566) Update Spark to use the latest version of Parquet libraries
[ https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386170#comment-14386170 ] Konstantin Shaposhnikov commented on SPARK-6566:

Thank you for the update, [~lian cheng]

> Update Spark to use the latest version of Parquet libraries
> ---
>
> Key: SPARK-6566
> URL: https://issues.apache.org/jira/browse/SPARK-6566
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Konstantin Shaposhnikov
>
> There are a lot of bug fixes in the latest version of parquet (1.6.0rc7), e.g. PARQUET-136. It would be good to update Spark to use the latest parquet version.
> The following changes are required:
> {code}
> diff --git a/pom.xml b/pom.xml
> index 5ad39a9..095b519 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -132,7 +132,7 @@
>      <!-- Version used for internal directory structure -->
>      <hive.version.short>0.13.1</hive.version.short>
>      <derby.version>10.10.1.1</derby.version>
> -    <parquet.version>1.6.0rc3</parquet.version>
> +    <parquet.version>1.6.0rc7</parquet.version>
>      <jblas.version>1.2.3</jblas.version>
>      <jetty.version>8.1.14.v20131031</jetty.version>
>      <orbit.version>3.0.0.v201112011016</orbit.version>
> {code}
> and
> {code}
> --- a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
> +++ b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
> @@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
>      globalMetaData = new GlobalMetaData(globalMetaData.getSchema, mergedMetadata, globalMetaData.getCreatedBy)
> -    val readContext = getReadSupport(configuration).init(
> +    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
>        new InitContext(configuration, globalMetaData.getKeyValueMetaData, globalMetaData.getSchema))
> {code}
> I am happy to prepare a pull request if necessary.
[jira] [Created] (SPARK-6566) Update Spark to use the latest version of Parquet libraries
Konstantin Shaposhnikov created SPARK-6566: --

Summary: Update Spark to use the latest version of Parquet libraries
Key: SPARK-6566
URL: https://issues.apache.org/jira/browse/SPARK-6566
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov

There are a lot of bug fixes in the latest version of parquet (1.6.0rc7), e.g. PARQUET-136. It would be good to update Spark to use the latest parquet version.
The following changes are required:
{code}
diff --git a/pom.xml b/pom.xml
index 5ad39a9..095b519 100644
--- a/pom.xml
+++ b/pom.xml
@@ -132,7 +132,7 @@
     <!-- Version used for internal directory structure -->
     <hive.version.short>0.13.1</hive.version.short>
     <derby.version>10.10.1.1</derby.version>
-    <parquet.version>1.6.0rc3</parquet.version>
+    <parquet.version>1.6.0rc7</parquet.version>
     <jblas.version>1.2.3</jblas.version>
     <jetty.version>8.1.14.v20131031</jetty.version>
     <orbit.version>3.0.0.v201112011016</orbit.version>
{code}
and
{code}
--- a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
@@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
     globalMetaData = new GlobalMetaData(globalMetaData.getSchema, mergedMetadata, globalMetaData.getCreatedBy)
-    val readContext = getReadSupport(configuration).init(
+    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
      new InitContext(configuration, globalMetaData.getKeyValueMetaData, globalMetaData.getSchema))
{code}
I am happy to prepare a pull request if necessary.
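As a quick local experiment, a downstream build could also try the newer Parquet libraries without patching Spark's pom. This is a hedged sketch rather than the fix proposed above; the com.twitter coordinates are assumed from the pre-Apache parquet-mr releases:

{code}
// build.sbt -- experimental override only; the proper fix is the pom.xml
// change above. Pre-Apache parquet-mr artifacts lived under com.twitter.
dependencyOverrides ++= Set(
  "com.twitter" % "parquet-column" % "1.6.0rc7",
  "com.twitter" % "parquet-hadoop" % "1.6.0rc7"
)
{code}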
[jira] [Created] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns
Konstantin Shaposhnikov created SPARK-6489: --

Summary: Optimize lateral view with explode to not read unnecessary columns
Key: SPARK-6489
URL: https://issues.apache.org/jira/browse/SPARK-6489
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov

Currently a query with lateral view explode(...) results in an execution plan that reads all columns of the underlying RDD.

E.g. given that the *ppl* table is a DF created from the Person case class:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])
{code}
the following SQL:
{code}
select name, sum(d) from ppl lateral view explode(data) d as d group by name
{code}
executes as follows:
{noformat}
== Physical Plan ==
Aggregate false, [name#0], [name#0,SUM(PartialSum#8L) AS _c1#3L]
 Exchange (HashPartitioning [name#0], 200)
  Aggregate true, [name#0], [name#0,SUM(CAST(d#6, LongType)) AS PartialSum#8L]
   Project [name#0,d#6]
    Generate explode(data#2), true, false
     PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
{noformat}
Note that the *age* column is not needed to produce the output but it is still read from the underlying RDD.

A sample program to demonstrate the issue:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

case class Person(val name: String, val age: Int, val data: Array[Int])

object ExplodeDemo extends App {
  val ppl = Array(
    Person("A", 20, Array(10, 12, 19)),
    Person("B", 25, Array(7, 8, 4)),
    Person("C", 19, Array(12, 4, 232)))

  val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
  val sc = new SparkContext(conf)
  val sqlCtx = new HiveContext(sc)
  import sqlCtx.implicits._

  val df = sc.makeRDD(ppl).toDF
  df.registerTempTable("ppl")

  val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
  s.explain(true)
}
{code}
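Until the planner learns to prune columns through Generate, the pruning can be forced by hand. A workaround sketch, assuming the same ppl table and sqlCtx as in the sample program above (the ppl_narrow table name is hypothetical, and whether the scan actually prunes should be confirmed with explain):

{code}
// Workaround sketch: pre-project only the columns the query needs, so the
// Project sits directly on the scan and the unused "age" column can be
// dropped before the lateral view runs.
val narrow = sqlCtx.sql("select name, data from ppl")
narrow.registerTempTable("ppl_narrow")

val pruned = sqlCtx.sql(
  "select name, sum(d) from ppl_narrow lateral view explode(data) d as d group by name")
pruned.explain(true) // check whether age#1 still appears in the scan
{code}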
[jira] [Updated] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns
[ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shaposhnikov updated SPARK-6489: ---

Description:

Currently a query with lateral view explode(...) results in an execution plan that reads all columns of the underlying RDD.

E.g. given that the *ppl* table is a DF created from the Person case class:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])
{code}
the following SQL:
{code}
select name, sum(d) from ppl lateral view explode(data) d as d group by name
{code}
executes as follows:
{noformat}
== Physical Plan ==
Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L]
 Exchange (HashPartitioning [name#0], 200)
  Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS PartialSum#38L]
   Project [name#0,d#21]
    Generate explode(data#2), true, false
     InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35), Some(ppl))
{noformat}
Note that the *age* column is not needed to produce the output but it is still read from the underlying RDD.

A sample program to demonstrate the issue:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

case class Person(val name: String, val age: Int, val data: Array[Int])

object ExplodeDemo extends App {
  val ppl = Array(
    Person("A", 20, Array(10, 12, 19)),
    Person("B", 25, Array(7, 8, 4)),
    Person("C", 19, Array(12, 4, 232)))

  val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
  val sc = new SparkContext(conf)
  val sqlCtx = new HiveContext(sc)
  import sqlCtx.implicits._

  val df = sc.makeRDD(ppl).toDF
  df.registerTempTable("ppl")
  sqlCtx.cacheTable("ppl") // cache the table; otherwise an ExistingRDD, which does not support column pruning, will be used

  val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
  s.explain(true)
}
{code}

was:

Currently a query with lateral view explode(...) results in an execution plan that reads all columns of the underlying RDD.

E.g. given that the *ppl* table is a DF created from the Person case class:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])
{code}
the following SQL:
{code}
select name, sum(d) from ppl lateral view explode(data) d as d group by name
{code}
executes as follows:
{noformat}
== Physical Plan ==
Aggregate false, [name#0], [name#0,SUM(PartialSum#8L) AS _c1#3L]
 Exchange (HashPartitioning [name#0], 200)
  Aggregate true, [name#0], [name#0,SUM(CAST(d#6, LongType)) AS PartialSum#8L]
   Project [name#0,d#6]
    Generate explode(data#2), true, false
     PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
{noformat}
Note that the *age* column is not needed to produce the output but it is still read from the underlying RDD.
A sample program to demonstrate the issue:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

case class Person(val name: String, val age: Int, val data: Array[Int])

object ExplodeDemo extends App {
  val ppl = Array(
    Person("A", 20, Array(10, 12, 19)),
    Person("B", 25, Array(7, 8, 4)),
    Person("C", 19, Array(12, 4, 232)))

  val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
  val sc = new SparkContext(conf)
  val sqlCtx = new HiveContext(sc)
  import sqlCtx.implicits._

  val df = sc.makeRDD(ppl).toDF
  df.registerTempTable("ppl")

  val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
  s.explain(true)
}
{code}

Optimize lateral view with explode to not read unnecessary columns
--

Key: SPARK-6489
URL: https://issues.apache.org/jira/browse/SPARK-6489
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov

Currently a query with lateral view explode(...) results in an execution plan that reads all columns of the underlying RDD.

E.g. given that the *ppl* table is a DF created from the Person case class:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])
{code}
the following SQL:
{code}
select name, sum(d) from ppl lateral view explode(data) d as d group by name
{code}
executes as follows:
{noformat}
== Physical Plan ==
Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L]
 Exchange (HashPartitioning [name#0], 200)
  Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS PartialSum#38L]
   Project [name#0,d#21]
    Generate explode(data#2), true, false
     InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35), Some(ppl))
{noformat}