[jira] [Comment Edited] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2
[ https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566981#comment-16566981 ]

Joseph Fourny edited comment on SPARK-24826 at 8/2/18 3:57 PM:
---
I was able to reproduce this defect with an inner join of two temp views that refer to equivalent local relations. I started by creating two datasets (in Java) from a List of GenericRow and registered them as separate views. As far as the optimizer is concerned, the contents of the local relations are the same. If you update one of the datasets to make them distinct, the assertion is no longer triggered. Note: I had to force a SortMergeJoin to trigger the issue in ExchangeCoordinator.

was (Author: josephfourny):
I was able to reproduce this defect with an inner join of two temp views that refer to equivalent local relations. I started by creating two datasets (in Java) from a List of GenericRow and registered them as separate views. As far as the optimizer is concerned, the contents of the local relations are the same. Note: I had to force a SortMergeJoin to trigger the issue in ExchangeCoordinator.
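The conflation described above can be illustrated with a simplified, hypothetical model (invented class and method names, not Spark source code): if an optimizer reuses plan fragments based on value equality, two local relations built from equal row lists collapse into one shared node, so both sides of the join end up referring to the same subtree.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Simplified, hypothetical model (not Spark source code) of why an optimizer
 * can conflate two temp views backed by equal data: plan nodes that compare
 * equal by value are deduplicated into a single shared node.
 */
public class LocalRelationInterning {

    /** A stand-in for a local relation; records compare by contents. */
    public record LocalRelation(List<List<Object>> rows) {}

    private final Map<LocalRelation, LocalRelation> seen = new HashMap<>();

    /** Returns the first plan node equal to this one, reusing it if already seen. */
    public LocalRelation intern(LocalRelation plan) {
        return seen.computeIfAbsent(plan, p -> p);
    }

    public static void main(String[] args) {
        LocalRelationInterning optimizer = new LocalRelationInterning();
        // Two views created independently, but from equal row data.
        LocalRelation viewA = new LocalRelation(List.of(List.of((Object) 1L)));
        LocalRelation viewB = new LocalRelation(List.of(List.of((Object) 1L)));
        // Both resolve to the same node -- effectively a self-join.
        System.out.println(optimizer.intern(viewA) == optimizer.intern(viewB)); // true
    }
}
```

As for forcing the SortMergeJoin mentioned in the comment: in an actual Spark session this is usually achieved by setting spark.sql.autoBroadcastJoinThreshold=-1, which prevents the planner from choosing a broadcast join for small inputs.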
> Self-Join not working in Apache Spark 2.2.2
> ---
> Key: SPARK-24826
> URL: https://issues.apache.org/jira/browse/SPARK-24826
> Project: Spark
> Issue Type: Bug
> Components: Optimizer
> Affects Versions: 2.2.2
> Reporter: Michail Giannakopoulos
> Priority: Major
> Attachments: part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet
>
> Running a self-join against a table derived from a parquet file with many columns fails during the planning phase with the following stack trace:
>
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), coordinator[target post-shuffle partition size: 67108864]
> +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields]
>    +- Filter isnotnull(_row_id#0L)
>       +- FileScan parquet [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... 92 more fields] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-..., PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...
>
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
> at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
> at
[jira] [Commented] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2
[ https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566981#comment-16566981 ]

Joseph Fourny commented on SPARK-24826:
---
I was able to reproduce this defect with an inner join of two temp views that refer to equivalent local relations. I started by creating two datasets (in Java) from a List of GenericRow and registered them as separate views. As far as the optimizer is concerned, the contents of the local relations are the same. Note: I had to force a SortMergeJoin to trigger the issue in ExchangeCoordinator.
[jira] [Commented] (SPARK-19566) Error initializing SparkContext under a Windows SYSTEM user
[ https://issues.apache.org/jira/browse/SPARK-19566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287015#comment-16287015 ]

Joseph Fourny commented on SPARK-19566:
---
Did this ever get resolved? I am stuck with the same issue when using a system account. :(

> Error initializing SparkContext under a Windows SYSTEM user
> ---
> Key: SPARK-19566
> URL: https://issues.apache.org/jira/browse/SPARK-19566
> Project: Spark
> Issue Type: Bug
> Components: Windows
> Affects Versions: 2.1.0
> Reporter: boddie
>
> We use the SparkLauncher class in our application, which runs in a WebSphere Application Server (started as a service). When we try to submit an application to Spark (running in standalone mode without Hadoop), we get this error:
>
> ERROR SparkContext: Error initializing SparkContext.
> Exception in thread "main" java.io.IOException: failure to login
> at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:824)
> at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
> at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
> at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2828)
> at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2818)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at org.apache.spark.SparkContext.addFile(SparkContext.scala:1452)
> at org.apache.spark.SparkContext.addFile(SparkContext.scala:1425)
> at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:470)
> at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:470)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at org.apache.spark.SparkContext.(SparkContext.scala:470)
> at org.apache.spark.SparkContext.(SparkContext.scala:117)
> at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:53)
> at com.ibm.el.expertise.spark.MatrixCompletionRunner.main(MatrixCompletionRunner.java:46)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:56)
> at java.lang.reflect.Method.invoke(Method.java:620)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: javax.security.auth.login.LoginException: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 3
> at com.ibm.security.auth.module.Win64System.getCurrent(Native Method)
> at com.ibm.security.auth.module.Win64System.(Win64System.java:74)
> at com.ibm.security.auth.module.Win64LoginModule.login(Win64LoginModule.java:143)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:56)
> at java.lang.reflect.Method.invoke(Method.java:620)
> at javax.security.auth.login.LoginContext.invoke(LoginContext.java:781)
> at javax.security.auth.login.LoginContext.access$000(LoginContext.java:215)
> at javax.security.auth.login.LoginContext$4.run(LoginContext.java:706)
> at javax.security.auth.login.LoginContext$4.run(LoginContext.java:704)
> at java.security.AccessController.doPrivileged(AccessController.java:456)
> at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:703)
> at javax.security.auth.login.LoginContext.login(LoginContext.java:609)
> at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:799)
> at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
> at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
> at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2828)
> at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:2818)
[jira] [Comment Edited] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332960#comment-15332960 ]

Joseph Fourny edited comment on SPARK-15690 at 6/16/16 3:00 AM:
---
I am trying to develop single-node clusters on large servers (30+ CPU cores) with 2-3 TB of RAM. Our use cases involve small to medium-size datasets, but with a huge number of concurrent jobs (shared, multi-tenant environments). Efficiency and sub-second response times are the primary requirements, and the shuffle between stages is the current bottleneck. Writing anything to disk is just a waste of time if all computations are done in the same JVM (or a small set of JVMs on the same machine). We tried using RAMFS to avoid disk I/O, but a lot of CPU time is still spent in compression and serialization. I would rather not hack my way out of this one. Is it wishful thinking to have this officially supported?

was (Author: josephfourny):
+1 on this. I am trying to develop single-node clusters on large servers (30+ CPU cores) with 2-3 TB of RAM. Our use cases involve small to medium-size datasets, but with a huge number of concurrent jobs (shared, multi-tenant environments). Efficiency and sub-second response times are the primary requirements, and the shuffle between stages is the current bottleneck. Writing anything to disk is just a waste of time if all computations are done in the same JVM (or a small set of JVMs on the same machine). We tried using RAMFS to avoid disk I/O, but a lot of CPU time is still spent in compression and serialization. I would rather not hack my way out of this one. Is it wishful thinking to have this officially supported?
> Fast single-node (single-process) in-memory shuffle
> ---
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle, SQL
> Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by partition id and then writes the data to disk. This is not a big bottleneck because the network throughput on commodity clusters tends to be low. However, an increasing number of Spark users are using the system to process data on a single node. When a single node operates against intermediate data that fits in memory, the existing shuffle code path can become a big bottleneck.
>
> The goal of this ticket is to change Spark so it can use an in-memory radix sort to do data shuffling on a single node, and still gracefully fall back to disk if the data does not fit in memory. Given that the number of partitions is usually small (say, less than 256), it would require only a single pass to do the radix sort with pretty decent CPU efficiency.
>
> Note that there have been many in-memory shuffle attempts in the past. This ticket has a smaller scope (single-process) and aims to actually productionize this code.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
[ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332960#comment-15332960 ]

Joseph Fourny commented on SPARK-15690:
---
+1 on this. I am trying to develop single-node clusters on large servers (30+ CPU cores) with 2-3 TB of RAM. Our use cases involve small to medium-size datasets, but with a huge number of concurrent jobs (shared, multi-tenant environments). Efficiency and sub-second response times are the primary requirements, and the shuffle between stages is the current bottleneck. Writing anything to disk is just a waste of time if all computations are done in the same JVM (or a small set of JVMs on the same machine). We tried using RAMFS to avoid disk I/O, but a lot of CPU time is still spent in compression and serialization. I would rather not hack my way out of this one. Is it wishful thinking to have this officially supported?
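The single-pass idea from the ticket can be sketched as follows (an illustrative counting sort over partition ids, not Spark's actual shuffle code): one pass builds a histogram of records per partition, prefix sums turn it into output offsets, and a second pass scatters each record into its partition's contiguous slot, entirely in memory.

```java
import java.util.Arrays;

/**
 * Illustrative sketch (not Spark's actual shuffle implementation): when the
 * number of partitions is small, records can be grouped by partition id with
 * a counting sort, entirely in memory, instead of sorting and spilling to disk.
 */
public class InMemoryShuffleSketch {

    /** Returns record indices ordered so each partition's records are contiguous. */
    public static int[] sortByPartition(int[] partitionIds, int numPartitions) {
        // Pass 1: histogram of records per partition.
        int[] counts = new int[numPartitions + 1];
        for (int p : partitionIds) {
            counts[p + 1]++;
        }
        // Prefix sums give each partition's starting offset in the output.
        for (int i = 1; i <= numPartitions; i++) {
            counts[i] += counts[i - 1];
        }
        // Pass 2: scatter record indices into their partition's slot.
        int[] cursor = Arrays.copyOf(counts, counts.length);
        int[] order = new int[partitionIds.length];
        for (int i = 0; i < partitionIds.length; i++) {
            order[cursor[partitionIds[i]]++] = i;
        }
        return order;
    }

    public static void main(String[] args) {
        int[] pids = {2, 0, 1, 2, 0};
        System.out.println(Arrays.toString(sortByPartition(pids, 3))); // [1, 4, 2, 0, 3]
    }
}
```

With fewer than 256 partitions the histogram fits in a cache line or two, which is why the ticket expects decent CPU efficiency from a single radix/counting pass.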
[jira] [Commented] (SPARK-4815) ThriftServer use only one SessionState to run sql using hive
[ https://issues.apache.org/jira/browse/SPARK-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743555#comment-14743555 ]

Joseph Fourny commented on SPARK-4815:
---
Is this really fixed? I am on Spark 1.5.0 (rc3) and I see very little isolation between JDBC connections to the ThriftServer. For example, "SET X=Y" or "USE DATABASE X" on one connection immediately affects all other connections. This is extremely undesirable behavior. Was there a regression at some point?

> ThriftServer use only one SessionState to run sql using hive
> ---
> Key: SPARK-4815
> URL: https://issues.apache.org/jira/browse/SPARK-4815
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: guowei
>
> The ThriftServer uses only one SessionState to run SQL using Hive, even though the SQL comes from different Hive sessions. This causes mistakes: for example, when one user runs "use database" in one beeline client, the database changes in every other beeline client too.
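The isolation the commenter expected can be sketched as follows (hypothetical class and method names, not the ThriftServer's real SessionState): each connection gets its own mutable property map, so a "SET x=y" on one connection can never be observed by another.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of per-connection session state (invented names, not
 * Spark's actual implementation): one property map per session id instead of
 * a single SessionState shared by every JDBC connection.
 */
public class PerSessionState {

    private final Map<String, Map<String, String>> sessions = new ConcurrentHashMap<>();

    /** SET key=value, scoped to a single session. */
    public void set(String sessionId, String key, String value) {
        sessions.computeIfAbsent(sessionId, id -> new ConcurrentHashMap<>()).put(key, value);
    }

    /** Reads a property; sessions never observe each other's writes. */
    public String get(String sessionId, String key) {
        Map<String, String> props = sessions.get(sessionId);
        return props == null ? null : props.get(key);
    }

    public static void main(String[] args) {
        PerSessionState state = new PerSessionState();
        state.set("conn1", "current_db", "sales");
        System.out.println(state.get("conn1", "current_db")); // sales
        System.out.println(state.get("conn2", "current_db")); // null
    }
}
```

The bug report amounts to the single-map version of this design: with one shared map, the "SET" and "use database" writes from one beeline client are immediately visible to all others, which is exactly the behavior the commenter observed on 1.5.0 (rc3).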