[jira] [Comment Edited] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-08-02 Thread Joseph Fourny (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566981#comment-16566981
 ] 

Joseph Fourny edited comment on SPARK-24826 at 8/2/18 3:57 PM:
---

I was able to reproduce this defect with an inner join of two temp views that 
refer to equivalent local relations. I started by creating 2 datasets (in Java) 
from a List of GenericRow and registering them as separate views. As far as the 
optimizer is concerned, the contents of the local relations are the same. If 
you update one of the datasets to make them distinct, the assertion is no 
longer triggered. Note: I have to force a SortMergeJoin to trigger the issue in 
ExchangeCoordinator. 
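
A minimal Java sketch of this reproduction (a sketch under assumptions, not the 
original code: the class, view, and column names are illustrative; adaptive 
execution is enabled so the ExchangeCoordinator is on the code path, and 
broadcast joins are disabled to force a SortMergeJoin):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class SelfJoinRepro {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .master("local[2]")
                    // engage the ExchangeCoordinator
                    .config("spark.sql.adaptive.enabled", "true")
                    // disable broadcast joins so the planner picks SortMergeJoin
                    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
                    .getOrCreate();

            StructType schema = new StructType()
                    .add("id", DataTypes.LongType)
                    .add("name", DataTypes.StringType);

            // Two datasets built from the same rows yield equivalent local
            // relations; making the row lists distinct avoids the assertion.
            List<Row> rows = Arrays.asList(
                    RowFactory.create(1L, "a"),
                    RowFactory.create(2L, "b"));
            spark.createDataFrame(rows, schema).createOrReplaceTempView("t1");
            spark.createDataFrame(rows, schema).createOrReplaceTempView("t2");

            // Inner join of the two equivalent views trips the assertion.
            spark.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id").show();
        }
    }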


was (Author: josephfourny):
I was able to reproduce this defect with an inner join of two temp views that 
refer to equivalent local relations. I started by creating 2 datasets (in Java) 
from a List of GenericRow and registering them as separate views. As far as the 
optimizer is concerned, the contents of the local relations are the same. Note: 
I have to force a SortMergeJoin to trigger the issue in ExchangeCoordinator. 

> Self-Join not working in Apache Spark 2.2.2
> ---
>
> Key: SPARK-24826
> URL: https://issues.apache.org/jira/browse/SPARK-24826
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.2
>Reporter: Michail Giannakopoulos
>Priority: Major
> Attachments: 
> part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet
>
>
> Running a self-join against a table derived from a parquet file with many 
> columns fails during the planning phase with the following stack trace:
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
>  Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
> funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
> emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
> verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
> desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields]
>  +- Filter isnotnull(_row_id#0L)
>  +- FileScan parquet 
> [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
> struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> 
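
The shape of the failing query, as a hedged Java sketch (reusing the 
SparkSession configured in the sketch above; the view name and alias names are 
illustrative, and the parquet path is the one from the trace):

    Dataset<Row> df = spark.read()
            .parquet("C:/Users/gianna/Desktop/alpha.parquet");
    df.createOrReplaceTempView("loans");
    // self-join on _row_id, the column in hashpartitioning(_row_id#0L, 2)
    spark.sql("SELECT a.* FROM loans a JOIN loans b "
            + "ON a._row_id = b._row_id").show();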

[jira] [Commented] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-08-02 Thread Joseph Fourny (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566981#comment-16566981
 ] 

Joseph Fourny commented on SPARK-24826:
---

I was able to reproduce this defect with an inner join of two temp views that 
refer to equivalent local relations. I started by creating 2 datasets (in Java) 
from a List of GenericRow and registering them as separate views. As far as the 
optimizer is concerned, the contents of the local relations are the same. Note: 
I have to force a SortMergeJoin to trigger the issue in ExchangeCoordinator. 

> Self-Join not working in Apache Spark 2.2.2
> ---
>
> Key: SPARK-24826
> URL: https://issues.apache.org/jira/browse/SPARK-24826
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.2
>Reporter: Michail Giannakopoulos
>Priority: Major
> Attachments: 
> part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet
>
>
> Running a self-join against a table derived from a parquet file with many 
> columns fails during the planning phase with the following stack trace:
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
>  Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
> funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
> emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
> verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
> desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields]
>  +- Filter isnotnull(_row_id#0L)
>  +- FileScan parquet 
> [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
> struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> 

[jira] [Commented] (SPARK-19566) Error initializing SparkContext under a Windows SYSTEM user

2017-12-11 Thread Joseph Fourny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287015#comment-16287015
 ] 

Joseph Fourny commented on SPARK-19566:
---

Did this ever get resolved? I am stuck with the same issue when using a system 
account. :(

> Error initializing SparkContext under a Windows SYSTEM user
> ---
>
> Key: SPARK-19566
> URL: https://issues.apache.org/jira/browse/SPARK-19566
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.1.0
>Reporter: boddie
>
> We use the SparkLauncher class in our application, which runs in a 
> WebSphere Application Server (it is started as a service). When we try to 
> submit an application to Spark (running in standalone mode without Hadoop), 
> we get this error:
> ERROR SparkContext: Error initializing SparkContext.
> Exception in thread "main" java.io.IOException: failure to login
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:824)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2828)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1452)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1425)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:470)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:470)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:470)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
>   at 
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
>   at 
> com.ibm.el.expertise.spark.MatrixCompletionRunner.main(MatrixCompletionRunner.java:46)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:56)
>   at java.lang.reflect.Method.invoke(Method.java:620)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: javax.security.auth.login.LoginException: 
> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 3
>   at com.ibm.security.auth.module.Win64System.getCurrent(Native Method)
>   at com.ibm.security.auth.module.Win64System.<init>(Win64System.java:74)
>   at 
> com.ibm.security.auth.module.Win64LoginModule.login(Win64LoginModule.java:143)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:56)
>   at java.lang.reflect.Method.invoke(Method.java:620)
>   at javax.security.auth.login.LoginContext.invoke(LoginContext.java:781)
>   at 
> javax.security.auth.login.LoginContext.access$000(LoginContext.java:215)
>   at javax.security.auth.login.LoginContext$4.run(LoginContext.java:706)
>   at javax.security.auth.login.LoginContext$4.run(LoginContext.java:704)
>   at 
> java.security.AccessController.doPrivileged(AccessController.java:456)
>   at 
> javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:703)
>   at javax.security.auth.login.LoginContext.login(LoginContext.java:609)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:799)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2828)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
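
For context, a hedged Java sketch of the kind of SparkLauncher submission 
described above (the paths, master URL, and main class are illustrative 
stand-ins, not the reporter's actual values):

    import org.apache.spark.launcher.SparkLauncher;

    public class SubmitFromService {
        public static void main(String[] args) throws Exception {
            Process spark = new SparkLauncher()
                    .setSparkHome("C:\\spark")
                    // standalone master, no Hadoop cluster involved
                    .setMaster("spark://master-host:7077")
                    .setAppResource("C:\\apps\\matrix-completion.jar")
                    .setMainClass("com.example.MatrixCompletionRunner")
                    .launch();
            spark.waitFor();
        }
    }

The failure itself happens later, on the driver side, when Hadoop's 
UserGroupInformation performs a JAAS login for the current OS user; under the 
Windows SYSTEM account that native lookup throws, as the trace shows.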

[jira] [Comment Edited] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-15 Thread Joseph Fourny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332960#comment-15332960
 ] 

Joseph Fourny edited comment on SPARK-15690 at 6/16/16 3:00 AM:


I am trying to deploy single-node clusters on large servers (30+ CPU cores) 
with 2-3 TB of RAM. Our use cases involve small to medium-sized datasets, but 
with a huge number of concurrent jobs (shared, multi-tenant environments). 
Efficiency and sub-second response times are the primary requirements. The 
shuffle between stages is the current bottleneck. Writing anything to disk is 
just a waste of time if all computations are done in the same JVM (or a small 
set of JVMs on the same machine). We tried using RAMFS to avoid disk I/O, but 
a lot of CPU time is still spent in compression and serialization. I would 
rather not hack my way out of this one. Is it wishful thinking to have this 
officially supported?
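
A hedged sketch of the configuration surface that exists today (the settings 
are real Spark options, but the values and the RAMFS path are illustrative); 
it removes compression from the shuffle path, though the serialization cost 
described above remains:

    import org.apache.spark.sql.SparkSession;

    public class SingleNodeSession {
        public static SparkSession create() {
            return SparkSession.builder()
                    .master("local[32]")                             // one JVM, many cores
                    .config("spark.shuffle.compress", "false")       // no compression of map outputs
                    .config("spark.shuffle.spill.compress", "false") // no compression of spills
                    .config("spark.local.dir", "/mnt/ramfs")         // shuffle files on RAMFS
                    .getOrCreate();
        }
    }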


was (Author: josephfourny):
+1 on this. I am trying to deploy single-node clusters on large servers (30+ 
CPU cores) with 2-3 TB of RAM. Our use cases involve small to medium-sized 
datasets, but with a huge number of concurrent jobs (shared, multi-tenant 
environments). Efficiency and sub-second response times are the primary 
requirements. The shuffle between stages is the current bottleneck. Writing 
anything to disk is just a waste of time if all computations are done in the 
same JVM (or a small set of JVMs on the same machine). We tried using RAMFS to 
avoid disk I/O, but a lot of CPU time is still spent in compression and 
serialization. I would rather not hack my way out of this one. Is it wishful 
thinking to have this officially supported?

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by 
> partition id and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. 
> However, an increasing number of Spark users are using the system to process 
> data on a single node. When operating on a single node against intermediate 
> data that fits in memory, the existing shuffle code path can become a big 
> bottleneck.
> The goal of this ticket is to change Spark so it can use an in-memory radix 
> sort to do data shuffling on a single node, and still gracefully fall back 
> to disk if the data does not fit in memory. Given that the number of 
> partitions is usually small (say, less than 256), it would require only a 
> single pass to do the radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process) and aims to actually 
> productionize this code.
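
As a toy illustration of the single radix pass the description refers to (this 
is not Spark code): with fewer than 256 partitions, one 8-bit digit suffices, 
so grouping records by partition id is a counting sort: one histogram pass 
plus one scatter pass, entirely in memory.

    public class PartitionRadixSort {
        /** Returns record positions reordered so each partition is contiguous. */
        public static int[] sortByPartition(int[] partitionIds, int numPartitions) {
            int[] offsets = new int[numPartitions + 1];
            for (int p : partitionIds) offsets[p + 1]++;           // histogram
            for (int i = 0; i < numPartitions; i++)
                offsets[i + 1] += offsets[i];                      // prefix sums = bucket starts
            int[] cursor = new int[numPartitions];
            System.arraycopy(offsets, 0, cursor, 0, numPartitions);
            int[] out = new int[partitionIds.length];
            for (int i = 0; i < partitionIds.length; i++)
                out[cursor[partitionIds[i]]++] = i;                // scatter record index
            return out;
        }
    }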






[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-15 Thread Joseph Fourny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332960#comment-15332960
 ] 

Joseph Fourny commented on SPARK-15690:
---

+1 on this. I am trying to deploy single-node clusters on large servers (30+ 
CPU cores) with 2-3 TB of RAM. Our use cases involve small to medium-sized 
datasets, but with a huge number of concurrent jobs (shared, multi-tenant 
environments). Efficiency and sub-second response times are the primary 
requirements. The shuffle between stages is the current bottleneck. Writing 
anything to disk is just a waste of time if all computations are done in the 
same JVM (or a small set of JVMs on the same machine). We tried using RAMFS to 
avoid disk I/O, but a lot of CPU time is still spent in compression and 
serialization. I would rather not hack my way out of this one. Is it wishful 
thinking to have this officially supported?

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by 
> partition id and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. 
> However, an increasing number of Spark users are using the system to process 
> data on a single node. When operating on a single node against intermediate 
> data that fits in memory, the existing shuffle code path can become a big 
> bottleneck.
> The goal of this ticket is to change Spark so it can use an in-memory radix 
> sort to do data shuffling on a single node, and still gracefully fall back 
> to disk if the data does not fit in memory. Given that the number of 
> partitions is usually small (say, less than 256), it would require only a 
> single pass to do the radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process) and aims to actually 
> productionize this code.






[jira] [Commented] (SPARK-4815) ThriftServer use only one SessionState to run sql using hive

2015-09-14 Thread Joseph Fourny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743555#comment-14743555
 ] 

Joseph Fourny commented on SPARK-4815:
--

Is this really fixed? I am on Spark 1.5.0 (rc3) and I see very little isolation 
between JDBC connections to the ThriftServer. For example, "SET X=Y" or "USE 
DATABASE X" on one connection immediately affects all other connections. This 
is extremely undesirable behavior. Was there a regression at some point?
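
A hedged Java sketch of the check behind this observation (the URL and 
credentials are illustrative): two JDBC connections to the same ThriftServer, 
where a SET issued on the first is visible from the second:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SessionLeakCheck {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection c1 = DriverManager.getConnection(url, "user1", "");
                 Connection c2 = DriverManager.getConnection(url, "user2", "")) {
                try (Statement s1 = c1.createStatement()) {
                    s1.execute("SET x=y");                 // set on connection 1
                }
                try (Statement s2 = c2.createStatement();
                     ResultSet rs = s2.executeQuery("SET x")) {
                    while (rs.next()) {
                        // with a shared SessionState, connection 2
                        // sees the value set on connection 1
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }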

> ThriftServer use only one SessionState to run sql using hive 
> -
>
> Key: SPARK-4815
> URL: https://issues.apache.org/jira/browse/SPARK-4815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: guowei
>
> ThriftServer uses only one SessionState to run SQL through Hive, even 
> though the requests come from different Hive sessions.
> This causes mistakes:
> For example, when one user runs "use database" in one Beeline client, the 
> current database changes in the other Beeline clients too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org