[jira] [Updated] (SPARK-46354) Add restriction parameters to dynamic partitions for writing datasource

2023-12-10 Thread hao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao updated SPARK-46354:

Priority: Trivial  (was: Major)

> Add restriction parameters to dynamic partitions for writing datasource
> ---
>
> Key: SPARK-46354
> URL: https://issues.apache.org/jira/browse/SPARK-46354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: hao
>Priority: Trivial
>
> Add restriction parameters to dynamic partitions when writing a datasource table. 
> `InsertIntoHiveTable` limits the number of dynamic partitions a single write may 
> create, while `InsertIntoHadoopFsRelationCommand` does not; the latter should be 
> aligned with the former.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46354) Add restriction parameters to dynamic partitions for writing datasource

2023-12-10 Thread hao (Jira)
hao created SPARK-46354:
---

 Summary: Add restriction parameters to dynamic partitions for 
writing datasource
 Key: SPARK-46354
 URL: https://issues.apache.org/jira/browse/SPARK-46354
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: hao


Add restriction parameters to dynamic partitions when writing a datasource table.

`InsertIntoHiveTable` limits the number of dynamic partitions a single write may 
create, while `InsertIntoHadoopFsRelationCommand` does not; the latter should be 
aligned with the former.
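
A minimal sketch of the kind of guard this ticket asks for, expressed at the 
application level rather than inside InsertIntoHadoopFsRelationCommand: count the 
distinct partition values a write would produce and fail fast when a configured cap 
is exceeded. The table name and the key spark.sql.sources.maxDynamicPartitions are 
hypothetical; Spark does not expose such a setting today, which is the gap described 
above.
{code:java}
import org.apache.spark.sql.SparkSession

object DynamicPartitionGuard {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dynamic-partition-guard").getOrCreate()

    // Hypothetical cap; aligning InsertIntoHadoopFsRelationCommand with the
    // Hive-side limit would make an application-level check like this unnecessary.
    val maxDynamicPartitions =
      spark.conf.get("spark.sql.sources.maxDynamicPartitions", "1000").toInt

    val df = spark.table("events_staging")  // illustrative source table

    // Count the distinct partition values this write would create.
    val partitionCount = df.select("dt").distinct().count()
    require(partitionCount <= maxDynamicPartitions,
      s"Write would create $partitionCount dynamic partitions, " +
        s"exceeding the configured limit of $maxDynamicPartitions")

    df.write.partitionBy("dt").mode("overwrite").parquet("/warehouse/events")
  }
}
{code}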



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44483) When using Spark to read the hive table, the number of file partitions cannot be set using Spark's configuration settings

2023-07-19 Thread hao (Jira)
hao created SPARK-44483:
---

 Summary: When using Spark to read the hive table, the number of 
file partitions cannot be set using Spark's configuration settings
 Key: SPARK-44483
 URL: https://issues.apache.org/jira/browse/SPARK-44483
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.1
Reporter: hao


When Spark reads a Hive table, the number of file partitions cannot be controlled 
through Spark's configuration settings.
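
As a hedged illustration of what a user would expect to work, the sketch below sets 
the datasource split-size options before reading the table; the table name and values 
are illustrative. For a table read through the Hive-SerDe path these options may not 
take effect (which appears to be what this ticket reports), and the Hadoop split 
settings are usually what matter there instead.
{code:java}
import org.apache.spark.sql.SparkSession

object HiveReadPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-read-partitions")
      // Split-size settings honoured by file-based datasource reads.
      .config("spark.sql.files.maxPartitionBytes", "64m")
      .config("spark.sql.files.openCostInBytes", "4m")
      // Hadoop-side knob that usually governs splits on the Hive-SerDe read path.
      .config("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", "67108864")
      .enableHiveSupport()
      .getOrCreate()

    // For a Hive-SerDe table the Spark settings above may be ignored, which is
    // the behaviour this ticket reports.
    val df = spark.table("db.hive_orc_table")
    println(s"Input partitions: ${df.rdd.getNumPartitions}")
  }
}
{code}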



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42714) Sparksql temporary file conflict

2023-03-07 Thread hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697760#comment-17697760
 ] 

hao commented on SPARK-42714:
-

This problem causes the task to fail because its temporary files have been deleted 
by another job. The detailed error is as follows:

User class threw exception: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
at 
org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3700)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3698)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
at 
org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
at com.ly.process.SparkSQL.main(SparkSQL.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)
Caused by: java.io.FileNotFoundException: File 
/ns-tcly/com//_temporary/0/task_202303070204281920649928402071557_0031_m_001866/type=2
 does not exist.
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1058)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1118)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1115)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1125)
at 
org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:270)
at 
org.apache.hadoop.fs.viewfs.ChRootedFileSystem.listStatus(ChRootedFileSystem.java:255)
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.listStatus(ViewFileSystem.java:411)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:484)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:486)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:403)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:375)
at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:182)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:220)
... 25 more

> Sparksql temporary file conflict
> 
>
> Key: SPARK-42714
> URL: https://issues.apache.org/jira/browse/SPARK-42714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: hao
>Priority: Major
>
> When Spark SQL runs INSERT OVERWRITE, the intermediate temporary file names are 
> not unique. When multiple applications write different partitions of the same 
> partitioned table concurrently, they can delete each other's temporary files, 
> causing task failures.

[jira] [Created] (SPARK-42714) Sparksql temporary file conflict

2023-03-07 Thread hao (Jira)
hao created SPARK-42714:
---

 Summary: Sparksql temporary file conflict
 Key: SPARK-42714
 URL: https://issues.apache.org/jira/browse/SPARK-42714
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.2
Reporter: hao


When Spark SQL runs INSERT OVERWRITE, the intermediate temporary file names are not 
unique. When multiple applications write different partitions of the same partitioned 
table concurrently, they can delete each other's temporary files, resulting in task 
failures.
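
A hedged sketch of the collision scenario described above: run the same program as 
two separate applications with different dt values against the same partitioned 
table. Table and column names are illustrative. With the default committer both jobs 
stage output under a shared directory such as <table_location>/_temporary/0/..., so 
one job's commit or cleanup can remove the other's in-flight task files, producing 
the FileNotFoundException shown in the comment on this issue.
{code:java}
import org.apache.spark.sql.SparkSession

object ConcurrentInsertOverwrite {
  def main(args: Array[String]): Unit = {
    // e.g. "2023-03-06" in one application and "2023-03-07" in the other
    val dt = args(0)

    val spark = SparkSession.builder()
      .appName(s"insert-overwrite-$dt")
      .enableHiveSupport()
      .getOrCreate()

    // Two applications running this concurrently against the same table can
    // collide in the shared temporary directory, as described above.
    spark.sql(
      s"""INSERT OVERWRITE TABLE db.events PARTITION (dt = '$dt')
         |SELECT id, payload FROM db.events_staging WHERE dt = '$dt'""".stripMargin)

    spark.stop()
  }
}
{code}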





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41241) Use Hive and Spark SQL to modify table field comment, the modified results of Hive cannot be queried using Spark SQL

2023-02-01 Thread weiliang hao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weiliang hao updated SPARK-41241:
-
Description: 
-- Hive

> create table table_test(id int);

> alter table table_test change column id id int comment "hive comment";

> desc formatted table_test;
{code:java}
+---+---+---+
| col_name | data_type | comment |
+---+---+---+
| # col_name | data_type | comment |
| id | int | hive comment |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| OwnerType: | USER | NULL |
| Owner: | anonymous | NULL |
| CreateTime: | Wed Nov 23 23:06:41 CST 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://localhost:8020/warehouse/tablespace/managed/hive/table_test | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"id\":\"true\"}} |
| | bucketing_version | 2 |
| | last_modified_by | anonymous |
| | last_modified_time | 1669216665 |
| | numFiles | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | totalSize | 0 |
| | transactional | true |
| | transactional_properties | default |
| | transient_lastDdlTime | 1669216665 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
| Compressed: | No 

[jira] [Commented] (SPARK-41241) Use Hive and Spark SQL to modify table field comment, the modified results of Hive cannot be queried using Spark SQL

2022-12-04 Thread weiliang hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643141#comment-17643141
 ] 

weiliang hao commented on SPARK-41241:
--

[~xkrogen] The problem is that after Spark modifies a Hive table's column comment and 
Hive then modifies it again, Spark cannot see the latest comment. I think Spark should 
be compatible with Hive: querying with either the Spark or the Hive engine should not 
return inconsistent metadata.
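
A rough reproduction sketch of the sequence described above, assuming a Hive-backed 
table named table_test as in the issue; step 2 is run from the Hive CLI outside this 
application.
{code:java}
import org.apache.spark.sql.SparkSession

object CommentRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("comment-round-trip")
      .enableHiveSupport()
      .getOrCreate()

    // Step 1: change the column comment from Spark SQL.
    spark.sql("ALTER TABLE table_test ALTER COLUMN id COMMENT 'spark comment'")

    // Step 2 (run in the Hive CLI, not here):
    //   alter table table_test change column id id int comment "hive comment";

    // Step 3: Spark still reports the comment from step 1, which is the
    // inconsistency this issue describes.
    spark.sql("DESCRIBE TABLE table_test").show(truncate = false)
  }
}
{code}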

> Use Hive and Spark SQL to modify table field comment, the modified results of 
> Hive cannot be queried using Spark SQL
> 
>
> Key: SPARK-41241
> URL: https://issues.apache.org/jira/browse/SPARK-41241
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: weiliang hao
>Priority: Major
>
> ---HIVE---
> > create table table_test(id int);
> > alter table table_test change column id id int comment "hive comment";
> > desc formatted table_test;
> {code:java}
> +---+---+---+
> | col_name | data_type | comment |
> +---+---+---+
> | # col_name | data_type | comment |
> | id | int | hive comment |
> | | NULL | NULL |
> | # Detailed Table Information | NULL | NULL |
> | Database: | default | NULL |
> | OwnerType: | USER | NULL |
> | Owner: | anonymous | NULL |
> | CreateTime: | Wed Nov 23 23:06:41 CST 2022 | NULL |
> | LastAccessTime: | UNKNOWN | NULL |
> | Retention: | 0 | NULL |
> | Location: | hdfs://localhost:8020/warehouse/tablespace/managed/hive/table_test | NULL |
> | Table Type: | MANAGED_TABLE | NULL |
> | Table Parameters: | NULL | NULL |
> | | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"id\":\"true\"}} |
> | | bucketing_version | 2 |
> | | last_modified_by | anonymous |
> | | last_modified_time | 1669216665 |
> | | numFiles | 0 |
> | | numRows | 0 |
> | | rawDataSize | 0 |
> | | totalSize | 0 |
> | | transactional | true |
> | | transactional_properties | default 

[jira] [Commented] (SPARK-41241) Use Hive and Spark SQL to modify table field comment, the modified results of Hive cannot be queried using Spark SQL

2022-11-23 Thread weiliang hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637848#comment-17637848
 ] 

weiliang hao commented on SPARK-41241:
--

I will fix it

> Use Hive and Spark SQL to modify table field comment, the modified results of 
> Hive cannot be queried using Spark SQL
> 
>
> Key: SPARK-41241
> URL: https://issues.apache.org/jira/browse/SPARK-41241
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: weiliang hao
>Priority: Major
>
> ---HIVE---
> > create table table_test(id int);
> > alter table table_test change column id id int comment "hive comment";
> > desc formatted table_test;
> {code:java}
> +---+---+---+
> | col_name | data_type | comment |
> +---+---+---+
> | # col_name | data_type | comment |
> | id | int | hive comment |
> | | NULL | NULL |
> | # Detailed Table Information | NULL | NULL |
> | Database: | default | NULL |
> | OwnerType: | USER | NULL |
> | Owner: | anonymous | NULL |
> | CreateTime: | Wed Nov 23 23:06:41 CST 2022 | NULL |
> | LastAccessTime: | UNKNOWN | NULL |
> | Retention: | 0 | NULL |
> | Location: | hdfs://localhost:8020/warehouse/tablespace/managed/hive/table_test | NULL |
> | Table Type: | MANAGED_TABLE | NULL |
> | Table Parameters: | NULL | NULL |
> | | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"id\":\"true\"}} |
> | | bucketing_version | 2 |
> | | last_modified_by | anonymous |
> | | last_modified_time | 1669216665 |
> | | numFiles | 0 |
> | | numRows | 0 |
> | | rawDataSize | 0 |
> | | totalSize | 0 |
> | | transactional | true |
> | | transactional_properties | default |
> | | transient_lastDdlTime | 1669216665 |
> | | NULL 

[jira] [Created] (SPARK-41241) Use Hive and Spark SQL to modify table field comment, the modified results of Hive cannot be queried using Spark SQL

2022-11-23 Thread weiliang hao (Jira)
weiliang hao created SPARK-41241:


 Summary: Use Hive and Spark SQL to modify table field comment, the 
modified results of Hive cannot be queried using Spark SQL
 Key: SPARK-41241
 URL: https://issues.apache.org/jira/browse/SPARK-41241
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.0.0
Reporter: weiliang hao


---HIVE---

> create table table_test(id int);

> alter table table_test change column id id int comment "hive comment";

> desc formatted table_test;
{code:java}
+---+---+---+
| col_name | data_type | comment |
+---+---+---+
| # col_name | data_type | comment |
| id | int | hive comment |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| OwnerType: | USER | NULL |
| Owner: | anonymous | NULL |
| CreateTime: | Wed Nov 23 23:06:41 CST 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://localhost:8020/warehouse/tablespace/managed/hive/table_test | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"id\":\"true\"}} |
| | bucketing_version | 2 |
| | last_modified_by | anonymous |
| | last_modified_time | 1669216665 |
| | numFiles | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | totalSize | 0 |
| | transactional | true |
| | transactional_properties | default |
| | transient_lastDdlTime | 1669216665 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL 

[jira] [Commented] (SPARK-41160) An error is reported when submitting a task to the yarn that enabled the timeline service

2022-11-16 Thread weiliang hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17634788#comment-17634788
 ] 

weiliang hao commented on SPARK-41160:
--

I will try to fix it. Initializing the TimelineClient depends on jersey-client 1.19.
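
A commonly suggested client-side workaround for this NoClassDefFoundError (not a fix 
for the missing jersey dependency itself) is to stop the YARN client from creating a 
TimelineClient at all. A sketch, assuming the spark.hadoop.* prefix is used to forward 
the property into the Hadoop configuration; for cluster deploy mode the same property 
would normally be passed to spark-submit via --conf or spark-defaults.conf rather than 
set in code.
{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object DisableTimelineClient {
  def main(args: Array[String]): Unit = {
    // With this set to false, YarnClientImpl should skip creating the
    // TimelineClient and never touch com.sun.jersey ClientConfig.
    val conf = new SparkConf()
      .set("spark.hadoop.yarn.timeline-service.enabled", "false")

    val spark = SparkSession.builder()
      .config(conf)
      .appName("timeline-workaround")
      .getOrCreate()

    // ... job body ...
    spark.stop()
  }
}
{code}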

> An error is reported when submitting a task to the yarn that enabled the 
> timeline service
> -
>
> Key: SPARK-41160
> URL: https://issues.apache.org/jira/browse/SPARK-41160
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: weiliang hao
>Priority: Major
>
> Java 8, Hadoop 2, Spark 3.3.1
> {code:java}
> $SPARK_HOME/bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master yarn \
>   --deploy-mode cluster \
>   $SPARK_HOME/examples/jars/spark-examples*.jar 10 {code}
> The following errors were reported when submitting tasks to yarn with 
> timeline service enabled:
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
>   at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:191)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:1327)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1764)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   ... 14 more {code}
> hadoop yarn-site.xml:
> {code:java}
> ...
> <property>
>       <name>yarn.timeline-service.enabled</name>
>       <value>true</value>
> </property>
> ...{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41160) An error is reported when submitting a task to the yarn that enabled the timeline service

2022-11-16 Thread weiliang hao (Jira)
weiliang hao created SPARK-41160:


 Summary: An error is reported when submitting a task to the yarn 
that enabled the timeline service
 Key: SPARK-41160
 URL: https://issues.apache.org/jira/browse/SPARK-41160
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 3.0.0
Reporter: weiliang hao


Java 8, Hadoop 2, Spark 3.3.1
{code:java}
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  $SPARK_HOME/examples/jars/spark-examples*.jar 10 {code}
The following errors were reported when submitting tasks to yarn with timeline 
service enabled:
{code:java}
Exception in thread "main" java.lang.NoClassDefFoundError: 
com/sun/jersey/api/client/config/ClientConfig
at 
org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:191)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1327)
at 
org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1764)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: 
com.sun.jersey.api.client.config.ClientConfig
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 14 more {code}
hadoop yarn-site.xml:
{code:java}
...
<property>
      <name>yarn.timeline-service.enabled</name>
      <value>true</value>
</property>
...{code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39396) Spark Thriftserver enabled LDAP,Error using beeline connection: error code 49 - invalid credentials

2022-06-06 Thread weiliang hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550772#comment-17550772
 ] 

weiliang hao commented on SPARK-39396:
--

When a user whose DN is (cn=user, ou=people, dc=example, dc=com) logs in, 
authentication fails because the DN generated in 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl#Authenticate() is 
(uid=user, ou=people, dc=example, dc=com).
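
A rough Scala sketch of the DN mismatch described above; this is not the actual Hive 
code, just an illustration of why a cn=-based entry fails when the provider builds a 
uid=-based DN, and how a pattern such as 
hive.server2.authentication.ldap.userDNPattern=cn=%s,ou=people,dc=example,dc=com 
would map the login name onto the right DN.
{code:java}
object LdapDnSketch {
  // What the provider effectively builds today, per the comment above:
  // a uid=-based DN derived from the bare user name.
  def defaultDn(user: String, baseDn: String): String =
    s"uid=$user,$baseDn"

  // What a configured userDNPattern would produce instead.
  def patternDn(user: String, pattern: String): String =
    pattern.replace("%s", user)

  def main(args: Array[String]): Unit = {
    val baseDn = "ou=people,dc=example,dc=com"
    println(defaultDn("user", baseDn))           // uid=user,ou=people,dc=example,dc=com -> bind fails
    println(patternDn("user", s"cn=%s,$baseDn")) // cn=user,ou=people,dc=example,dc=com  -> matches the entry
  }
}
{code}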

> Spark Thriftserver enabled LDAP,Error using beeline connection: error code 49 
> - invalid credentials
> ---
>
> Key: SPARK-39396
> URL: https://issues.apache.org/jira/browse/SPARK-39396
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: weiliang hao
>Priority: Major
>
> Spark Thriftserver has LDAP enabled, and an error is reported when logging in as 
> an LDAP user through a beeline connection:
> {code:java}
> 22/06/06 17:45:29 ERROR transport.TSaslTransport: SASL negotiation failure
> javax.security.sasl.SaslException: Error validating the login [Caused by 
> javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]]]
>   at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:109)
>   at 
> org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283)
>   at 
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
>   at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: javax.security.sasl.AuthenticationException: Error validating LDAP 
> user [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]]
>   at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:77)
>   at 
> org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:106)
>   at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
>   ... 8 more
> Caused by: javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]
>   at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3154)
>   at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3100)
>   at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2886)
>   at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2800)
>   at com.sun.jndi.ldap.LdapCtx.<init>(LdapCtx.java:319)
>   at com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:192)
>   at 
> com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:210)
>   at 
> com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:153)
>   at 
> com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:83)
>   at 
> javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:684)
>   at 
> javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
>   at javax.naming.InitialContext.init(InitialContext.java:244)
>   at javax.naming.InitialContext.<init>(InitialContext.java:216)
>   at 
> javax.naming.directory.InitialDirContext.<init>(InitialDirContext.java:101)
>   at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:74)
>   ... 10 more
> 22/06/06 17:45:29 ERROR server.TThreadPoolServer: Error occurred during 
> processing of message.
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: 
> Error validating the login
>   at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.thrift.transport.TTransportException: Error validating 
> the login
>   at 
> 

[jira] [Commented] (SPARK-39396) Spark Thriftserver enabled LDAP,Error using beeline connection: error code 49 - invalid credentials

2022-06-06 Thread weiliang hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550751#comment-17550751
 ] 

weiliang hao commented on SPARK-39396:
--

I will try to solve this problem

> Spark Thriftserver enabled LDAP,Error using beeline connection: error code 49 
> - invalid credentials
> ---
>
> Key: SPARK-39396
> URL: https://issues.apache.org/jira/browse/SPARK-39396
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: weiliang hao
>Priority: Major
>
> Spark Thriftserver has LDAP enabled, and an error is reported when logging in as 
> an LDAP user through a beeline connection:
> {code:java}
> 22/06/06 17:45:29 ERROR transport.TSaslTransport: SASL negotiation failure
> javax.security.sasl.SaslException: Error validating the login [Caused by 
> javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]]]
>   at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:109)
>   at 
> org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283)
>   at 
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
>   at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: javax.security.sasl.AuthenticationException: Error validating LDAP 
> user [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]]
>   at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:77)
>   at 
> org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:106)
>   at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
>   ... 8 more
> Caused by: javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]
>   at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3154)
>   at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3100)
>   at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2886)
>   at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2800)
>   at com.sun.jndi.ldap.LdapCtx.<init>(LdapCtx.java:319)
>   at com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:192)
>   at 
> com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:210)
>   at 
> com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:153)
>   at 
> com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:83)
>   at 
> javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:684)
>   at 
> javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
>   at javax.naming.InitialContext.init(InitialContext.java:244)
>   at javax.naming.InitialContext.<init>(InitialContext.java:216)
>   at 
> javax.naming.directory.InitialDirContext.<init>(InitialDirContext.java:101)
>   at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:74)
>   ... 10 more
> 22/06/06 17:45:29 ERROR server.TThreadPoolServer: Error occurred during 
> processing of message.
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: 
> Error validating the login
>   at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.thrift.transport.TTransportException: Error validating 
> the login
>   at 
> org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)
>   at 
> 

[jira] [Created] (SPARK-39396) Spark Thriftserver enabled LDAP,Error using beeline connection: error code 49 - invalid credentials

2022-06-06 Thread weiliang hao (Jira)
weiliang hao created SPARK-39396:


 Summary: Spark Thriftserver enabled LDAP,Error using beeline 
connection: error code 49 - invalid credentials
 Key: SPARK-39396
 URL: https://issues.apache.org/jira/browse/SPARK-39396
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.8
Reporter: weiliang hao


Spark Thriftserver has LDAP enabled, and an error is reported when logging in as an 
LDAP user through a beeline connection:
{code:java}
22/06/06 17:45:29 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: Error validating the login [Caused by 
javax.security.sasl.AuthenticationException: Error validating LDAP user [Caused 
by javax.naming.AuthenticationException: [LDAP: error code 49 - Invalid 
Credentials]]]
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:109)
at 
org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283)
at 
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.security.sasl.AuthenticationException: Error validating LDAP 
user [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
Invalid Credentials]]
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:77)
at 
org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:106)
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
... 8 more
Caused by: javax.naming.AuthenticationException: [LDAP: error code 49 - Invalid 
Credentials]
at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3154)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3100)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2886)
at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2800)
at com.sun.jndi.ldap.LdapCtx.<init>(LdapCtx.java:319)
at com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:192)
at 
com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:210)
at 
com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:153)
at 
com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:83)
at 
javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:684)
at 
javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
at javax.naming.InitialContext.init(InitialContext.java:244)
at javax.naming.InitialContext.<init>(InitialContext.java:216)
at 
javax.naming.directory.InitialDirContext.<init>(InitialDirContext.java:101)
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:74)
... 10 more
22/06/06 17:45:29 ERROR server.TThreadPoolServer: Error occurred during 
processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: 
Error validating the login
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.thrift.transport.TTransportException: Error validating 
the login
at 
org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)
at 
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
... 4 more {code}
hive-site.xml:
{code:java}
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://metastore_uri:9083</value>
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
    </property>

    <property>

[jira] [Created] (SPARK-39395) Spark Thriftserver enabled LDAP,Error using beeline connection: error code 49 - invalid credentials

2022-06-06 Thread weiliang hao (Jira)
weiliang hao created SPARK-39395:


 Summary: Spark Thriftserver enabled LDAP,Error using beeline 
connection: error code 49 - invalid credentials
 Key: SPARK-39395
 URL: https://issues.apache.org/jira/browse/SPARK-39395
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.8
Reporter: weiliang hao


Spark Thriftserver has LDAP enabled, and an error is reported when logging in as an 
LDAP user through a beeline connection:
{code:java}
22/06/06 17:45:29 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: Error validating the login [Caused by 
javax.security.sasl.AuthenticationException: Error validating LDAP user [Caused 
by javax.naming.AuthenticationException: [LDAP: error code 49 - Invalid 
Credentials]]]
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:109)
at 
org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283)
at 
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.security.sasl.AuthenticationException: Error validating LDAP 
user [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
Invalid Credentials]]
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:77)
at 
org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:106)
at 
org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
... 8 more
Caused by: javax.naming.AuthenticationException: [LDAP: error code 49 - Invalid 
Credentials]
at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3154)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3100)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2886)
at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2800)
at com.sun.jndi.ldap.LdapCtx.<init>(LdapCtx.java:319)
at com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:192)
at 
com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:210)
at 
com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:153)
at 
com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:83)
at 
javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:684)
at 
javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
at javax.naming.InitialContext.init(InitialContext.java:244)
at javax.naming.InitialContext.<init>(InitialContext.java:216)
at 
javax.naming.directory.InitialDirContext.<init>(InitialDirContext.java:101)
at 
org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:74)
... 10 more
22/06/06 17:45:29 ERROR server.TThreadPoolServer: Error occurred during 
processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: 
Error validating the login
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.thrift.transport.TTransportException: Error validating 
the login
at 
org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)
at 
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
... 4 more {code}
hive-site.xml:
{code:java}
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://metastore_uri:9083</value>
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
    </property>

    <property>

[jira] [Commented] (SPARK-32570) Thriftserver LDAP failed

2022-06-03 Thread weiliang hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17547810#comment-17547810
 ] 

weiliang hao commented on SPARK-32570:
--

cc [~dongjoon]  [~hyukjin.kwon] 

> Thriftserver LDAP failed
> 
>
> Key: SPARK-32570
> URL: https://issues.apache.org/jira/browse/SPARK-32570
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Jie Zhang
>Priority: Major
>
> I downloaded spark-2.4.6-bin-hadoop2.7.tgz, added a new file to 
> conf/hive-site.xml, put the following parameters into it, ran 
> sbin/start-thriftserver.sh, then bin/beeline worked, able to query tables in 
> our hive-metastore. 
> {code:java}
> <property>
> <name>hive.metastore.uris</name>
> <value>thrift://hive-metastore-service.company.com:9083</value>
> </property>
> <property>
> <name>hive.metastore.schema.verification</name>
> <value>false</value>
> </property>
> <property>
> <name>javax.jdo.option.ConnectionURL</name>
> <value>jdbc:mysql://hive-metastore-db.company.com:3306/hive?createDatabaseIfNotExist=false</value>
> </property>
> <property>
> <name>javax.jdo.option.ConnectionDriverName</name>
> <value>org.mariadb.jdbc.Driver</value>
> </property>
> <property>
> <name>javax.jdo.option.ConnectionUserName</name>
> <value>x</value>
> </property>
> <property>
> <name>javax.jdo.option.ConnectionPassword</name>
> <value>x</value>
> </property>
> <property>
> <name>hive.metastore.connect.retries</name>
> <value>15</value>
> </property>
> {code}
> In order to enable LDAP, I added these parameters into conf/hive-site.xml, 
> stopped and started thriftserver, then bin/beeline complained invalid 
> credentials.
> I know my credentials works because I enabled LDAP on Hive-Server2 and it 
> worked. 
> {code:java}
> <property>
> <name>hive.server2.authentication</name>
> <value>LDAP</value>
> </property>
> <property>
> <name>hive.server2.authentication.ldap.url</name>
> <value>ldaps://ldap-server.company.com:636</value>
> </property>
> <property>
> <name>hive.server2.authentication.ldap.baseDN</name>
> <value>ou=People,dc=company,dc=com</value>
> </property>
> <property>
> <name>hive.server2.authentication.ldap.userDNPattern</name>
> <value>cn=%s,ou=People,dc=company,dc=com</value>
> </property>
> {code}
> The error message:
> {code:java}
> 20/08/07 21:05:39 ERROR TSaslTransport: SASL negotiation failure20/08/07 
> 21:05:39 ERROR TSaslTransport: SASL negotiation 
> failurejavax.security.sasl.SaslException: Error validating the login [Caused 
> by javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]]] at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:109)
>  at 
> org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
>  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283) 
> at 
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
>  at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
>  at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)Caused by: 
> javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]] at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:77)
>  at 
> org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:106)
>  at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
>  ... 8 more
> {code}
> Anything else I need to do in order to enable LDAP on Spark Thriftserver? 
> Thanks for your help. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32570) Thriftserver LDAP failed

2022-05-31 Thread weiliang hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544209#comment-17544209
 ] 

weiliang hao commented on SPARK-32570:
--

Can this JIRA be assigned to me? I will try to solve this problem. Thanks

> Thriftserver LDAP failed
> 
>
> Key: SPARK-32570
> URL: https://issues.apache.org/jira/browse/SPARK-32570
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Jie Zhang
>Priority: Major
>
> I downloaded spark-2.4.6-bin-hadoop2.7.tgz, added a new file to 
> conf/hive-site.xml, put the following parameters into it, ran 
> sbin/start-thriftserver.sh, then bin/beeline worked, able to query tables in 
> our hive-metastore. 
> {code:java}
> <property>
> <name>hive.metastore.uris</name>
> <value>thrift://hive-metastore-service.company.com:9083</value>
> </property>
> <property>
> <name>hive.metastore.schema.verification</name>
> <value>false</value>
> </property>
> <property>
> <name>javax.jdo.option.ConnectionURL</name>
> <value>jdbc:mysql://hive-metastore-db.company.com:3306/hive?createDatabaseIfNotExist=false</value>
> </property>
> <property>
> <name>javax.jdo.option.ConnectionDriverName</name>
> <value>org.mariadb.jdbc.Driver</value>
> </property>
> <property>
> <name>javax.jdo.option.ConnectionUserName</name>
> <value>x</value>
> </property>
> <property>
> <name>javax.jdo.option.ConnectionPassword</name>
> <value>x</value>
> </property>
> <property>
> <name>hive.metastore.connect.retries</name>
> <value>15</value>
> </property>
> {code}
> In order to enable LDAP, I added these parameters into conf/hive-site.xml, 
> stopped and started thriftserver, then bin/beeline complained invalid 
> credentials.
> I know my credentials work because I enabled LDAP on Hive-Server2 and it 
> worked. 
> {code:java}
> <property>
>   <name>hive.server2.authentication</name>
>   <value>LDAP</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.url</name>
>   <value>ldaps://ldap-server.company.com:636</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.baseDN</name>
>   <value>ou=People,dc=company,dc=com</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.userDNPattern</name>
>   <value>cn=%s,ou=People,dc=company,dc=com</value>
> </property>
> {code}
> The error message:
> {code:java}
> 20/08/07 21:05:39 ERROR TSaslTransport: SASL negotiation failure20/08/07 
> 21:05:39 ERROR TSaslTransport: SASL negotiation 
> failurejavax.security.sasl.SaslException: Error validating the login [Caused 
> by javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]]] at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:109)
>  at 
> org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
>  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283) 
> at 
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
>  at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
>  at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)Caused by: 
> javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]] at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:77)
>  at 
> org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:106)
>  at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
>  ... 8 more
> {code}
> Anything else I need to do in order to enable LDAP on Spark Thriftserver? 
> Thanks for your help. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-32570) Thriftserver LDAP failed

2022-05-31 Thread weiliang hao (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32570 ]


weiliang hao deleted comment on SPARK-32570:
--

was (Author: haoweiliang):
Can this JIRA be assigned to me? I will try to solve this problem

> Thriftserver LDAP failed
> 
>
> Key: SPARK-32570
> URL: https://issues.apache.org/jira/browse/SPARK-32570
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Jie Zhang
>Priority: Major
>
> I downloaded spark-2.4.6-bin-hadoop2.7.tgz, added a new file to 
> conf/hive-site.xml, put the following parameters into it, ran 
> sbin/start-thriftserver.sh, then bin/beeline worked, able to query tables in 
> our hive-metastore. 
> {code:java}
> <property>
>   <name>hive.metastore.uris</name>
>   <value>thrift://hive-metastore-service.company.com:9083</value>
> </property>
> <property>
>   <name>hive.metastore.schema.verification</name>
>   <value>false</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionURL</name>
>   <value>jdbc:mysql://hive-metastore-db.company.com:3306/hive?createDatabaseIfNotExist=false</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionDriverName</name>
>   <value>org.mariadb.jdbc.Driver</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionUserName</name>
>   <value>x</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionPassword</name>
>   <value>x</value>
> </property>
> <property>
>   <name>hive.metastore.connect.retries</name>
>   <value>15</value>
> </property>
> {code}
> In order to enable LDAP, I added these parameters into conf/hive-site.xml, 
> stopped and started thriftserver, then bin/beeline complained invalid 
> credentials.
> I know my credentials work because I enabled LDAP on Hive-Server2 and it 
> worked. 
> {code:java}
> <property>
>   <name>hive.server2.authentication</name>
>   <value>LDAP</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.url</name>
>   <value>ldaps://ldap-server.company.com:636</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.baseDN</name>
>   <value>ou=People,dc=company,dc=com</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.userDNPattern</name>
>   <value>cn=%s,ou=People,dc=company,dc=com</value>
> </property>
> {code}
> The error message:
> {code:java}
> 20/08/07 21:05:39 ERROR TSaslTransport: SASL negotiation failure20/08/07 
> 21:05:39 ERROR TSaslTransport: SASL negotiation 
> failurejavax.security.sasl.SaslException: Error validating the login [Caused 
> by javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]]] at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:109)
>  at 
> org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
>  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283) 
> at 
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
>  at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
>  at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)Caused by: 
> javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]] at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:77)
>  at 
> org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:106)
>  at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
>  ... 8 more
> {code}
> Anything else I need to do in order to enable LDAP on Spark Thriftserver? 
> Thanks for your help. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32570) Thriftserver LDAP failed

2022-05-31 Thread weiliang hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544208#comment-17544208
 ] 

weiliang hao commented on SPARK-32570:
--

Can this JIRA be assigned to me? I will try to solve this problem.

> Thriftserver LDAP failed
> 
>
> Key: SPARK-32570
> URL: https://issues.apache.org/jira/browse/SPARK-32570
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Jie Zhang
>Priority: Major
>
> I downloaded spark-2.4.6-bin-hadoop2.7.tgz, added a new file to 
> conf/hive-site.xml, put the following parameters into it, ran 
> sbin/start-thriftserver.sh, then bin/beeline worked, able to query tables in 
> our hive-metastore. 
> {code:java}
> <property>
>   <name>hive.metastore.uris</name>
>   <value>thrift://hive-metastore-service.company.com:9083</value>
> </property>
> <property>
>   <name>hive.metastore.schema.verification</name>
>   <value>false</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionURL</name>
>   <value>jdbc:mysql://hive-metastore-db.company.com:3306/hive?createDatabaseIfNotExist=false</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionDriverName</name>
>   <value>org.mariadb.jdbc.Driver</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionUserName</name>
>   <value>x</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionPassword</name>
>   <value>x</value>
> </property>
> <property>
>   <name>hive.metastore.connect.retries</name>
>   <value>15</value>
> </property>
> {code}
> In order to enable LDAP, I added these parameters into conf/hive-site.xml, 
> stopped and started thriftserver, then bin/beeline complained invalid 
> credentials.
> I know my credentials work because I enabled LDAP on Hive-Server2 and it 
> worked. 
> {code:java}
> <property>
>   <name>hive.server2.authentication</name>
>   <value>LDAP</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.url</name>
>   <value>ldaps://ldap-server.company.com:636</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.baseDN</name>
>   <value>ou=People,dc=company,dc=com</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.userDNPattern</name>
>   <value>cn=%s,ou=People,dc=company,dc=com</value>
> </property>
> {code}
> The error message:
> {code:java}
> 20/08/07 21:05:39 ERROR TSaslTransport: SASL negotiation failure20/08/07 
> 21:05:39 ERROR TSaslTransport: SASL negotiation 
> failurejavax.security.sasl.SaslException: Error validating the login [Caused 
> by javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]]] at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:109)
>  at 
> org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
>  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283) 
> at 
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
>  at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
>  at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)Caused by: 
> javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]] at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:77)
>  at 
> org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:106)
>  at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
>  ... 8 more
> {code}
> Anything else I need to do in order to enable LDAP on Spark Thriftserver? 
> Thanks for your help. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37274) When the value of this parameter is greater than the maximum value of int type, the value will be thrown out of bounds. The document description of this parameter should

2021-11-10 Thread hao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao updated SPARK-37274:

Summary: When the value of this parameter is greater than the maximum value of 
the int type, an out-of-bounds error will be thrown. The documentation for this 
parameter should remind the user of this risk  (was: These 
parameters should be of type long, not int)

> When the value of this parameter is greater than the maximum value of the int 
> type, an out-of-bounds error will be thrown. The documentation for this 
> parameter should remind the user of this risk
> 
>
> Key: SPARK-37274
> URL: https://issues.apache.org/jira/browse/SPARK-37274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: hao
>Priority: Major
>
> These parameters [spark.sql.orc.columnarReaderBatchSize], 
> [spark.sql.inMemoryColumnarStorage.batchSize], 
> [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of 
> type int. When the user sets the value to be greater than the maximum value 
> of type int, an error will be thrown.
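For reference, a minimal sketch of the limitation described above (assumptions: 
a local SparkSession, and that the concrete exception type may vary by Spark 
version); the point is that these entries are parsed as Java ints, so values 
above Int.MaxValue (2147483647) are rejected rather than treated as longs.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# A value within the int range is accepted.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 8192)

# A value above Int.MaxValue cannot be represented as an int, so setting it
# fails instead of silently becoming a long.
try:
    spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 3_000_000_000)
except Exception as e:  # the exact exception type depends on the Spark version
    print("rejected:", e)
{code}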



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37274) When the value of this parameter is greater than the maximum value of int type, the value will be thrown out of bounds. The document description of this parameter should

2021-11-10 Thread hao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao updated SPARK-37274:

Description: When the value of this parameter is greater than the maximum value 
of the int type, an out-of-bounds error will be thrown. The documentation for 
this parameter should remind the user of this risk  (was: 
These parameters [spark.sql.orc.columnarReaderBatchSize], 
[spark.sql.inMemoryColumnarStorage.batchSize], 
[spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of type 
int. when the user sets the value to be greater than the maximum value of type 
int, an error will be thrown)

> When the value of this parameter is greater than the maximum value of the int 
> type, an out-of-bounds error will be thrown. The documentation for this 
> parameter should remind the user of this risk
> 
>
> Key: SPARK-37274
> URL: https://issues.apache.org/jira/browse/SPARK-37274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: hao
>Priority: Major
>
> When the value of this parameter is greater than the maximum value of the int 
> type, an out-of-bounds error will be thrown. The documentation for this 
> parameter should remind the user of this risk



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37274) These parameters should be of type long, not int

2021-11-10 Thread hao (Jira)
hao created SPARK-37274:
---

 Summary: These parameters should be of type long, not int
 Key: SPARK-37274
 URL: https://issues.apache.org/jira/browse/SPARK-37274
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: hao


These parameters [spark.sql.orc.columnarReaderBatchSize], 
[spark.sql.inMemoryColumnarStorage.batchSize], 
[spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of type 
int. When the user sets the value to be greater than the maximum value of type 
int, an error will be thrown.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure

2021-02-08 Thread hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281560#comment-17281560
 ] 

hao commented on SPARK-34406:
-

Yes, sir. I'm using YARN cluster mode. What I mean here is that when the Spark 
client submits a Spark Core job to the remote YARN cluster, the submission runs 
as a separate process, and this approach consumes a lot of resources.
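For illustration only (not something discussed in the thread), one way to bound 
that pressure on the submitting node is to cap the number of concurrent 
spark-submit child processes; the script name and arguments below are 
placeholders.

{code:python}
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder application and arguments; replace with the real job.
jobs = [["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
         "my_job.py", str(i)] for i in range(100)]

def submit(cmd):
    # Each spark-submit still starts a JVM, but at most 4 run at once on the
    # submitting node, which bounds its memory and CPU usage.
    return subprocess.run(cmd, capture_output=True, text=True).returncode

with ThreadPoolExecutor(max_workers=4) as pool:
    exit_codes = list(pool.map(submit, jobs))
print(exit_codes)
{code}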

> When we submit spark core tasks frequently, the submitted nodes will have a 
> lot of resource pressure
> 
>
> Key: SPARK-34406
> URL: https://issues.apache.org/jira/browse/SPARK-34406
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When we submit Spark Core jobs frequently, the submitting node comes under 
> heavy resource pressure, because Spark creates a process rather than a thread 
> for each submitted job, and that consumes a lot of resources. When the 
> submission QPS is very high, submissions fail due to insufficient resources. 
> I would like to ask how to reduce the amount of resources consumed by Spark 
> Core job submission.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34406) When we submit spark core tasks frequently, the submitted nodes will have a lot of resource pressure

2021-02-08 Thread hao (Jira)
hao created SPARK-34406:
---

 Summary: When we submit spark core tasks frequently, the submitted 
nodes will have a lot of resource pressure
 Key: SPARK-34406
 URL: https://issues.apache.org/jira/browse/SPARK-34406
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: hao


When we submit Spark Core jobs frequently, the submitting node comes under heavy 
resource pressure, because Spark creates a process rather than a thread for each 
submitted job, and that consumes a lot of resources. When the submission QPS is 
very high, submissions fail due to insufficient resources. I would like to ask 
how to reduce the amount of resources consumed by Spark Core job submission.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33982) Sparksql does not support when the inserted table is a read table

2021-01-05 Thread hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259391#comment-17259391
 ] 

hao commented on SPARK-33982:
-

[~hyukjin.kwon] Oh, you can read Chinese! Yes, this is the same issue >.<

> Sparksql does not support when the inserted table is a read table
> -
>
> Key: SPARK-33982
> URL: https://issues.apache.org/jira/browse/SPARK-33982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When the table being inserted into is also being read in the same query, Spark 
> SQL throws an error: org.apache.spark.sql.AnalysisException: Cannot overwrite 
> a path that is also being read from.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34006) [spark.sql.hive.convertMetastoreOrc]This parameter can solve orc format table insert overwrite read table, it should be stated in the document

2021-01-04 Thread hao (Jira)
hao created SPARK-34006:
---

 Summary: [spark.sql.hive.convertMetastoreOrc]This parameter can 
solve orc format table insert overwrite read table, it should be stated in the 
document
 Key: SPARK-34006
 URL: https://issues.apache.org/jira/browse/SPARK-34006
 Project: Spark
  Issue Type: Bug
  Components: docs
Affects Versions: 3.0.1
Reporter: hao


This parameter allows an ORC-format table to be the target of an INSERT 
OVERWRITE that also reads from the same table; this should be stated in the 
documentation.
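A minimal sketch of the scenario (assuming, as the report implies, that the 
relevant setting is spark.sql.hive.convertMetastoreOrc=false, so the write goes 
through the Hive SerDe path rather than the native ORC datasource path; the 
table name is a placeholder):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()
         # Assumption based on the report: with the conversion disabled, an
         # INSERT OVERWRITE that also reads from the target ORC table no longer
         # fails with "Cannot overwrite a path that is also being read from".
         .config("spark.sql.hive.convertMetastoreOrc", "false")
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS demo_orc (id INT) STORED AS ORC")
spark.sql("INSERT OVERWRITE TABLE demo_orc SELECT id + 1 FROM demo_orc")
{code}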



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33982) Sparksql does not support when the inserted table is a read table

2021-01-04 Thread hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258052#comment-17258052
 ] 

hao edited comment on SPARK-33982 at 1/4/21, 11:19 AM:
---

I think Spark SQL should support INSERT OVERWRITE into a table that is being read.


was (Author: hao.duan):
I think Spark SQL should support this.

> Sparksql does not support when the inserted table is a read table
> -
>
> Key: SPARK-33982
> URL: https://issues.apache.org/jira/browse/SPARK-33982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When the table being inserted into is also being read in the same query, Spark 
> SQL throws an error: org.apache.spark.sql.AnalysisException: Cannot overwrite 
> a path that is also being read from.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33982) Sparksql does not support when the inserted table is a read table

2021-01-04 Thread hao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258052#comment-17258052
 ] 

hao commented on SPARK-33982:
-

I think Spark SQL should support this.

> Sparksql does not support when the inserted table is a read table
> -
>
> Key: SPARK-33982
> URL: https://issues.apache.org/jira/browse/SPARK-33982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: hao
>Priority: Major
>
> When the table being inserted into is also being read in the same query, Spark 
> SQL throws an error: org.apache.spark.sql.AnalysisException: Cannot overwrite 
> a path that is also being read from.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33982) Sparksql does not support when the inserted table is a read table

2021-01-04 Thread hao (Jira)
hao created SPARK-33982:
---

 Summary: Sparksql does not support when the inserted table is a 
read table
 Key: SPARK-33982
 URL: https://issues.apache.org/jira/browse/SPARK-33982
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: hao


When the table being inserted into is also being read in the same query, Spark 
SQL throws an error: org.apache.spark.sql.AnalysisException: Cannot overwrite a 
path that is also being read from.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-08-31 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451914#comment-15451914
 ] 

Cheng Hao commented on SPARK-17299:
---

Or come after SPARK-14878 ?

> TRIM/LTRIM/RTRIM strips characters other than spaces
> 
>
> Key: SPARK-17299
> URL: https://issues.apache.org/jira/browse/SPARK-17299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> TRIM/LTRIM/RTRIM docs state that they only strip spaces:
> http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)
> But the implementation strips all characters of ASCII value 20 or less:
> https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-08-31 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451810#comment-15451810
 ] 

Cheng Hao commented on SPARK-17299:
---

Yes, that's my bad; I thought it should have the same behavior as 
`String.trim()`. We should fix this bug. [~jbeard], can you please fix it?
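For reference, a small PySpark check of the behavior under discussion (output 
depends on the Spark version; per the linked UTF8String code, 2.0.0 strips every 
leading and trailing character with code point 0x20 or below, not only the space 
character):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([("\t hello \n",)], ["s"])

# With the 2.0.0 implementation the tab and newline are removed along with the
# spaces, so the trimmed length is 5 ("hello"), even though the docs only
# mention spaces.
df.select(F.length(F.trim(F.col("s"))).alias("trimmed_len")).show()
{code}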

> TRIM/LTRIM/RTRIM strips characters other than spaces
> 
>
> Key: SPARK-17299
> URL: https://issues.apache.org/jira/browse/SPARK-17299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> TRIM/LTRIM/RTRIM docs state that they only strip spaces:
> http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)
> But the implementation strips all characters of ASCII value 20 or less:
> https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405686#comment-15405686
 ] 

GUAN Hao commented on SPARK-16869:
--

I've updated the outer join part (which was missing) in the code snippet in 
the original post.

{code}
result = b.join(c, (b.i == c.i)
  & (b.j == c.j)
  & (b.k == c.k), 'outer') \
.select(
b.i.alias('b_i'),
c.i.alias('c_i'),
functions.coalesce(b.i, c.i).alias('i'),
functions.coalesce(b.j, c.j).alias('j'),
functions.coalesce(b.k, c.k).alias('k'),
b.p,
c.q,
)

result.explain()
result.show()
{code}

> Wrong projection when join on columns with the same name which are derived 
> from the same dataframe
> --
>
> Key: SPARK-16869
> URL: https://issues.apache.org/jira/browse/SPARK-16869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: GUAN Hao
>
> I have two DataFrames, both containing a column named *i*, which are derived 
> from the same DataFrame (join).
> {code}
> b
> +---+---+---+---+
> |  j|  p|  i|  k|
> +---+---+---+---+
> |  3|  2|  3|  3|
> |  2|  1|  2|  2|
> +---+---+---+---+
> c
> +---+---+---+---+
> |  j|  k|  q|  i|
> +---+---+---+---+
> |  1|  1|  0|  1|
> |  2|  2|  1|  2|
> +---+---+---+---+
> {code}
> The result of OUTER join of two DataFrames above is:
> {code}
> i = colaesce(b.i, c.i)
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|   1|   1|  1|  1|null|   0|
> |   3|null|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> However, what I got is:
> {code}
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|null|null|  1|  1|null|   0|
> |   3|   3|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> {code}
> == Physical Plan ==
> *Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
> coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, 
> q#23L]
> +- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter
> 
> {code}
> As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
> {{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
> in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.
> Complete code to re-produce:
> {code}
> from pyspark import SparkContext, SQLContext
> from pyspark.sql import Row, functions
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> data_a = sc.parallelize([
> Row(i=1, j=1, k=1),
> Row(i=2, j=2, k=2),
> Row(i=3, j=3, k=3),
> ])
> table_a = sqlContext.createDataFrame(data_a)
> table_a.show()
> data_b = sc.parallelize([
> Row(j=2, p=1),
> Row(j=3, p=2),
> ])
> table_b = sqlContext.createDataFrame(data_b)
> table_b.show()
> data_c = sc.parallelize([
> Row(j=1, k=1, q=0),
> Row(j=2, k=2, q=1),
> ])
> table_c = sqlContext.createDataFrame(data_c)
> table_c.show()
> b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)
> c = table_c.join(table_a, (table_c.j == table_a.j)
>   & (table_c.k == table_a.k)) \
> .drop(table_a.j) \
> .drop(table_a.k)
> b.show()
> c.show()
> result = b.join(c, (b.i == c.i)
>   & (b.j == c.j)
>   & (b.k == c.k), 'outer') \
> .select(
> b.i.alias('b_i'),
> c.i.alias('c_i'),
> functions.coalesce(b.i, c.i).alias('i'),
> functions.coalesce(b.j, c.j).alias('j'),
> functions.coalesce(b.k, c.k).alias('k'),
> b.p,
> c.q,
> )
> result.explain()
> result.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405669#comment-15405669
 ] 

GUAN Hao edited comment on SPARK-16869 at 8/3/16 10:17 AM:
---

{{i = colaesce(b.i, c.i)}} is actually just a comment.


Oh sorry, some lines were lost after pasting. I've updated the original post.


was (Author: raptium):
{{i = colaesce(b.i, c.i)}} is actually just a comment.


Oh sorry, some line are lost after paste.
{code}
result = b.join(c, (b.i == c.i)
  & (b.j == c.j)
  & (b.k == c.k), 'outer') \
.select(
b.i.alias('b_i'),
c.i.alias('c_i'),
functions.coalesce(b.i, c.i).alias('i'),
functions.coalesce(b.j, c.j).alias('j'),
functions.coalesce(b.k, c.k).alias('k'),
b.p,
c.q,
)

result.explain()
result.show()
{code}

> Wrong projection when join on columns with the same name which are derived 
> from the same dataframe
> --
>
> Key: SPARK-16869
> URL: https://issues.apache.org/jira/browse/SPARK-16869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: GUAN Hao
>
> I have two DataFrames, both containing a column named *i*, which are derived 
> from the same DataFrame (join).
> {code}
> b
> +---+---+---+---+
> |  j|  p|  i|  k|
> +---+---+---+---+
> |  3|  2|  3|  3|
> |  2|  1|  2|  2|
> +---+---+---+---+
> c
> +---+---+---+---+
> |  j|  k|  q|  i|
> +---+---+---+---+
> |  1|  1|  0|  1|
> |  2|  2|  1|  2|
> +---+---+---+---+
> {code}
> The result of OUTER join of two DataFrames above is:
> {code}
> i = colaesce(b.i, c.i)
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|   1|   1|  1|  1|null|   0|
> |   3|null|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> However, what I got is:
> {code}
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|null|null|  1|  1|null|   0|
> |   3|   3|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> {code}
> == Physical Plan ==
> *Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
> coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, 
> q#23L]
> +- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter
> 
> {code}
> As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
> {{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
> in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.
> Complete code to re-produce:
> {code}
> from pyspark import SparkContext, SQLContext
> from pyspark.sql import Row, functions
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> data_a = sc.parallelize([
> Row(i=1, j=1, k=1),
> Row(i=2, j=2, k=2),
> Row(i=3, j=3, k=3),
> ])
> table_a = sqlContext.createDataFrame(data_a)
> table_a.show()
> data_b = sc.parallelize([
> Row(j=2, p=1),
> Row(j=3, p=2),
> ])
> table_b = sqlContext.createDataFrame(data_b)
> table_b.show()
> data_c = sc.parallelize([
> Row(j=1, k=1, q=0),
> Row(j=2, k=2, q=1),
> ])
> table_c = sqlContext.createDataFrame(data_c)
> table_c.show()
> b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)
> c = table_c.join(table_a, (table_c.j == table_a.j)
>   & (table_c.k == table_a.k)) \
> .drop(table_a.j) \
> .drop(table_a.k)
> b.show()
> c.show()
> result = b.join(c, (b.i == c.i)
>   & (b.j == c.j)
>   & (b.k == c.k), 'outer') \
> .select(
> b.i.alias('b_i'),
> c.i.alias('c_i'),
> functions.coalesce(b.i, c.i).alias('i'),
> functions.coalesce(b.j, c.j).alias('j'),
> functions.coalesce(b.k, c.k).alias('k'),
> b.p,
> c.q,
> )
> result.explain()
> result.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

GUAN Hao updated SPARK-16869:
-
Description: 
I have two DataFrames, both containing a column named *i*, which are derived 
from the same DataFrame (join).

{code}
b
+---+---+---+---+
|  j|  p|  i|  k|
+---+---+---+---+
|  3|  2|  3|  3|
|  2|  1|  2|  2|
+---+---+---+---+

c
+---+---+---+---+
|  j|  k|  q|  i|
+---+---+---+---+
|  1|  1|  0|  1|
|  2|  2|  1|  2|
+---+---+---+---+
{code}

The result of OUTER join of two DataFrames above is:

{code}
i = colaesce(b.i, c.i)
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|   1|   1|  1|  1|null|   0|
|   3|null|   3|  3|  3|   2|null|
++++---+---+++
{code}

However, what I got is:

{code}
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|null|null|  1|  1|null|   0|
|   3|   3|   3|  3|  3|   2|null|
++++---+---+++
{code}

{code}
== Physical Plan ==
*Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, q#23L]
+- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter

{code}

As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
{{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.

Complete code to re-produce:

{code}
from pyspark import SparkContext, SQLContext

from pyspark.sql import Row, functions

sc = SparkContext()
sqlContext = SQLContext(sc)

data_a = sc.parallelize([
Row(i=1, j=1, k=1),
Row(i=2, j=2, k=2),
Row(i=3, j=3, k=3),
])
table_a = sqlContext.createDataFrame(data_a)
table_a.show()

data_b = sc.parallelize([
Row(j=2, p=1),
Row(j=3, p=2),
])
table_b = sqlContext.createDataFrame(data_b)
table_b.show()

data_c = sc.parallelize([
Row(j=1, k=1, q=0),
Row(j=2, k=2, q=1),
])
table_c = sqlContext.createDataFrame(data_c)
table_c.show()

b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)

c = table_c.join(table_a, (table_c.j == table_a.j)
  & (table_c.k == table_a.k)) \
.drop(table_a.j) \
.drop(table_a.k)


b.show()
c.show()

result = b.join(c, (b.i == c.i)
  & (b.j == c.j)
  & (b.k == c.k), 'outer') \
.select(
b.i.alias('b_i'),
c.i.alias('c_i'),
functions.coalesce(b.i, c.i).alias('i'),
functions.coalesce(b.j, c.j).alias('j'),
functions.coalesce(b.k, c.k).alias('k'),
b.p,
c.q,
)

result.explain()
result.show()
{code}


  was:
I have to DataFrames, both contain a column named *i* which are derived from a 
same DataFrame (join).

{code}
b
+---+---+---+---+
|  j|  p|  i|  k|
+---+---+---+---+
|  3|  2|  3|  3|
|  2|  1|  2|  2|
+---+---+---+---+

c
+---+---+---+---+
|  j|  k|  q|  i|
+---+---+---+---+
|  1|  1|  0|  1|
|  2|  2|  1|  2|
+---+---+---+---+
{code}

The result of OUTER join of two DataFrames above is:

{code}
i = colaesce(b.i, c.i)
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|   1|   1|  1|  1|null|   0|
|   3|null|   3|  3|  3|   2|null|
++++---+---+++
{code}

However, what I got is:

{code}
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|null|null|  1|  1|null|   0|
|   3|   3|   3|  3|  3|   2|null|
++++---+---+++
{code}

{code}
== Physical Plan ==
*Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, q#23L]
+- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter

{code}

As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
{{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.

Complete code to re-produce:

{code}
from pyspark import SparkContext, SQLContext

from pyspark.sql import Row, functions

sc = SparkContext()
sqlContext = SQLContext(sc)

data_a = sc.parallelize([
Row(i=1, j=1, k=1),
Row(i=2, j=2, k=2),
Row(i=3, j=3, k=3),
])
table_a = sqlContext.createDataFrame(data_a)
table_a.show()

data_b = sc.parallelize([
Row(j=2, p=1),
Row(j=3, p=2),
])
table_b = sqlContext.createDataFrame(data_b)
table_b.show()

data_c = sc.parallelize([
Row(j=1, k=1, q=0),
Row(j=2, k=2, q=1),
])
table_c = sqlContext.createDataFrame(data_c)
table_c.show()

b = table_b.join(table_a, table_b.j == 

[jira] [Comment Edited] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405669#comment-15405669
 ] 

GUAN Hao edited comment on SPARK-16869 at 8/3/16 10:16 AM:
---

{{i = colaesce(b.i, c.i)}} is actually just a comment.


Oh sorry, some lines were lost after pasting.
{code}
result = b.join(c, (b.i == c.i)
  & (b.j == c.j)
  & (b.k == c.k), 'outer') \
.select(
b.i.alias('b_i'),
c.i.alias('c_i'),
functions.coalesce(b.i, c.i).alias('i'),
functions.coalesce(b.j, c.j).alias('j'),
functions.coalesce(b.k, c.k).alias('k'),
b.p,
c.q,
)

result.explain()
result.show()
{code}


was (Author: raptium):
Oh, {{i = colaesce(b.i, c.i)}} is actually just a comment.

Please try the code at the bottom of the original post.

> Wrong projection when join on columns with the same name which are derived 
> from the same dataframe
> --
>
> Key: SPARK-16869
> URL: https://issues.apache.org/jira/browse/SPARK-16869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: GUAN Hao
>
> I have two DataFrames, both containing a column named *i*, which are derived 
> from the same DataFrame (join).
> {code}
> b
> +---+---+---+---+
> |  j|  p|  i|  k|
> +---+---+---+---+
> |  3|  2|  3|  3|
> |  2|  1|  2|  2|
> +---+---+---+---+
> c
> +---+---+---+---+
> |  j|  k|  q|  i|
> +---+---+---+---+
> |  1|  1|  0|  1|
> |  2|  2|  1|  2|
> +---+---+---+---+
> {code}
> The result of OUTER join of two DataFrames above is:
> {code}
> i = colaesce(b.i, c.i)
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|   1|   1|  1|  1|null|   0|
> |   3|null|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> However, what I got is:
> {code}
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|null|null|  1|  1|null|   0|
> |   3|   3|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> {code}
> == Physical Plan ==
> *Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
> coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, 
> q#23L]
> +- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter
> 
> {code}
> As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
> {{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
> in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.
> Complete code to re-produce:
> {code}
> from pyspark import SparkContext, SQLContext
> from pyspark.sql import Row, functions
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> data_a = sc.parallelize([
> Row(i=1, j=1, k=1),
> Row(i=2, j=2, k=2),
> Row(i=3, j=3, k=3),
> ])
> table_a = sqlContext.createDataFrame(data_a)
> table_a.show()
> data_b = sc.parallelize([
> Row(j=2, p=1),
> Row(j=3, p=2),
> ])
> table_b = sqlContext.createDataFrame(data_b)
> table_b.show()
> data_c = sc.parallelize([
> Row(j=1, k=1, q=0),
> Row(j=2, k=2, q=1),
> ])
> table_c = sqlContext.createDataFrame(data_c)
> table_c.show()
> b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)
> c = table_c.join(table_a, (table_c.j == table_a.j)
>   & (table_c.k == table_a.k)) \
> .drop(table_a.j) \
> .drop(table_a.k)
> b.show()
> c.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405669#comment-15405669
 ] 

GUAN Hao edited comment on SPARK-16869 at 8/3/16 10:14 AM:
---

Oh, {{i = colaesce(b.i, c.i)}} is actually just a comment.

Please try the code at the bottom of the original post.


was (Author: raptium):
Oh, {{ i = colaesce(b.i, c.i) }} is actually just a comment.

Please try the code at the bottom of the original post.

> Wrong projection when join on columns with the same name which are derived 
> from the same dataframe
> --
>
> Key: SPARK-16869
> URL: https://issues.apache.org/jira/browse/SPARK-16869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: GUAN Hao
>
> I have two DataFrames, both containing a column named *i*, which are derived 
> from the same DataFrame (join).
> {code}
> b
> +---+---+---+---+
> |  j|  p|  i|  k|
> +---+---+---+---+
> |  3|  2|  3|  3|
> |  2|  1|  2|  2|
> +---+---+---+---+
> c
> +---+---+---+---+
> |  j|  k|  q|  i|
> +---+---+---+---+
> |  1|  1|  0|  1|
> |  2|  2|  1|  2|
> +---+---+---+---+
> {code}
> The result of OUTER join of two DataFrames above is:
> {code}
> i = colaesce(b.i, c.i)
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|   1|   1|  1|  1|null|   0|
> |   3|null|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> However, what I got is:
> {code}
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|null|null|  1|  1|null|   0|
> |   3|   3|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> {code}
> == Physical Plan ==
> *Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
> coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, 
> q#23L]
> +- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter
> 
> {code}
> As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
> {{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
> in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.
> Complete code to re-produce:
> {code}
> from pyspark import SparkContext, SQLContext
> from pyspark.sql import Row, functions
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> data_a = sc.parallelize([
> Row(i=1, j=1, k=1),
> Row(i=2, j=2, k=2),
> Row(i=3, j=3, k=3),
> ])
> table_a = sqlContext.createDataFrame(data_a)
> table_a.show()
> data_b = sc.parallelize([
> Row(j=2, p=1),
> Row(j=3, p=2),
> ])
> table_b = sqlContext.createDataFrame(data_b)
> table_b.show()
> data_c = sc.parallelize([
> Row(j=1, k=1, q=0),
> Row(j=2, k=2, q=1),
> ])
> table_c = sqlContext.createDataFrame(data_c)
> table_c.show()
> b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)
> c = table_c.join(table_a, (table_c.j == table_a.j)
>   & (table_c.k == table_a.k)) \
> .drop(table_a.j) \
> .drop(table_a.k)
> b.show()
> c.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405669#comment-15405669
 ] 

GUAN Hao commented on SPARK-16869:
--

Oh, {{ i = colaesce(b.i, c.i) }} is actually just a comment.

Please try the code at the bottom of the original post.

> Wrong projection when join on columns with the same name which are derived 
> from the same dataframe
> --
>
> Key: SPARK-16869
> URL: https://issues.apache.org/jira/browse/SPARK-16869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: GUAN Hao
>
> I have two DataFrames, both containing a column named *i*, which are derived 
> from the same DataFrame (join).
> {code}
> b
> +---+---+---+---+
> |  j|  p|  i|  k|
> +---+---+---+---+
> |  3|  2|  3|  3|
> |  2|  1|  2|  2|
> +---+---+---+---+
> c
> +---+---+---+---+
> |  j|  k|  q|  i|
> +---+---+---+---+
> |  1|  1|  0|  1|
> |  2|  2|  1|  2|
> +---+---+---+---+
> {code}
> The result of OUTER join of two DataFrames above is:
> {code}
> i = colaesce(b.i, c.i)
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|   1|   1|  1|  1|null|   0|
> |   3|null|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> However, what I got is:
> {code}
> ++++---+---+++
> | b_i| c_i|   i|  j|  k|   p|   q|
> ++++---+---+++
> |   2|   2|   2|  2|  2|   1|   1|
> |null|null|null|  1|  1|null|   0|
> |   3|   3|   3|  3|  3|   2|null|
> ++++---+---+++
> {code}
> {code}
> == Physical Plan ==
> *Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
> coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, 
> q#23L]
> +- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter
> 
> {code}
> As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
> {{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
> in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.
> Complete code to re-produce:
> {code}
> from pyspark import SparkContext, SQLContext
> from pyspark.sql import Row, functions
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> data_a = sc.parallelize([
> Row(i=1, j=1, k=1),
> Row(i=2, j=2, k=2),
> Row(i=3, j=3, k=3),
> ])
> table_a = sqlContext.createDataFrame(data_a)
> table_a.show()
> data_b = sc.parallelize([
> Row(j=2, p=1),
> Row(j=3, p=2),
> ])
> table_b = sqlContext.createDataFrame(data_b)
> table_b.show()
> data_c = sc.parallelize([
> Row(j=1, k=1, q=0),
> Row(j=2, k=2, q=1),
> ])
> table_c = sqlContext.createDataFrame(data_c)
> table_c.show()
> b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)
> c = table_c.join(table_a, (table_c.j == table_a.j)
>   & (table_c.k == table_a.k)) \
> .drop(table_a.j) \
> .drop(table_a.k)
> b.show()
> c.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16869) Wrong projection when join on columns with the same name which are derived from the same dataframe

2016-08-03 Thread GUAN Hao (JIRA)
GUAN Hao created SPARK-16869:


 Summary: Wrong projection when join on columns with the same name 
which are derived from the same dataframe
 Key: SPARK-16869
 URL: https://issues.apache.org/jira/browse/SPARK-16869
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: GUAN Hao


I have two DataFrames, both containing a column named *i*, which are derived 
from the same DataFrame (join).

{code}
b
+---+---+---+---+
|  j|  p|  i|  k|
+---+---+---+---+
|  3|  2|  3|  3|
|  2|  1|  2|  2|
+---+---+---+---+

c
+---+---+---+---+
|  j|  k|  q|  i|
+---+---+---+---+
|  1|  1|  0|  1|
|  2|  2|  1|  2|
+---+---+---+---+
{code}

The result of OUTER join of two DataFrames above is:

{code}
i = colaesce(b.i, c.i)
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|   1|   1|  1|  1|null|   0|
|   3|null|   3|  3|  3|   2|null|
++++---+---+++
{code}

However, what I got is:

{code}
++++---+---+++
| b_i| c_i|   i|  j|  k|   p|   q|
++++---+---+++
|   2|   2|   2|  2|  2|   1|   1|
|null|null|null|  1|  1|null|   0|
|   3|   3|   3|  3|  3|   2|null|
++++---+---+++
{code}

{code}
== Physical Plan ==
*Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L, 
coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, q#23L]
+- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter

{code}

As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to 
{{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.

Complete code to re-produce:

{code}
from pyspark import SparkContext, SQLContext

from pyspark.sql import Row, functions

sc = SparkContext()
sqlContext = SQLContext(sc)

data_a = sc.parallelize([
Row(i=1, j=1, k=1),
Row(i=2, j=2, k=2),
Row(i=3, j=3, k=3),
])
table_a = sqlContext.createDataFrame(data_a)
table_a.show()

data_b = sc.parallelize([
Row(j=2, p=1),
Row(j=3, p=2),
])
table_b = sqlContext.createDataFrame(data_b)
table_b.show()

data_c = sc.parallelize([
Row(j=1, k=1, q=0),
Row(j=2, k=2, q=1),
])
table_c = sqlContext.createDataFrame(data_c)
table_c.show()

b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)

c = table_c.join(table_a, (table_c.j == table_a.j)
  & (table_c.k == table_a.k)) \
.drop(table_a.j) \
.drop(table_a.k)


b.show()
c.show()
{code}
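Not part of the report, but a sketch of a common way to keep the two branches 
distinguishable in this kind of self-derived join: give each side an explicit 
alias and address columns through the qualified names (whether this avoids the 
mis-resolution above would need to be verified on the affected version).

{code:python}
# Continues from the reproduction above; b and c are the two derived DataFrames.
b2 = b.alias("b2")
c2 = c.alias("c2")

result = b2.join(c2, (functions.col("b2.i") == functions.col("c2.i"))
                 & (functions.col("b2.j") == functions.col("c2.j"))
                 & (functions.col("b2.k") == functions.col("c2.k")), 'outer') \
    .select(
        functions.col("b2.i").alias('b_i'),
        functions.col("c2.i").alias('c_i'),
        functions.coalesce(functions.col("b2.i"), functions.col("c2.i")).alias('i'),
        functions.coalesce(functions.col("b2.j"), functions.col("c2.j")).alias('j'),
        functions.coalesce(functions.col("b2.k"), functions.col("c2.k")).alias('k'),
        functions.col("b2.p"),
        functions.col("c2.q"),
    )

result.explain()
result.show()
{code}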




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) fix the denominator of SparkPi

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Description: 
As the loop iterates n - 1 times, the denominator should also be n - 1, not n.



  was:
As the iteration number is n - 1, then the denominator would also be n, 




>  fix the denominator of SparkPi
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
>
> As the loop iterates n - 1 times, the denominator should also be n - 1, not n.
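For reference, a short PySpark rendering of the computation in question (the 
bundled SparkPi example is Scala; this sketch only mirrors its structure): 
range(1, n) yields n - 1 samples, matching the Scala `1 until n`, so the 
estimate should divide by n - 1.

{code:python}
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

slices = 2
n = 100000 * slices

def inside(_):
    # Sample a point in the [-1, 1] x [-1, 1] square and test whether it falls
    # inside the unit circle.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

# range(1, n) produces n - 1 elements, matching the Scala `1 until n`.
count = sc.parallelize(range(1, n), slices).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly", 4.0 * count / (n - 1))
{code}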



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) fix the denominator of SparkPi

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Description: 
As the loop iterates n - 1 times, the denominator should also be n - 1, not n.



  was:
As the denominator is n, the iteration number should also be n




>  fix the denominator of SparkPi
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
>
> As the loop iterates n - 1 times, the denominator should also be n - 1, not n.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) fix the denominator of SparkPi

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Summary:  fix the denominator of SparkPi  (was: make SparkPi iteration 
number correct)

>  fix the denominator of SparkPi
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
>
> As the denominator is n, the iteration number should also be n



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) make SparkPi iteration number correct

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Attachment: (was: SPARK-16214.patch)

> make SparkPi iteration number correct
> -
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
>
> As the denominator is n, the iteration number should also be n



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) make SparkPi iteration number correct

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Description: 
As the denominator is n, the iteration number should also be n



  was:
As the denominator is n, the iteration number should also be n

A pull request is https://github.com/apache/spark/pull/13910



> make SparkPi iteration number correct
> -
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration number should also be n



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) make SparkPi iteration number correct

2016-06-26 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Description: 
As the denominator is n, the iteration number should also be n

A pull request is https://github.com/apache/spark/pull/13910


  was:
As the denominator is n, the iteration number should also be n



> make SparkPi iteration number correct
> -
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>Assignee: Yang Hao
>Priority: Minor
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration number should also be n
> A pull request is https://github.com/apache/spark/pull/13910



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Attachment: (was: SPARK-16214.patch)

> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Description: 
As the denominator is n, the iteration time should also be n


  was:
As the denominator is n, the iteration time should also be n

pull request is : https://github.com/apache/spark/pull/13910


> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration time should also be n



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Attachment: SPARK-16214.patch

> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao resolved SPARK-16214.
--
Resolution: Resolved

> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Attachment: SPARK-16214.patch

> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
> Attachments: SPARK-16214.patch
>
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) make SparkPi iteration time correct

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Summary: make SparkPi iteration time correct  (was: optimize SparkPi)

> make SparkPi iteration time correct
> ---
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16214) optimize SparkPi

2016-06-25 Thread Yang Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Hao updated SPARK-16214:
-
Summary: optimize SparkPi  (was: calculate pi is not correct)

> optimize SparkPi
> 
>
> Key: SPARK-16214
> URL: https://issues.apache.org/jira/browse/SPARK-16214
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.6.2
>Reporter: Yang Hao
>
> As the denominator is n, the iteration time should also be n
> pull request is : https://github.com/apache/spark/pull/13910



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16214) calculate pi is not correct

2016-06-25 Thread Yang Hao (JIRA)
Yang Hao created SPARK-16214:


 Summary: calculate pi is not correct
 Key: SPARK-16214
 URL: https://issues.apache.org/jira/browse/SPARK-16214
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.6.2
Reporter: Yang Hao


As the denominator is n, the iteration time should also be n

pull request is : https://github.com/apache/spark/pull/13910
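
For reference, a minimal local sketch of the estimate (plain Scala, not the actual SparkPi
example; object and variable names are illustrative) showing why the sampling loop has to
run n times when the estimator divides by n:

{code}
import scala.util.Random

// Minimal local sketch of the Monte Carlo estimate (no SparkContext needed):
// the estimator is 4 * count / n, so the sampling loop must also run n times
// for the denominator to match the number of samples.
object PiSketch {
  def main(args: Array[String]): Unit = {
    val n = 1000000
    val count = (1 to n).count { _ =>
      val x = Random.nextDouble() * 2 - 1
      val y = Random.nextDouble() * 2 - 1
      x * x + y * y <= 1
    }
    println(s"Pi is roughly ${4.0 * count / n}")
  }
}
{code}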



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15859) Optimize the Partition Pruning with Disjunction

2016-06-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-15859:
-

 Summary: Optimize the Partition Pruning with Disjunction
 Key: SPARK-15859
 URL: https://issues.apache.org/jira/browse/SPARK-15859
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Critical


Currently we cannot optimize partition pruning under a disjunction, for example:

{{(part1=2 and col1='abc') or (part1=5 and col1='cde')}}
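
A toy sketch of the idea (illustrative case classes, not the actual Catalyst expressions):
even inside a disjunction, the partition-only conjuncts of each branch can be collected,
and the OR of those per-branch filters (here part1=2 OR part1=5) is enough to prune.

{code}
// Toy predicate model for illustration only.
sealed trait Pred
case class Eq(col: String, value: String) extends Pred
case class And(children: Seq[Pred]) extends Pred

object PruneSketch {
  val partitionCols = Set("part1")

  // Keep only the partition-column equalities of one disjunct.
  def partitionConjuncts(p: Pred): Seq[Eq] = p match {
    case And(cs)                          => cs.flatMap(partitionConjuncts)
    case e @ Eq(c, _) if partitionCols(c) => Seq(e)
    case _                                => Seq.empty
  }

  def main(args: Array[String]): Unit = {
    val disjuncts = Seq(
      And(Seq(Eq("part1", "2"), Eq("col1", "abc"))),
      And(Seq(Eq("part1", "5"), Eq("col1", "cde"))))
    // One partition filter per branch; pruning keeps the union of the branches.
    println(disjuncts.map(partitionConjuncts))
    // List(List(Eq(part1,2)), List(Eq(part1,5)))
  }
}
{code}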



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session

2016-06-07 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318654#comment-15318654
 ] 

Cheng Hao commented on SPARK-15730:
---

[~jameszhouyi], can you please verify this fix?

> [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take 
> effect in spark-sql session
> -
>
> Key: SPARK-15730
> URL: https://issues.apache.org/jira/browse/SPARK-15730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
> --executor-cores 5 --num-executors 31 --master yarn-client --conf 
> spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> spark-sql> use test;
> 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test
> 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at 
> CliDriver.java:376
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at 
> CliDriver.java:376) with 1 output partitions
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 
> (processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no 
> missing parents
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values 
> in memory (estimated size 3.2 KB, free 2.4 GB)
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as 
> bytes in memory (estimated size 1964.0 B, free 2.4 GB)
> 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB)
> 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast 
> at DAGScheduler.scala:1012
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks
> 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes)
> 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 
> on executor id: 10 hostname: 192.168.3.13.
> 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB)
> 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
> 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1)
> 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose 
> tasks have all completed, from pool
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at 
> CliDriver.java:376) finished in 1.937 s
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at 
> CliDriver.java:376, took 1.962631 s
> Time taken: 2.027 seconds
> 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds
> spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE 
> IF EXISTS ${hiveconf:RESULT_TABLE}
> Error in query:
> mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', 
> 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 
> 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 
> 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
> 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', 
> 'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 
> 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', 
> 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 
> 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
> 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
> 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', 
> 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 
> 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 
> 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 
> 'FILEFORMAT', 'TOUCH', 

[jira] [Commented] (SPARK-15034) Use the value of spark.sql.warehouse.dir as the warehouse location instead of using hive.metastore.warehouse.dir

2016-05-25 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300072#comment-15300072
 ] 

Cheng Hao commented on SPARK-15034:
---

[~yhuai], but it probably does not respect the `hive-site.xml`, and lots of users 
will be impacted by this configuration change. Will that be acceptable?
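
An application-style sketch of what affected users would need to do (assuming the Spark
2.0 SparkSession API; the warehouse path below is illustrative only):

{code}
import org.apache.spark.sql.SparkSession

// Users who relied on hive.metastore.warehouse.dir in hive-site.xml would
// have to set the new conf explicitly when building the session.
val spark = SparkSession.builder()
  .appName("warehouse-dir-example")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")  // illustrative path
  .enableHiveSupport()
  .getOrCreate()
{code}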

> Use the value of spark.sql.warehouse.dir as the warehouse location instead of 
> using hive.metastore.warehouse.dir
> 
>
> Key: SPARK-15034
> URL: https://issues.apache.org/jira/browse/SPARK-15034
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Starting from Spark 2.0, spark.sql.warehouse.dir will be the conf to set 
> warehouse location. We will not use hive.metastore.warehouse.dir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2016-05-09 Thread Xin Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276109#comment-15276109
 ] 

Xin Hao edited comment on SPARK-4452 at 5/9/16 8:36 AM:


Since this is an old issue that has impacted Spark since 1.1.0, can the patch be 
merged to Spark 1.6.X? That would be very helpful for Spark 1.6.X users. Thanks.


was (Author: xhao1):
Since this is an old issue that has impacted Spark since 1.1.0, can the patch be 
merged to Spark 1.6.X? Thanks.

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> When an Aggregator is used with ExternalSorter in a task, spark will create 
> many small files and could cause too many files open error during merging.
> Currently, ShuffleMemoryManager does not work well when there are 2 spillable 
> objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used 
> by Aggregator) in this case. Here is an example: Due to the usage of mapside 
> aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may 
> ask as much memory as it can, which is totalMem/numberOfThreads. Then later 
> on when ExternalSorter is created in the same thread, the 
> ShuffleMemoryManager could refuse to allocate more memory to it, since the 
> memory is already given to the previous requested 
> object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling 
> small files(due to the lack of memory)
> I'm currently working on a PR to address these two issues. It will include 
> following changes:
> 1. The ShuffleMemoryManager should not only track the memory usage for each 
> thread, but also the object who holds the memory
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. In this way, if a new object in a thread is requesting 
> memory, the old occupant could be evicted/spilled. Previously the spillable 
> objects trigger spilling by themselves. So one may not trigger spilling even 
> if another object in the same thread needs more memory. After this change The 
> ShuffleMemoryManager could trigger the spilling of an object if it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously 
> ExternalAppendOnlyMap returns an destructive iterator and can not be spilled 
> after the iterator is returned. This should be changed so that even after the 
> iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made 
> change 3 and have a prototype of change 1 and 2 to evict spillable from 
> memory manager, still in progress. I will send a PR when it's done.
> Any feedback or thoughts on this change is highly appreciated !



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2016-05-09 Thread Xin Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276109#comment-15276109
 ] 

Xin Hao commented on SPARK-4452:


Since this is an old issue that has impacted Spark since 1.1.0, can the patch be 
merged to Spark 1.6.X? Thanks.

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> When an Aggregator is used with ExternalSorter in a task, spark will create 
> many small files and could cause too many files open error during merging.
> Currently, ShuffleMemoryManager does not work well when there are 2 spillable 
> objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used 
> by Aggregator) in this case. Here is an example: Due to the usage of mapside 
> aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may 
> ask as much memory as it can, which is totalMem/numberOfThreads. Then later 
> on when ExternalSorter is created in the same thread, the 
> ShuffleMemoryManager could refuse to allocate more memory to it, since the 
> memory is already given to the previous requested 
> object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling 
> small files(due to the lack of memory)
> I'm currently working on a PR to address these two issues. It will include 
> following changes:
> 1. The ShuffleMemoryManager should not only track the memory usage for each 
> thread, but also the object who holds the memory
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. In this way, if a new object in a thread is requesting 
> memory, the old occupant could be evicted/spilled. Previously the spillable 
> objects trigger spilling by themselves. So one may not trigger spilling even 
> if another object in the same thread needs more memory. After this change The 
> ShuffleMemoryManager could trigger the spilling of an object if it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously 
> ExternalAppendOnlyMap returns an destructive iterator and can not be spilled 
> after the iterator is returned. This should be changed so that even after the 
> iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made 
> change 3 and have a prototype of change 1 and 2 to evict spillable from 
> memory manager, still in progress. I will send a PR when it's done.
> Any feedback or thoughts on this change is highly appreciated !



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13894) SQLContext.range should return Dataset[Long]

2016-03-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195274#comment-15195274
 ] 

Cheng Hao commented on SPARK-13894:
---

The existing function "SQLContext.range()" returns an underlying schema whose column 
is named "id"; a lot of unit test code would need to be updated if we changed the 
column name to "value". How about keeping the name "id" unchanged?
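
A minimal spark-shell-style sketch of the current behavior (assuming a sqlContext is
already in scope, as in the shell):

{code}
// The column produced by range() is already named "id", which is what
// the existing tests and downstream code rely on.
val df = sqlContext.range(3)
df.printSchema()
// root
//  |-- id: long (nullable = false)
df.select("id").show()
{code}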

> SQLContext.range should return Dataset[Long]
> 
>
> Key: SPARK-13894
> URL: https://issues.apache.org/jira/browse/SPARK-13894
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> Rather than returning DataFrame, it should return a Dataset[Long]. The 
> documentation should still make it clear that the underlying schema consists 
> of a single long column named "value".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13326) Dataset in spark 2.0.0-SNAPSHOT missing columns

2016-03-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195022#comment-15195022
 ] 

Cheng Hao commented on SPARK-13326:
---

Cannot reproduce it anymore; can you try it again?

> Dataset in spark 2.0.0-SNAPSHOT missing columns
> ---
>
> Key: SPARK-13326
> URL: https://issues.apache.org/jira/browse/SPARK-13326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: koert kuipers
>Priority: Minor
>
> i noticed some things stopped working on datasets in spark 2.0.0-SNAPSHOT, 
> and with a confusing error message (cannot resolved some column with input 
> columns []).
> for example in 1.6.0-SNAPSHOT:
> {noformat}
> scala> val ds = sc.parallelize(1 to 10).toDS
> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
> scala> ds.map(x => Option(x))
> res0: org.apache.spark.sql.Dataset[Option[Int]] = [value: int]
> {noformat}
> and same commands in 2.0.0-SNAPSHOT:
> {noformat}
> scala> val ds = sc.parallelize(1 to 10).toDS
> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
> scala> ds.map(x => Option(x))
> org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input 
> columns: [];
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:283)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:162)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:172)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:176)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:176)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:181)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:121)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> 

[jira] [Commented] (SPARK-13133) When the option --master of spark-submit script is inconsistent with SparkConf.setMaster in Spark appliction code, the behavior of Spark application is difficult to un

2016-02-02 Thread Ji Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127911#comment-15127911
 ] 

Ji Hao commented on SPARK-13133:


Sean Owen, I think you should consider this issue; the error log could be clearer.

> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark appliction code, the behavior of Spark 
> application is difficult to understand
> ---
>
> Key: SPARK-13133
> URL: https://issues.apache.org/jira/browse/SPARK-13133
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Li Ye
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark application code, the behavior is difficult to 
> understand. For example, if the option --master of spark-submit script is 
> yarn-cluster while there is SparkConf.setMaster("local") in Spark application 
> code, the application exit abnormally after about 2 minutes. In driver's log 
> there is an error whose content is "SparkContext did not initialize after 
> waiting for 10 ms. Please check earlier log output for errors. Failing 
> the application".
> When SparkContext is launched, it should be checked whether the option 
> --master of spark-submit script and SparkConf.setMaster in Spark application 
> code are different. If they are different, there should be a clear hint in 
> the driver's log for the developer to troubleshoot.
> I found the same question with me in stackoverflow:
> http://stackoverflow.com/questions/30670933/submit-spark-job-on-yarn-cluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-13133) When the option --master of spark-submit script is inconsistent with SparkConf.setMaster in Spark appliction code, the behavior of Spark application is dif

2016-02-02 Thread Ji Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Hao updated SPARK-13133:
---
Comment: was deleted

(was: Sean Owen, I think you should consider this issue; the error log could be 
clearer.)

> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark appliction code, the behavior of Spark 
> application is difficult to understand
> ---
>
> Key: SPARK-13133
> URL: https://issues.apache.org/jira/browse/SPARK-13133
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Li Ye
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark application code, the behavior is difficult to 
> understand. For example, if the option --master of spark-submit script is 
> yarn-cluster while there is SparkConf.setMaster("local") in Spark application 
> code, the application exit abnormally after about 2 minutes. In driver's log 
> there is an error whose content is "SparkContext did not initialize after 
> waiting for 10 ms. Please check earlier log output for errors. Failing 
> the application".
> When SparkContext is launched, it should be checked whether the option 
> --master of spark-submit script and SparkConf.setMaster in Spark application 
> code are different. If they are different, there should be a clear hint in 
> the driver's log for the developer to troubleshoot.
> I found the same question with me in stackoverflow:
> http://stackoverflow.com/questions/30670933/submit-spark-job-on-yarn-cluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13133) When the option --master of spark-submit script is inconsistent with SparkConf.setMaster in Spark appliction code, the behavior of Spark application is difficult to un

2016-02-02 Thread Ji Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127912#comment-15127912
 ] 

Ji Hao commented on SPARK-13133:


Sean Owen, I think you should consider this issue; the error log could be clearer.

> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark appliction code, the behavior of Spark 
> application is difficult to understand
> ---
>
> Key: SPARK-13133
> URL: https://issues.apache.org/jira/browse/SPARK-13133
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Li Ye
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When the option --master of spark-submit script is inconsistent with 
> SparkConf.setMaster in Spark application code, the behavior is difficult to 
> understand. For example, if the option --master of spark-submit script is 
> yarn-cluster while there is SparkConf.setMaster("local") in Spark application 
> code, the application exit abnormally after about 2 minutes. In driver's log 
> there is an error whose content is "SparkContext did not initialize after 
> waiting for 10 ms. Please check earlier log output for errors. Failing 
> the application".
> When SparkContext is launched, it should be checked whether the option 
> --master of spark-submit script and SparkConf.setMaster in Spark application 
> code are different. If they are different, there should be a clear hint in 
> the driver's log for the developer to troubleshoot.
> I found the same question with me in stackoverflow:
> http://stackoverflow.com/questions/30670933/submit-spark-job-on-yarn-cluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12610) Add Anti Join Operators

2016-01-03 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-12610:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-4226

> Add Anti Join Operators
> ---
>
> Key: SPARK-12610
> URL: https://issues.apache.org/jira/browse/SPARK-12610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> We need to implement the anti join operators to support NOT predicates in 
> subqueries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12610) Add Anti Join Operators

2016-01-03 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-12610:
-

 Summary: Add Anti Join Operators
 Key: SPARK-12610
 URL: https://issues.apache.org/jira/browse/SPARK-12610
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao


We need to implement the anti join operators to support NOT predicates in 
subqueries.
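
A plain-collections illustration of anti-join semantics (not the Catalyst operator
itself, and ignoring the NULL-handling subtleties of NOT IN; the data is made up):

{code}
// Keep the left rows that have NO match on the right side, which is the
// shape a query like `WHERE key NOT IN (SELECT key FROM right)` needs.
object AntiJoinSketch {
  def main(args: Array[String]): Unit = {
    val left      = Seq(1 -> "a", 2 -> "b", 3 -> "c")
    val rightKeys = Set(2, 3)
    val antiJoined = left.filterNot { case (k, _) => rightKeys.contains(k) }
    println(antiJoined)   // List((1,a))
  }
}
{code}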



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12196) Store blocks in different speed storage devices by hierarchy way

2015-12-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072634#comment-15072634
 ] 

Cheng Hao commented on SPARK-12196:
---

Thank you, wei wu, for supporting this feature! 

However, we're trying to avoid changing the existing configuration format, as it 
might impact user applications; besides, in Yarn/Mesos this configuration key will 
not work anymore.

An updated PR will be submitted soon; welcome to join the discussion in the PR.

> Store blocks in different speed storage devices by hierarchy way
> 
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, 
> but x2-x3 lower than SSDs.
> How can we get both good?
> *Solution*
> Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when SSD’s useable space is less than some 
> threshold, getting blocks from HDDs.
> In our implementation, we actually go further. We support a way to build any 
> level hierarchy store access all storage medias (NVM, SSD, HDD etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Usage*
> 1. Set the priority and threshold for each layer in 
> spark.storage.hierarchyStore.
> {code}
> spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
> {code}
> It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all 
> the rest form the last layer.
> 2. Configure each layer's location, user just needs put the keyword like 
> "nvm", "ssd", which are specified in step 1, into local dirs, like 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After then, restart your Spark application, it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8360) Streaming DataFrames

2015-12-02 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-8360:
-
Attachment: StreamingDataFrameProposal.pdf

This is a proposal for streaming dataframes that we were working on; hopefully it is 
helpful for the new design.

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
> Attachments: StreamingDataFrameProposal.pdf
>
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8360) Streaming DataFrames

2015-12-02 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035335#comment-15035335
 ] 

Cheng Hao edited comment on SPARK-8360 at 12/2/15 12:14 PM:


Removed the Google Docs link, as I cannot make it accessible to everyone when using 
the corp account. In the meantime, I attached a PDF doc; hopefully it is helpful.


was (Author: chenghao):
Add some thoughts on StreamingSQL. 
https://docs.google.com/document/u/1/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/pub

Request edit access if you need it.
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
> Attachments: StreamingDataFrameProposal.pdf
>
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8360) Streaming DataFrames

2015-12-01 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035335#comment-15035335
 ] 

Cheng Hao edited comment on SPARK-8360 at 12/2/15 6:19 AM:
---

Add some thoughts on StreamingSQL. 
https://docs.google.com/document/u/1/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/pub

Request edit access if you need it.
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 


was (Author: chenghao):
Add some thoughts on StreamingSQL. 
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8360) Streaming DataFrames

2015-12-01 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035335#comment-15035335
 ] 

Cheng Hao commented on SPARK-8360:
--

Add some thoughts on StreamingSQL. 
https://docs.google.com/document/d/1ZgmDsNLT2XltMI317nNtkUkxBFwaaCkLHtPXjQE6JBY/edit
 

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions

2015-11-30 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-12064:
-

 Summary: Make the SqlParser as trait for better integrated with 
extensions
 Key: SPARK-12064
 URL: https://issues.apache.org/jira/browse/SPARK-12064
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao


`SqlParser` is currently an object, which is hard to reuse in extensions. A better 
implementation would make `SqlParser` a trait, keep all of its implementation 
unchanged, and then add an object called `SqlParser` that inherits from the trait.
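
A hedged sketch of the proposed shape (names and logic are illustrative placeholders,
not the actual Catalyst parser):

{code}
// Move the implementation into a trait so extensions can reuse it,
// and keep a default object with the old name and behavior.
trait SqlParserBase {
  def parse(sql: String): String = s"parsed: $sql"   // placeholder implementation
}

// Default parser keeps the existing name, so callers are unchanged.
object SqlParser extends SqlParserBase

// An extension can now extend the trait and override pieces as needed.
object ExtendedSqlParser extends SqlParserBase {
  override def parse(sql: String): String = super.parse(sql.toLowerCase)
}
{code}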



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions

2015-11-30 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao resolved SPARK-12064.
---
Resolution: Won't Fix

DBX plans to remove the SqlParser in 2.0.

> Make the SqlParser as trait for better integrated with extensions
> -
>
> Key: SPARK-12064
> URL: https://issues.apache.org/jira/browse/SPARK-12064
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>
> `SqlParser` is currently an object, which is hard to reuse in extensions. A better 
> implementation would make `SqlParser` a trait, keep all of its implementation 
> unchanged, and then add an object called `SqlParser` that inherits from the trait.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type

2015-11-11 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001423#comment-15001423
 ] 

Cheng Hao commented on SPARK-10865:
---

1.5.2 is released; I am not sure whether the fix is part of it or not.

> [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
> ---
>
> Key: SPARK-10865
> URL: https://issues.apache.org/jira/browse/SPARK-10865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> As per ceil/ceiling definition,it should get BIGINT return value
> -ceil(DOUBLE a), ceiling(DOUBLE a)
> -Returns the minimum BIGINT value that is equal to or greater than a.
> But in current Spark implementation, it got wrong value type.
> e.g., 
> select ceil(2642.12) from udf_test_web_sales limit 1;
> 2643.0
> In hive implementation, it got return value type like below:
> hive> select ceil(2642.12) from udf_test_web_sales limit 1;
> OK
> 2643



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type

2015-11-11 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001422#comment-15001422
 ] 

Cheng Hao commented on SPARK-10865:
---

We actually follow Hive's behavior here, and I also tested it in MySQL; it works the 
same way.

> [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
> ---
>
> Key: SPARK-10865
> URL: https://issues.apache.org/jira/browse/SPARK-10865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> As per ceil/ceiling definition,it should get BIGINT return value
> -ceil(DOUBLE a), ceiling(DOUBLE a)
> -Returns the minimum BIGINT value that is equal to or greater than a.
> But in current Spark implementation, it got wrong value type.
> e.g., 
> select ceil(2642.12) from udf_test_web_sales limit 1;
> 2643.0
> In hive implementation, it got return value type like below:
> hive> select ceil(2642.12) from udf_test_web_sales limit 1;
> OK
> 2643



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11512:
-

 Summary: Bucket Join
 Key: SPARK-11512
 URL: https://issues.apache.org/jira/browse/SPARK-11512
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao


Sort-merge join on two datasets on the file system that have already been 
partitioned the same way, with the same number of partitions and sorted within each 
partition, so we don't need to sort them again when joining on the 
sorted/partitioned keys (see the sketch after the list below).

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)
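
A hedged sketch of the kind of Data Source-level support intended, written against the
DataFrameWriter.bucketBy/sortBy API (an assumption relative to this ticket; table and
column names are illustrative):

{code}
import org.apache.spark.sql.SparkSession

object BucketJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucket-join-sketch").master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val orders    = Seq((1, 10.0), (2, 20.0)).toDF("customer_id", "amount")
    val customers = Seq((1, "a"), (2, "b")).toDF("customer_id", "name")

    // Both sides bucketed and sorted on the join key with the same bucket count.
    orders.write.bucketBy(4, "customer_id").sortBy("customer_id").saveAsTable("orders_b")
    customers.write.bucketBy(4, "customer_id").sortBy("customer_id").saveAsTable("customers_b")

    // With matching bucketing/sorting, the sort-merge join can skip the extra
    // sort of the already-bucketed, already-sorted sides.
    spark.table("orders_b").join(spark.table("customers_b"), "customer_id").show()
  }
}
{code}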




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990867#comment-14990867
 ] 

Cheng Hao commented on SPARK-11512:
---

Oh, yes, but SPARK-5292 is only about supporting the Hive bucket; more generically, 
we need to add bucket support to the Data Source API. Anyway, I will add a link to 
that jira issue.

> Bucket Join
> ---
>
> Key: SPARK-11512
> URL: https://issues.apache.org/jira/browse/SPARK-11512
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Sort-merge join on two datasets on the file system that have already been 
> partitioned the same way, with the same number of partitions and sorted within 
> each partition, so we don't need to sort them again when joining on the 
> sorted/partitioned keys
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990868#comment-14990868
 ] 

Cheng Hao commented on SPARK-11512:
---

We need to support "bucket" in the DataSource API.

> Bucket Join
> ---
>
> Key: SPARK-11512
> URL: https://issues.apache.org/jira/browse/SPARK-11512
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Sort-merge join on two datasets on the file system that have already been 
> partitioned the same way, with the same number of partitions and sorted within 
> each partition, so we don't need to sort them again when joining on the 
> sorted/partitioned keys
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10371) Optimize sequential projections

2015-10-29 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981650#comment-14981650
 ] 

Cheng Hao commented on SPARK-10371:
---

Do you mean eliminating the common sub-expressions within the projections?

> Optimize sequential projections
> ---
>
> Key: SPARK-10371
> URL: https://issues.apache.org/jira/browse/SPARK-10371
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>
> In ML pipelines, each transformer/estimator appends new columns to the input 
> DataFrame. For example, it might produce DataFrames like the following 
> columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), 
> and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c 
> and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used.
> It would be nice to detect this pattern and re-use intermediate values.
> {code}
> val input = sqlContext.range(10)
> val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 
> 2)
> output.explain(true)
> == Parsed Logical Plan ==
> 'Project [*,('x * 2) AS y#254]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Analyzed Logical Plan ==
> id: bigint, x: bigint, y: bigint
> Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Optimized Logical Plan ==
> Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
>  LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Physical Plan ==
> TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS 
> y#254L]
>  Scan PhysicalRDD[id#252L]
> Code Generation: true
> input: org.apache.spark.sql.DataFrame = [id: bigint]
> output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979646#comment-14979646
 ] 

Cheng Hao commented on SPARK-11330:
---

[~saif.a.ellafi] I've checked with 1.5.0 and confirmed it can be reproduced; however, 
it does not exist in the latest master branch. I am still digging into when and how 
it was fixed.

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979699#comment-14979699
 ] 

Cheng Hao edited comment on SPARK-11330 at 10/29/15 2:48 AM:
-

OK, it seems this was solved in https://issues.apache.org/jira/browse/SPARK-10859, so 
it should no longer be a problem in 1.5.2 or 1.6.0.


was (Author: chenghao):
OK, it seems this was solved in https://issues.apache.org/jira/browse/SPARK-11330, so 
it should no longer be a problem in 1.5.2 or 1.6.0.

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979699#comment-14979699
 ] 

Cheng Hao commented on SPARK-11330:
---

OK, it seems this was solved in https://issues.apache.org/jira/browse/SPARK-11330, so it 
should no longer be a problem in 1.5.2 or 1.6.0.

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce_50k.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-27 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977600#comment-14977600
 ] 

Cheng Hao edited comment on SPARK-11330 at 10/28/15 2:28 AM:
-

Hi [~saif.a.ellafi], I've tried the code below:
{code}
case class Spark11330(account_id: Int, product: String, vint: String,
  band: String, age: Int, mb: String, mm: String,
  balance: Float, balancec: Float)

test("SPARK-11330: Filter operation on StringType after groupBy PERSISTED 
brings no results") {
withTempPath { f =>
  // generate the more data.
  val d = Seq(
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
1000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
2000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200809", 
2000.0f, 2000.0f),
Spark11330(2, "product2", "2007-01-01", "band3", 29, "mb1", "200809", 
2010.0f, 3000.0f))

  val data = List.tabulate[Seq[Spark11330]](10) { i => d }.flatten

  // save as parquet file in local disk
  sqlContext.sparkContext.parallelize(data, 4)

.toDF().write.format("parquet").mode(SaveMode.Overwrite).save(f.getAbsolutePath)

  // reproduce
  val df = sqlContext.read.parquet(f.getAbsolutePath)

  val f1 = df.groupBy("vint").count().persist().filter("vint = 
'2007-01-01'").first
  val f2 = df.groupBy("vint").count().filter("vint = '2007-01-01'").first

  assert(f1 == f2)

  val res = df
.groupBy("product", "band", "age", "vint", "mb", "mm")
.agg(
  count($"account_id").as("N"),
  sum($"balance").as("balance_eom"),
  sum($"balancec").as("balance_eoc")).persist()

  val c1 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  res.unpersist()
  val c2 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  assert(c1.sameElements(c2))
}
  }
{code}

Everything seems to work fine, and I am not sure if I missed something; can you try 
to reproduce the issue based on my code?


was (Author: chenghao):
Hi, [~saif.a.ellafi], I've tried the code like below:
{code}
case class Spark11330(account_id: Int, product: String, vint: String,
  band: String, age: Int, mb: String, mm: String,
  balance: Float, balancec: Float)

test("SPARK-11330: Filter operation on StringType after groupBy PERSISTED 
brings no results") {
withTempPath { f =>
  val d = Seq(
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
1000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
2000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200809", 
2000.0f, 2000.0f),
Spark11330(2, "product2", "2007-01-01", "band3", 29, "mb1", "200809", 
2010.0f, 3000.0f))

  val data = List.tabulate[Seq[Spark11330]](10) { i => d }.flatten

  sqlContext.sparkContext.parallelize(data, 4)

.toDF().write.format("parquet").mode(SaveMode.Overwrite).save(f.getAbsolutePath)

  val df = sqlContext.read.parquet(f.getAbsolutePath)

  val f1 = df.groupBy("vint").count().persist().filter("vint = 
'2007-01-01'").first
  val f2 = df.groupBy("vint").count().filter("vint = '2007-01-01'").first

  assert(f1 == f2)

  val res = df
.groupBy("product", "band", "age", "vint", "mb", "mm")
.agg(
  count($"account_id").as("N"),
  sum($"balance").as("balance_eom"),
  sum($"balancec").as("balance_eoc")).persist()

  val c1 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  res.unpersist()
  val c2 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  assert(c1.sameElements(c2))
}
  }
{code}

Seems everything works fine, I am not sure if I missed something, can you try 
to reproduce the issue based on my code?

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif 

[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-27 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977600#comment-14977600
 ] 

Cheng Hao commented on SPARK-11330:
---

Hi [~saif.a.ellafi], I've tried the code below:
{code}
case class Spark11330(account_id: Int, product: String, vint: String,
  band: String, age: Int, mb: String, mm: String,
  balance: Float, balancec: Float)

test("SPARK-11330: Filter operation on StringType after groupBy PERSISTED 
brings no results") {
withTempPath { f =>
  val d = Seq(
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
1000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200808", 
2000.0f, 2000.0f),
Spark11330(1, "product1", "2007-01-01", "band1", 27, "mb1", "200809", 
2000.0f, 2000.0f),
Spark11330(2, "product2", "2007-01-01", "band3", 29, "mb1", "200809", 
2010.0f, 3000.0f))

  val data = List.tabulate[Seq[Spark11330]](10) { i => d }.flatten

  sqlContext.sparkContext.parallelize(data, 4)

.toDF().write.format("parquet").mode(SaveMode.Overwrite).save(f.getAbsolutePath)

  val df = sqlContext.read.parquet(f.getAbsolutePath)

  val f1 = df.groupBy("vint").count().persist().filter("vint = 
'2007-01-01'").first
  val f2 = df.groupBy("vint").count().filter("vint = '2007-01-01'").first

  assert(f1 == f2)

  val res = df
.groupBy("product", "band", "age", "vint", "mb", "mm")
.agg(
  count($"account_id").as("N"),
  sum($"balance").as("balance_eom"),
  sum($"balancec").as("balance_eoc")).persist()

  val c1 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  res.unpersist()
  val c2 = res.select("vint", 
"mm").filter("vint='2007-01-01'").select("mm").distinct.collect
  assert(c1.sameElements(c2))
}
  }
{code}

Everything seems to work fine, and I am not sure if I missed something; can you try 
to reproduce the issue based on my code?

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, eg IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used KRYO serializer when stored this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = 

[jira] [Updated] (SPARK-9735) Auto infer partition schema of HadoopFsRelation should should respected the user specified one

2015-10-20 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-9735:
-
Description: 
This code is copied from hadoopFsRelationSuite.scala

{code}
partitionedTestDF = (for {
i <- 1 to 3
p2 <- Seq("foo", "bar")
  } yield (i, s"val_$i", 1, p2)).toDF("a", "b", "p1", "p2")

withTempPath { file =>
  val input = partitionedTestDF.select('a, 'b, 
'p1.cast(StringType).as('ps), 'p2)

  input
.write
.format(dataSourceName)
.mode(SaveMode.Overwrite)
.partitionBy("ps", "p2")
.saveAsTable("t")

  input
.write
.format(dataSourceName)
.mode(SaveMode.Append)
.partitionBy("ps", "p2")
.saveAsTable("t")

  val realData = input.collect()
  withTempTable("t") {
checkAnswer(sqlContext.table("t"), realData ++ realData)
  }
}

java.lang.ClassCastException: java.lang.Integer cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
07:44:01.344 ERROR org.apache.spark.executor.Executor: Exception in task 14.0 
in stage 3.0 (TID 206)
java.lang.ClassCastException: java.lang.Integer cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-10-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958524#comment-14958524
 ] 

Cheng Hao commented on SPARK-4226:
--

[~nadenf] Actually I am working on it right now, and the first PR is ready. It 
would be greatly appreciated if you could try 
https://github.com/apache/spark/pull/9055 in your local testing; let me know if 
you find any problems or bugs.

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = HiveContext(hc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near, future release (1.3, maybe?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11076) Decimal Support for Ceil/Floor

2015-10-12 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11076:
-

 Summary: Decimal Support for Ceil/Floor
 Key: SPARK-11076
 URL: https://issues.apache.org/jira/browse/SPARK-11076
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao


Currently, Ceil & Floor don't support decimal inputs, but Hive does.
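
For illustration, a minimal sketch of the kind of query this targets, assuming a HiveContext bound to `sqlContext` in the shell; the literal values are arbitrary. Hive evaluates these directly on DECIMAL, while Spark SQL currently does not:
{code}
// Minimal sketch only: ceil/floor over DECIMAL inputs, which Hive accepts
// but Spark SQL does not yet handle. Assumes a HiveContext named sqlContext.
sqlContext.sql(
  "SELECT ceil(CAST(3.1415 AS DECIMAL(10,4))), floor(CAST(-3.1415 AS DECIMAL(10,4)))"
).show()
{code}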



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11041) Add (NOT) IN / EXISTS support for predicates

2015-10-09 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao closed SPARK-11041.
-
Resolution: Duplicate

> Add (NOT) IN / EXISTS support for predicates
> 
>
> Key: SPARK-11041
> URL: https://issues.apache.org/jira/browse/SPARK-11041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11041) Add (NOT) IN / EXISTS support for predicates

2015-10-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11041:
-

 Summary: Add (NOT) IN / EXISTS support for predicates
 Key: SPARK-11041
 URL: https://issues.apache.org/jira/browse/SPARK-11041
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10992) Partial Aggregation Support for Hive UDAF

2015-10-07 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-10992:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-4366

> Partial Aggregation Support for Hive UDAF
> -
>
> Key: SPARK-10992
> URL: https://issues.apache.org/jira/browse/SPARK-10992
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10831) Spark SQL Configuration missing in the doc

2015-09-25 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10831:
-

 Summary: Spark SQL Configuration missing in the doc
 Key: SPARK-10831
 URL: https://issues.apache.org/jira/browse/SPARK-10831
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Reporter: Cheng Hao


E.g.
spark.sql.codegen
spark.sql.planner.sortMergeJoin
spark.sql.dialect
spark.sql.caseSensitive
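
For reference, a hedged sketch of how these keys are typically set in a 1.5-era shell or application; the values are illustrative only, not recommendations:
{code}
// Illustrative only: SQL configuration keys can be set on the SQLContext/HiveContext
// or through a SET statement; the values here are examples, not recommendations.
sqlContext.setConf("spark.sql.codegen", "true")
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
sqlContext.setConf("spark.sql.caseSensitive", "false")
sqlContext.sql("SET spark.sql.dialect=hiveql")
{code}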



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-09-24 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10829:
-

 Summary: Scan DataSource with predicate expression combine 
partition key and attributes doesn't work
 Key: SPARK-10829
 URL: https://issues.apache.org/jira/browse/SPARK-10829
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker


To reproduce that with the code:
{code}
withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  withTempPath { dir =>
val path = s"${dir.getCanonicalPath}/part=1"
(1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)

// If the "part = 1" filter gets pushed down, this query will throw an 
exception since
// "part" is not a valid column in the actual Parquet file
checkAnswer(
  sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)"),
  (2 to 3).map(i => Row(i, i.toString, 1)))
  }
}
{code}
We expect the result as:
{code}
2, 1
3, 1
{code}
But we got:
{code}
1, 1
2, 1
3, 1
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-23 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904778#comment-14904778
 ] 

Cheng Hao commented on SPARK-10733:
---

[~jameszhouyi] Can you please apply the patch at 
https://github.com/chenghao-intel/spark/commit/91af33397100802d6ba577a3f423bb47d5a761ea
 and try your workload? Be sure to set the log level to `INFO`.

[~andrewor14] [~yhuai] One possibility is that the Sort-Merge-Join eats up all of the 
memory: Sort-Merge-Join does not free its memory until we finish iterating over 
the entire join result, while the partial aggregation consumes that join-result 
iterator directly, which can leave no memory at all for the aggregation.
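
To make the suspected interaction concrete, a hypothetical sketch of the query shape being described, where a sort-merge join's result iterator feeds a partial hash aggregation in the same task; the table and column names are made up:
{code}
// Hypothetical query shape only: with broadcast joins disabled, the equi-join
// below becomes a sort-merge join, and the partial aggregation consumes its
// result iterator, so both operators compete for the same task memory.
import org.apache.spark.sql.functions.count

sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "0")

val joined = sqlContext.table("big_fact").join(sqlContext.table("big_dim"), "key")
val aggregated = joined.groupBy("key").agg(count("value").as("n"))
aggregated.count()
{code}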

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14802912#comment-14802912
 ] 

Cheng Hao commented on SPARK-10474:
---

The root reason for this failure is 
`TungstenAggregationIterator.switchToSortBasedAggregation`: the hash aggregation has 
already eaten up the memory, so we cannot allocate any memory when switching to the 
sort-based aggregation, not even for spilling.

I posted a workaround PR for discussion.

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA

[jira] [Comment Edited] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14802912#comment-14802912
 ] 

Cheng Hao edited comment on SPARK-10474 at 9/17/15 1:48 PM:


The root reason for this failure is the trigger condition for switching from hash-based 
aggregation to sort-based aggregation in the `TungstenAggregationIterator`: the current 
logic switches to sort-based aggregation once no more memory can be allocated, but since 
no memory is left at that point, the data spill also fails in 
UnsafeExternalSorter.initializeForWriting.

I posted a workaround PR for discussion.


was (Author: chenghao):
The root reason for this failure, is because of the 
`TungstenAggregationIterator.switchToSortBasedAggregation`, as it's eat out 
memory by HashAggregation, and then, we cannot allocate memory when turn the 
sort-based aggregation even in the spilling time.

I post a workaround solution PR for discussion.

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) 

[jira] [Commented] (SPARK-10606) Cube/Rollup/GrpSet doesn't create the correct plan when group by is on something other than an AttributeReference

2015-09-16 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791499#comment-14791499
 ] 

Cheng Hao commented on SPARK-10606:
---

[~rhbutani] Which version are you using? I actually fixed this bug in 
SPARK-8972, and the fix should be included in 1.5. Can you try it with 1.5?

> Cube/Rollup/GrpSet doesn't create the correct plan when group by is on 
> something other than an AttributeReference
> -
>
> Key: SPARK-10606
> URL: https://issues.apache.org/jira/browse/SPARK-10606
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Harish Butani
>Priority: Critical
>
> Consider the following table: t(a : String, b : String) and the query
> {code}
> select a, concat(b, '1'), count(*)
> from t
> group by a, concat(b, '1') with cube
> {code}
> The projections in the Expand operator are not setup correctly. The expand 
> logic in Analyzer:expand is comparing grouping expressions against 
> child.output. So {{concat(b, '1')}} is never mapped to a null Literal.  
> A simple fix is to add a Rule to introduce a Projection below the 
> Cube/Rollup/GrpSet operator that additionally projects the   
> groupingExpressions that are missing in the child.
> Marking this as Critical, because you get wrong results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746642#comment-14746642
 ] 

Cheng Hao commented on SPARK-4226:
--

Thank you [~brooks], you're right! I meant that it makes the implementation more 
complicated, e.g. having to resolve and split the conjunction in the condition; 
that's also what I was trying to avoid in my PR by using the anti-join.
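
As a stop-gap while IN/EXISTS subquery predicates are unsupported, the query from the report can be rewritten with joins that HiveQL in Spark SQL already parses. A hedged sketch, assuming a HiveContext named `hc` as in the report and using the table's `id` column:
{code}
// Workaround sketch only: express the IN-subquery as a LEFT SEMI JOIN, and
// approximate NOT IN with a left outer join plus a null filter (the anti-join
// shape mentioned above), ignoring NULL-key semantics.
val inRewrite = hc.sql("""
  SELECT s.id
  FROM sparkbug s
  LEFT SEMI JOIN (SELECT id FROM sparkbug WHERE id IN (2, 3)) t
    ON s.id = t.id
""")

val notInRewrite = hc.sql("""
  SELECT s.id
  FROM sparkbug s
  LEFT OUTER JOIN (SELECT id FROM sparkbug WHERE id IN (2, 3)) t
    ON s.id = t.id
  WHERE t.id IS NULL
""")
{code}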

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = HiveContext(hc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near, future release (1.3, maybe?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744969#comment-14744969
 ] 

Cheng Hao commented on SPARK-10474:
---

The root cause of the exception is that the executor doesn't have enough memory for 
external sorting (UnsafeXXXSorter). 
The memory available for the sorting is MAX_JVM_HEAP * spark.shuffle.memoryFraction 
* spark.shuffle.safetyFraction.

So a workaround is to give the JVM more memory, or to raise the Spark conf keys 
"spark.shuffle.memoryFraction" (0.2 by default) and 
"spark.shuffle.safetyFraction" (0.8 by default).

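A hedged sketch of applying that workaround when building the context; the numbers are illustrative only, and the same keys can also be passed as --conf options to spark-submit:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only: a larger executor heap plus higher shuffle memory
// fractions, per the workaround described above.
val conf = new SparkConf()
  .setAppName("aggregation-memory-workaround")
  .set("spark.executor.memory", "8g")
  .set("spark.shuffle.memoryFraction", "0.4")  // default is 0.2
  .set("spark.shuffle.safetyFraction", "0.9")  // default is 0.8

val sc = new SparkContext(conf)
{code}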

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a 

[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744966#comment-14744966
 ] 

Cheng Hao commented on SPARK-10466:
---

[~naliazheli] That's an unrelated issue; it would be better to subscribe to the Spark 
mailing list and ask the question there in English. 
See http://spark.apache.org/community.html

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if we have data spill, it will cause assert exception, 
> the follow code can reproduce that
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> 

[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745008#comment-14745008
 ] 

Cheng Hao commented on SPARK-10474:
---

But given the current implementation, we'd better not throw an exception when an 
(off-heap) memory acquisition cannot be satisfied; maybe we should use fixed memory 
allocations for both the data page and the pointer array. What do you think?

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> store_sales is a large fact table and item is a small dimension table.
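
A rough DataFrame-API sketch of the conditional-count pattern used in the quoted query 
(assuming a Spark 1.5-era SQLContext with the same tables registered; only two of the 
sixteen class ids are shown, the rest follow the same shape):

{code:scala}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{count, when}

// Sketch only: assumes store_sales and item are registered tables, as in the query above.
def conditionalCounts(sqlContext: SQLContext) = {
  val ss = sqlContext.table("store_sales")
  val i  = sqlContext.table("item")

  ss.join(i, ss("ss_item_sk") === i("i_item_sk"))
    .filter(i("i_category") === "Books" && ss("ss_customer_sk").isNotNull)
    .groupBy(ss("ss_customer_sk"))
    .agg(
      count(when(i("i_class_id") === 1, 1)).as("id1"),  // CASE WHEN i_class_id=1 THEN 1 ELSE NULL END
      count(when(i("i_class_id") === 3, 1)).as("id3"),  // remaining class ids follow the same shape
      count(ss("ss_item_sk")).as("items")
    )
    .filter("items > 5")                                // the HAVING count(ss_item_sk) > 5 clause
}
{code}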



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744967#comment-14744967
 ] 

Cheng Hao commented on SPARK-10466:
---

[~naliazheli] That is unrelated to this issue; please subscribe to the Spark 
mailing list and ask your question in English there. 
See http://spark.apache.org/community.html

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if data spills, it will trigger an assertion failure; 
> the following code can reproduce it:
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-15 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-10466:
--
Comment: was deleted

(was: [~naliazheli] That is unrelated to this issue; please subscribe to the 
Spark mailing list and ask your question in English there. 
See http://spark.apache.org/community.html)

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> In sort-based shuffle, if data spills, it will trigger an assertion failure; 
> the following code can reproduce it:
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> 
