[jira] [Updated] (SPARK-18704) CrossValidator should preserve more tuning statistics

2016-12-03 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-18704:
---
Description: 
Currently CrossValidator trains (k-fold * paramMaps) different models during the 
tuning process, yet it only passes the average metrics to CrossValidatorModel. As a 
result, important information such as the per-fold variance for each paramMap cannot 
be retrieved, and users cannot tell whether the chosen k is appropriate. Since 
CrossValidator is relatively expensive, we probably want to get the most out of the 
tuning process.

Just want to see if this sounds good. In my opinion, this can be done either by 
passing a metrics matrix to the CrossValidatorModel, or by introducing a 
CrossValidatorSummary. I would vote for introducing a TuningSummary class, which 
could also be used by TrainValidationSplit. The summary could present better 
statistics for the tuning process, something like a DataFrame:
+---------------+------------+--------+-----------------+
|elasticNetParam|fitIntercept|regParam|metrics          |
+---------------+------------+--------+-----------------+
|0.0            |true        |0.1     |9.747795248932505|
|0.0            |true        |0.01    |9.751942357398603|
|0.0            |false       |0.1     |9.71727627087487 |
|0.0            |false       |0.01    |9.721149803723822|
|0.5            |true        |0.1     |9.719358515436005|
|0.5            |true        |0.01    |9.748121645368501|
|0.5            |false       |0.1     |9.687771328829479|
|0.5            |false       |0.01    |9.717304811419261|
|1.0            |true        |0.1     |9.696769467196487|
|1.0            |true        |0.01    |9.744325276259957|
|1.0            |false       |0.1     |9.665822167122172|
|1.0            |false       |0.01    |9.713484065511892|
+---------------+------------+--------+-----------------+

Using the DataFrame, users can better understand the effect of different 
parameters.
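
For reference, here is a rough, hedged sketch (not a proposed TuningSummary API) of how 
such a table can already be assembled by zipping the estimator param maps with the 
average metrics. It assumes a configured CrossValidator named {{cv}}, a training 
DataFrame named {{training}}, a SparkSession named {{spark}}, and the three 
LinearRegression params shown above; the column handling is purely illustrative.

{code}
// Sketch only: build a per-paramMap summary DataFrame from CrossValidator results.
val cvModel = cv.fit(training)

val rows = cv.getEstimatorParamMaps.zip(cvModel.avgMetrics).map { case (paramMap, metric) =>
  // Turn each ParamMap into name -> value strings so we can pick out the columns we want.
  val params = paramMap.toSeq.map(pair => pair.param.name -> pair.value.toString).toMap
  (params.getOrElse("elasticNetParam", ""),
   params.getOrElse("fitIntercept", ""),
   params.getOrElse("regParam", ""),
   metric)
}

val summary = spark.createDataFrame(rows.toSeq)
  .toDF("elasticNetParam", "fitIntercept", "regParam", "metrics")
summary.show(false)
{code}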

Another thing we should improve is to include the paramMaps in the 
CrossValidatorModel (or TrainValidationSplitModel) to allow meaningful 
serialization. Keeping only the metrics without ParamMaps does not really help 
model reuse.


  was:
Currently CrossValidator trains (k-fold * paramMaps) different models during the 
tuning process, yet it only passes the average metrics to CrossValidatorModel. As a 
result, important information such as the per-fold variance for each paramMap cannot 
be retrieved, and users cannot tell whether the chosen k is appropriate. Since 
CrossValidator is relatively expensive, we probably want to get the most out of the 
tuning process.

Just want to see if this sounds good. In my opinion, this can be done either by 
passing a metrics matrix to the CrossValidatorModel, or by introducing a 
CrossValidatorSummary. I would vote for introducing a TuningSummary class, which 
could also be used by TrainValidationSplit. The summary could present better 
statistics for the tuning process, something like a DataFrame:
+---------------+------------+--------+-----------------+
|elasticNetParam|fitIntercept|regParam|metrics          |
+---------------+------------+--------+-----------------+
|0.0            |true        |0.1     |9.747795248932505|
|0.0            |true        |0.01    |9.751942357398603|
|0.0            |false       |0.1     |9.71727627087487 |
|0.0            |false       |0.01    |9.721149803723822|
|0.5            |true        |0.1     |9.719358515436005|
|0.5            |true        |0.01    |9.748121645368501|
|0.5            |false       |0.1     |9.687771328829479|
|0.5            |false       |0.01    |9.717304811419261|
|1.0            |true        |0.1     |9.696769467196487|
|1.0            |true        |0.01    |9.744325276259957|
|1.0            |false       |0.1     |9.665822167122172|
|1.0            |false       |0.01    |9.713484065511892|
+---------------+------------+--------+-----------------+

Using the DataFrame, users can better understand the effect of different 
parameters.





> CrossValidator should preserve more tuning statistics
> -
>
> Key: SPARK-18704
> URL: https://issues.apache.org/jira/browse/SPARK-18704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Currently CrossValidator trains (k-fold * paramMaps) different models 
> during the tuning process, yet it only passes the average metrics to 
> CrossValidatorModel. As a result, important information such as the per-fold 
> variance for each paramMap cannot be retrieved, and users cannot tell whether 
> the chosen k is appropriate. Since CrossValidator is relatively expensive, we 
> probably want to get the most out of the tuning process.
> Just want to see if this sounds good. In my opinion, this can 

[jira] [Commented] (SPARK-18704) CrossValidator should preserve more tuning statistics

2016-12-03 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719499#comment-15719499
 ] 

yuhao yang commented on SPARK-18704:


One implementation of the tuning summary is available at 
https://github.com/hhbyyh/spark/tree/tuningsummary/mllib/src/main/scala/org/apache/spark/ml/tuning
for anyone who is interested.

> CrossValidator should preserve more tuning statistics
> -
>
> Key: SPARK-18704
> URL: https://issues.apache.org/jira/browse/SPARK-18704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Currently CrossValidator trains (k-fold * paramMaps) different models 
> during the tuning process, yet it only passes the average metrics to 
> CrossValidatorModel. As a result, important information such as the per-fold 
> variance for each paramMap cannot be retrieved, and users cannot tell whether 
> the chosen k is appropriate. Since CrossValidator is relatively expensive, we 
> probably want to get the most out of the tuning process.
> Just want to see if this sounds good. In my opinion, this can be done either 
> by passing a metrics matrix to the CrossValidatorModel, or by introducing a 
> CrossValidatorSummary. I would vote for introducing a TuningSummary class, 
> which could also be used by TrainValidationSplit. The summary could present 
> better statistics for the tuning process, something like a DataFrame:
> +---------------+------------+--------+-----------------+
> |elasticNetParam|fitIntercept|regParam|metrics          |
> +---------------+------------+--------+-----------------+
> |0.0            |true        |0.1     |9.747795248932505|
> |0.0            |true        |0.01    |9.751942357398603|
> |0.0            |false       |0.1     |9.71727627087487 |
> |0.0            |false       |0.01    |9.721149803723822|
> |0.5            |true        |0.1     |9.719358515436005|
> |0.5            |true        |0.01    |9.748121645368501|
> |0.5            |false       |0.1     |9.687771328829479|
> |0.5            |false       |0.01    |9.717304811419261|
> |1.0            |true        |0.1     |9.696769467196487|
> |1.0            |true        |0.01    |9.744325276259957|
> |1.0            |false       |0.1     |9.665822167122172|
> |1.0            |false       |0.01    |9.713484065511892|
> +---------------+------------+--------+-----------------+
> Using the DataFrame, users can better understand the effect of different 
> parameters.






[jira] [Created] (SPARK-18704) CrossValidator should preserve more tuning statistics

2016-12-03 Thread yuhao yang (JIRA)
yuhao yang created SPARK-18704:
--

 Summary: CrossValidator should preserve more tuning statistics
 Key: SPARK-18704
 URL: https://issues.apache.org/jira/browse/SPARK-18704
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: yuhao yang
Priority: Minor


Currently CrossValidator trains (k-fold * paramMaps) different models during the 
tuning process, yet it only passes the average metrics to CrossValidatorModel. As a 
result, important information such as the per-fold variance for each paramMap cannot 
be retrieved, and users cannot tell whether the chosen k is appropriate. Since 
CrossValidator is relatively expensive, we probably want to get the most out of the 
tuning process.

Just want to see if this sounds good. In my opinion, this can be done either by 
passing a metrics matrix to the CrossValidatorModel, or by introducing a 
CrossValidatorSummary. I would vote for introducing a TuningSummary class, which 
could also be used by TrainValidationSplit. The summary could present better 
statistics for the tuning process, something like a DataFrame:
+---------------+------------+--------+-----------------+
|elasticNetParam|fitIntercept|regParam|metrics          |
+---------------+------------+--------+-----------------+
|0.0            |true        |0.1     |9.747795248932505|
|0.0            |true        |0.01    |9.751942357398603|
|0.0            |false       |0.1     |9.71727627087487 |
|0.0            |false       |0.01    |9.721149803723822|
|0.5            |true        |0.1     |9.719358515436005|
|0.5            |true        |0.01    |9.748121645368501|
|0.5            |false       |0.1     |9.687771328829479|
|0.5            |false       |0.01    |9.717304811419261|
|1.0            |true        |0.1     |9.696769467196487|
|1.0            |true        |0.01    |9.744325276259957|
|1.0            |false       |0.1     |9.665822167122172|
|1.0            |false       |0.01    |9.713484065511892|
+---------------+------------+--------+-----------------+

Using the DataFrame, users can better understand the effect of different 
parameters.









[jira] [Assigned] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM

2016-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18700:


Assignee: (was: Apache Spark)

> getCached in HiveMetastoreCatalog not thread safe cause driver OOM
> --
>
> Key: SPARK-18700
> URL: https://issues.apache.org/jira/browse/SPARK-18700
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Li Yuanjian
>
> In our Spark SQL platform, each query uses the same HiveContext on an 
> independent thread, and new data is appended to tables as new partitions every 
> 30 min. After a new partition is added to table T, we should call refreshTable 
> to clear T’s entry in cachedDataSourceTables so that the new partition becomes 
> searchable. 
> For a table with many partitions and files (far more than 
> spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table 
> T will start a job to fetch all FileStatus objects in the listLeafFiles 
> function. Because of the huge number of files, the job runs for several 
> seconds; during that time, new queries of table T will also start new jobs to 
> fetch FileStatus, because getCached is not thread safe. This finally causes a 
> driver OOM.
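
For readers following along, a minimal illustrative sketch (not the actual 
HiveMetastoreCatalog code) of the kind of guard that removes the race: if concurrent 
readers of the same table load through a single Guava cache, the expensive file listing 
runs once per table instead of once per query thread. {{QualifiedTableName}} and 
{{buildLogicalPlan}} below are placeholders.

{code}
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Placeholder key type and loader for the sketch.
case class QualifiedTableName(database: String, name: String)
def buildLogicalPlan(table: QualifiedTableName): LogicalPlan = ???  // runs listLeafFiles etc.

// get() on a LoadingCache blocks concurrent callers for the same key, so only one
// thread pays for the partition/file discovery; the others reuse the cached plan.
val cachedDataSourceTables: LoadingCache[QualifiedTableName, LogicalPlan] =
  CacheBuilder.newBuilder()
    .maximumSize(1000)
    .build(new CacheLoader[QualifiedTableName, LogicalPlan] {
      override def load(table: QualifiedTableName): LogicalPlan = buildLogicalPlan(table)
    })
{code}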






[jira] [Commented] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM

2016-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719456#comment-15719456
 ] 

Apache Spark commented on SPARK-18700:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/16135

> getCached in HiveMetastoreCatalog not thread safe cause driver OOM
> --
>
> Key: SPARK-18700
> URL: https://issues.apache.org/jira/browse/SPARK-18700
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Li Yuanjian
>
> In our Spark SQL platform, each query uses the same HiveContext on an 
> independent thread, and new data is appended to tables as new partitions every 
> 30 min. After a new partition is added to table T, we should call refreshTable 
> to clear T’s entry in cachedDataSourceTables so that the new partition becomes 
> searchable. 
> For a table with many partitions and files (far more than 
> spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table 
> T will start a job to fetch all FileStatus objects in the listLeafFiles 
> function. Because of the huge number of files, the job runs for several 
> seconds; during that time, new queries of table T will also start new jobs to 
> fetch FileStatus, because getCached is not thread safe. This finally causes a 
> driver OOM.






[jira] [Assigned] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM

2016-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18700:


Assignee: Apache Spark

> getCached in HiveMetastoreCatalog not thread safe cause driver OOM
> --
>
> Key: SPARK-18700
> URL: https://issues.apache.org/jira/browse/SPARK-18700
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Li Yuanjian
>Assignee: Apache Spark
>
> In our Spark SQL platform, each query uses the same HiveContext on an 
> independent thread, and new data is appended to tables as new partitions every 
> 30 min. After a new partition is added to table T, we should call refreshTable 
> to clear T’s entry in cachedDataSourceTables so that the new partition becomes 
> searchable. 
> For a table with many partitions and files (far more than 
> spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table 
> T will start a job to fetch all FileStatus objects in the listLeafFiles 
> function. Because of the huge number of files, the job runs for several 
> seconds; during that time, new queries of table T will also start new jobs to 
> fetch FileStatus, because getCached is not thread safe. This finally causes a 
> driver OOM.






[jira] [Assigned] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM

2016-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18703:


Assignee: Apache Spark

> Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not 
> Dropped Until Normal Termination of JVM
> --
>
> Key: SPARK-18703
> URL: https://issues.apache.org/jira/browse/SPARK-18703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>
> Below are the files/directories generated for three inserts against a Hive 
> table:
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> The first 18 files are temporary. We do not drop them until JVM termination. 
> If the JVM does not terminate normally, these temporary files/directories 
> will not be dropped.
> Only the last two files are needed, as shown below.
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> Ideally, we should drop the created 

[jira] [Commented] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM

2016-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719433#comment-15719433
 ] 

Apache Spark commented on SPARK-18703:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16134

> Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not 
> Dropped Until Normal Termination of JVM
> --
>
> Key: SPARK-18703
> URL: https://issues.apache.org/jira/browse/SPARK-18703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Priority: Critical
>
> Below are the files/directories generated for three inserts against a Hive 
> table:
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> The first 18 files are temporary. We do not drop them until JVM termination. 
> If the JVM does not terminate normally, these temporary files/directories 
> will not be dropped.
> Only the last two files are needed, as shown below.
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> 

[jira] [Assigned] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM

2016-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18703:


Assignee: (was: Apache Spark)

> Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not 
> Dropped Until Normal Termination of JVM
> --
>
> Key: SPARK-18703
> URL: https://issues.apache.org/jira/browse/SPARK-18703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Priority: Critical
>
> Below are the files/directories generated for three inserts against a Hive 
> table:
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> The first 18 files are temporary. We do not drop them until JVM termination. 
> If the JVM does not terminate normally, these temporary files/directories 
> will not be dropped.
> Only the last two files are needed, as shown below.
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> Ideally, we should drop the created staging files and 

[jira] [Updated] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM

2016-12-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18703:

Description: 
Below are the files/directories generated for three inserts against a Hive 
table:
{noformat}
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
{noformat}

The first 18 files are temporary. We do not drop them until JVM termination. If 
the JVM does not terminate normally, these temporary files/directories will not 
be dropped.

Only the last two files are needed, as shown below.
{noformat}
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
{noformat}

Ideally, we should drop the created staging files and temporary data files 
after each insert/CTAS. The temporary files/directories can accumulate quickly 
when we issue many inserts, since each insert generates at least six files. 
This can consume a lot of space and slow down JVM termination.

  was:
Below are the files/directories for three inserts against a Hive table:
{noformat}
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1

[jira] [Created] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM

2016-12-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-18703:
---

 Summary: Insertion/CTAS against Hive Tables: Staging Directories 
and Data Files Not Dropped Until Normal Termination of JVM
 Key: SPARK-18703
 URL: https://issues.apache.org/jira/browse/SPARK-18703
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Xiao Li
Priority: Critical


Below are the files/directories for three inserts against a Hive table:
{noformat}
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
{noformat}

Ideally, we should drop the created staging files and temporary data files 
after each insert. The temporary files/directories can accumulate quickly when 
we issue many inserts, since each insert generates at least six files. This 
can consume a lot of space and slow down JVM termination.
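
For illustration, a hedged sketch of the intended behaviour (not the actual patch): 
delete the per-statement staging directory as soon as the insert/CTAS finishes, instead 
of relying on a JVM shutdown hook. {{stagingDir}} stands for the 
".hive-staging_hive_..." path created for the statement.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: recursively remove the staging directory once the statement has committed.
def dropStagingDir(stagingDir: Path, hadoopConf: Configuration): Unit = {
  val fs = stagingDir.getFileSystem(hadoopConf)
  if (fs.exists(stagingDir)) {
    // recursive = true also removes the -ext-1, _SUCCESS and part files listed above
    fs.delete(stagingDir, true)
  }
}
{code}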






[jira] [Assigned] (SPARK-18702) input_file_block_start and input_file_block_length function

2016-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18702:


Assignee: Reynold Xin  (was: Apache Spark)

> input_file_block_start and input_file_block_length function
> ---
>
> Key: SPARK-18702
> URL: https://issues.apache.org/jira/browse/SPARK-18702
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently have function input_file_name to get the path of the input file, 
> but don't have functions to get the block start offset and length. This patch 
> introduces two functions:
> 1. input_file_block_start: returns the file block start offset, or -1 if not 
> available.
> 2. input_file_block_length: returns the file block length, or -1 if not 
> available.






[jira] [Commented] (SPARK-18702) input_file_block_start and input_file_block_length function

2016-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719375#comment-15719375
 ] 

Apache Spark commented on SPARK-18702:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16133

> input_file_block_start and input_file_block_length function
> ---
>
> Key: SPARK-18702
> URL: https://issues.apache.org/jira/browse/SPARK-18702
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently have function input_file_name to get the path of the input file, 
> but don't have functions to get the block start offset and length. This patch 
> introduces two functions:
> 1. input_file_block_start: returns the file block start offset, or -1 if not 
> available.
> 2. input_file_block_length: returns the file block length, or -1 if not 
> available.






[jira] [Assigned] (SPARK-18702) input_file_block_start and input_file_block_length function

2016-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18702:


Assignee: Apache Spark  (was: Reynold Xin)

> input_file_block_start and input_file_block_length function
> ---
>
> Key: SPARK-18702
> URL: https://issues.apache.org/jira/browse/SPARK-18702
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We currently have function input_file_name to get the path of the input file, 
> but don't have functions to get the block start offset and length. This patch 
> introduces two functions:
> 1. input_file_block_start: returns the file block start offset, or -1 if not 
> available.
> 2. input_file_block_length: returns the file block length, or -1 if not 
> available.






[jira] [Created] (SPARK-18702) input_file_block_start and input_file_block_length function

2016-12-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18702:
---

 Summary: input_file_block_start and input_file_block_length 
function
 Key: SPARK-18702
 URL: https://issues.apache.org/jira/browse/SPARK-18702
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


We currently have the function input_file_name to get the path of the input file, 
but we don't have functions to get the block start offset and length. This patch 
introduces two functions:

1. input_file_block_start: returns the file block start offset, or -1 if not 
available.

2. input_file_block_length: returns the file block length, or -1 if not 
available.
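
A hedged usage sketch once the two functions exist (names as proposed above). It assumes 
a SparkSession named {{spark}} and some file-based table, here a hypothetical "logs" 
table; the new functions are invoked through expr() purely for illustration.

{code}
import org.apache.spark.sql.functions.{expr, input_file_name}

spark.table("logs")
  .select(
    input_file_name().as("file"),
    expr("input_file_block_start()").as("block_start"),
    expr("input_file_block_length()").as("block_length"))
  .show(5, truncate = false)
{code}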






[jira] [Updated] (SPARK-18681) Throw Filtering is supported only on partition keys of type string exception

2016-12-03 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-18681:

Description: 
Cloudera puts 
{{/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml}} 
as the configuration file for the Hive Metastore Server, where 
{{hive.metastore.try.direct.sql=false}}. But Spark does not read this 
configuration file and gets the default value 
{{hive.metastore.try.direct.sql=true}}. We should use the {{getMetaConf}} or 
{{getMSC.getConfigValue}} method to obtain the original configuration from the 
Hive Metastore Server.

{noformat}
spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT);
Time taken: 0.221 seconds
spark-sql> select * from test where part=1 limit 10;
16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
test where part=1 limit 10]
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938)
at 
org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:133)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at 

[jira] [Commented] (SPARK-18701) Poisson GLM fails due to wrong initialization

2016-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719212#comment-15719212
 ] 

Apache Spark commented on SPARK-18701:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16131

> Poisson GLM fails due to wrong initialization
> -
>
> Key: SPARK-18701
> URL: https://issues.apache.org/jira/browse/SPARK-18701
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Priority: Critical
> Fix For: 2.2.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Poisson GLM fails for many standard data sets. The issue is incorrect 
> initialization leading to almost zero probability and weights. The following 
> simple example reproduces the error. 
> {code:borderStyle=solid}
> val datasetPoissonLogWithZero = Seq(
>   LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>   LabeledPoint(1.0, Vectors.dense(12, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>   LabeledPoint(1.0, Vectors.dense(16, 1.0)),
>   LabeledPoint(0.0, Vectors.dense(10, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(12, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(12, 2.0)),
>   LabeledPoint(1.0, Vectors.dense(12, 2.0))
> ).toDF()
> 
> val glr = new GeneralizedLinearRegression()
>   .setFamily("poisson")
>   .setLink("log")
>   .setMaxIter(20)
>   .setRegParam(0)
> val model = glr.fit(datasetPoissonLogWithZero)
> {code}
> The issue is in the initialization:  the mean is initialized as the response, 
> which could be zero. Applying the log link results in very negative numbers 
> (protected against -Inf), which again leads to close to zero probability and 
> weights in the weighted least squares. The fix is easy: just add a small 
> constant, highlighted in red below. 
>  
> override def initialize(y: Double, weight: Double): Double = {
>   require(y >= 0.0, "The response variable of Poisson family " +
> s"should be non-negative, but got $y")
>   y {color:red}+ 0.1 {color}
> }
> I already have a fix and test code. Will create a PR. 






[jira] [Assigned] (SPARK-18701) Poisson GLM fails due to wrong initialization

2016-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18701:


Assignee: Apache Spark

> Poisson GLM fails due to wrong initialization
> -
>
> Key: SPARK-18701
> URL: https://issues.apache.org/jira/browse/SPARK-18701
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Apache Spark
>Priority: Critical
> Fix For: 2.2.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Poisson GLM fails for many standard data sets. The issue is incorrect 
> initialization leading to almost zero probability and weights. The following 
> simple example reproduces the error. 
> {code:borderStyle=solid}
> val datasetPoissonLogWithZero = Seq(
>   LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>   LabeledPoint(1.0, Vectors.dense(12, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>   LabeledPoint(1.0, Vectors.dense(16, 1.0)),
>   LabeledPoint(0.0, Vectors.dense(10, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(12, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(12, 2.0)),
>   LabeledPoint(1.0, Vectors.dense(12, 2.0))
> ).toDF()
> 
> val glr = new GeneralizedLinearRegression()
>   .setFamily("poisson")
>   .setLink("log")
>   .setMaxIter(20)
>   .setRegParam(0)
> val model = glr.fit(datasetPoissonLogWithZero)
> {code}
> The issue is in the initialization:  the mean is initialized as the response, 
> which could be zero. Applying the log link results in very negative numbers 
> (protected against -Inf), which again leads to close to zero probability and 
> weights in the weighted least squares. The fix is easy: just add a small 
> constant, highlighted in red below. 
>  
> override def initialize(y: Double, weight: Double): Double = {
>   require(y >= 0.0, "The response variable of Poisson family " +
> s"should be non-negative, but got $y")
>   y {color:red}+ 0.1 {color}
> }
> I already have a fix and test code. Will create a PR. 






[jira] [Assigned] (SPARK-18701) Poisson GLM fails due to wrong initialization

2016-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18701:


Assignee: (was: Apache Spark)

> Poisson GLM fails due to wrong initialization
> -
>
> Key: SPARK-18701
> URL: https://issues.apache.org/jira/browse/SPARK-18701
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Priority: Critical
> Fix For: 2.2.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Poisson GLM fails for many standard data sets. The issue is incorrect 
> initialization leading to almost zero probability and weights. The following 
> simple example reproduces the error. 
> {code:borderStyle=solid}
> val datasetPoissonLogWithZero = Seq(
>   LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>   LabeledPoint(1.0, Vectors.dense(12, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>   LabeledPoint(1.0, Vectors.dense(16, 1.0)),
>   LabeledPoint(0.0, Vectors.dense(10, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(12, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(12, 2.0)),
>   LabeledPoint(1.0, Vectors.dense(12, 2.0))
> ).toDF()
> 
> val glr = new GeneralizedLinearRegression()
>   .setFamily("poisson")
>   .setLink("log")
>   .setMaxIter(20)
>   .setRegParam(0)
> val model = glr.fit(datasetPoissonLogWithZero)
> {code}
> The issue is in the initialization:  the mean is initialized as the response, 
> which could be zero. Applying the log link results in very negative numbers 
> (protected against -Inf), which again leads to close to zero probability and 
> weights in the weighted least squares. The fix is easy: just add a small 
> constant, highlighted in red below. 
>  
> override def initialize(y: Double, weight: Double): Double = {
>   require(y >= 0.0, "The response variable of Poisson family " +
> s"should be non-negative, but got $y")
>   y {color:red}+ 0.1 {color}
> }
> I already have a fix and test code. Will create a PR. 






[jira] [Created] (SPARK-18701) Poisson GLM fails due to wrong initialization

2016-12-03 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-18701:
---

 Summary: Poisson GLM fails due to wrong initialization
 Key: SPARK-18701
 URL: https://issues.apache.org/jira/browse/SPARK-18701
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.0.2
Reporter: Wayne Zhang
Priority: Critical
 Fix For: 2.2.0


Poisson GLM fails for many standard data sets. The issue is incorrect 
initialization leading to almost zero probability and weights. The following 
simple example reproduces the error. 

{code:borderStyle=solid}
val datasetPoissonLogWithZero = Seq(
  LabeledPoint(0.0, Vectors.dense(18, 1.0)),
  LabeledPoint(1.0, Vectors.dense(12, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(13, 2.0)),
  LabeledPoint(0.0, Vectors.dense(15, 1.0)),
  LabeledPoint(1.0, Vectors.dense(16, 1.0)),
  LabeledPoint(0.0, Vectors.dense(10, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(12, 2.0)),
  LabeledPoint(0.0, Vectors.dense(13, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(12, 2.0)),
  LabeledPoint(1.0, Vectors.dense(12, 2.0))
).toDF()

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setMaxIter(20)
  .setRegParam(0)

val model = glr.fit(datasetPoissonLogWithZero)
{code}

The issue is in the initialization:  the mean is initialized as the response, 
which could be zero. Applying the log link results in very negative numbers 
(protected against -Inf), which again leads to close to zero probability and 
weights in the weighted least squares. The fix is easy: just add a small 
constant, highlighted in red below. 
 

override def initialize(y: Double, weight: Double): Double = {
  require(y >= 0.0, "The response variable of Poisson family " +
s"should be non-negative, but got $y")
  y {color:red}+ 0.1 {color}
}

I already have a fix and test code. Will create a PR. 
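
A quick numeric illustration (plain Scala, not Spark code) of why the small constant 
matters under the log link:

{code}
math.log(0.0)        // -Infinity: a zero response gives an unusable starting value
math.log(0.0 + 0.1)  // about -2.30: finite, so the weighted-least-squares weights stay well-defined
{code}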






[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-12-03 Thread Sumesh Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719170#comment-15719170
 ] 

Sumesh Kumar commented on SPARK-18200:
--

Thanks much [~dongjoon]

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>Assignee: Dongjoon Hyun
>  Labels: graph, graphx
> Fix For: 2.0.3, 2.1.0
>
>
> Running GraphX triangle count on a large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6, the query completes without any 
> problems: http://bit.ly/2fATO1M
> Likewise, the GraphFrames version of this code runs fine (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stack Overflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-12-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719124#comment-15719124
 ] 

Dongjoon Hyun edited comment on SPARK-18200 at 12/4/16 2:15 AM:


Hi,
Yes, the bugs are there in 2.0.1.
The fix will be in the upcoming Apache Spark 2.0.3 and 2.1.0 releases.
We cannot backport it into 2.0.1 because that version has already been released.


was (Author: dongjoon):
Hi,

It will be in upcoming Apache Spark 2.0.3 and 2.1.0.
We cannot backport into 2.0.1 because it's already released.

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>Assignee: Dongjoon Hyun
>  Labels: graph, graphx
> Fix For: 2.0.3, 2.1.0
>
>
> Running GraphX triangle count on a large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6, the query completes without any 
> problems: http://bit.ly/2fATO1M
> Likewise, the GraphFrames version of this code runs fine (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stack Overflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-12-03 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719124#comment-15719124
 ] 

Dongjoon Hyun commented on SPARK-18200:
---

Hi,

It will be in the upcoming Apache Spark 2.0.3 and 2.1.0 releases.
We cannot backport it into 2.0.1 because that version has already been released.

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>Assignee: Dongjoon Hyun
>  Labels: graph, graphx
> Fix For: 2.0.3, 2.1.0
>
>
> Running GraphX triangle count on a large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6, the query completes without any 
> problems: http://bit.ly/2fATO1M
> Likewise, the GraphFrames version of this code runs fine (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stack Overflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-12-03 Thread Sumesh Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719081#comment-15719081
 ] 

Sumesh Kumar commented on SPARK-18200:
--

Does this issue currently exist in version 2.0.1? I just ran a test and it's 
throwing the following exception.

User class threw exception: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 3 in stage 10.0 failed 4 times, most recent failure: Lost 
task 3.3 in stage 10.0 (TID 196, BD-S2F13): java.lang.IllegalArgumentException: 
requirement failed: Invalid initial capacity
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.util.collection.OpenHashSet$mcJ$sp.<init>(OpenHashSet.scala:51)
at 
org.apache.spark.util.collection.OpenHashSet$mcJ$sp.<init>(OpenHashSet.scala:57)
at 
org.apache.spark.graphx.lib.TriangleCount$$anonfun$5.apply(TriangleCount.scala:70)
at 
org.apache.spark.graphx.lib.TriangleCount$$anonfun$5.apply(TriangleCount.scala:69)
at 
org.apache.spark.graphx.impl.VertexPartitionBaseOps.map(VertexPartitionBaseOps.scala:61)
at 
org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$mapValues$2.apply(VertexRDDImpl.scala:102)
at 
org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$mapValues$2.apply(VertexRDDImpl.scala:102)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156)
at 
org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:154)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
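
For context, a hedged sketch (hypothetical edge-list path, not the actual job) of the kind of triangle-count call that produces the stack trace above on affected 2.0.x versions:

{code}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Hypothetical reproduction: load an edge list and run triangleCount().
// triangleCount() internally sizes OpenHashSet instances, which is where the
// "Invalid initial capacity" requirement fails on affected 2.0.x versions.
val graph = GraphLoader.edgeListFile(sc, "/path/to/edges.txt")
  .partitionBy(PartitionStrategy.RandomVertexCut)
val triangleCounts = graph.triangleCount().vertices
{code}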

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>Assignee: Dongjoon Hyun
>  Labels: graph, graphx
> Fix For: 2.0.3, 2.1.0
>
>
> Running GraphX triangle count on a large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6, the query completes without any 
> problems: http://bit.ly/2fATO1M
> Likewise, the GraphFrames version of this code runs fine (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stack Overflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-12-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18081.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 15795
[https://github.com/apache/spark/pull/15795]

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18581.
---
Resolution: Not A Problem

> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probabilities much larger 
> than 1. That led me to the fact that the value returned by 
> MultivariateGaussian.pdf can be on the order of 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of 
> the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive 
> definite matrix. 
> In my case, I have a feature = 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance 
> matrix) = 0 => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which finally 
> makes logpdf very big, and pdf will be even bigger, > 1.
> As said in the comments of MultivariateGaussian.scala:
> """
> Singular values are considered to be non-zero only if they exceed a tolerance 
> based on machine precision.
> """
> But if a singular value is considered to be zero, that means the covariance 
> matrix is not invertible, which contradicts the assumption that it should 
> be invertible.
> So we should check whether any singular value is smaller than the tolerance 
> before computing the pseudo-determinant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2016-12-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718790#comment-15718790
 ] 

Reynold Xin commented on SPARK-8007:


spark_partition_id() is available in PySpark starting with 1.6. It's in 
pyspark.sql.functions.spark_partition_id.
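
For reference, the equivalent in Scala (a minimal sketch, assuming an existing DataFrame named df):

{code}
import org.apache.spark.sql.functions.spark_partition_id

// Group rows by their physical partition id to inspect data skew.
df.groupBy(spark_partition_id()).count().show()
{code}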


> Support resolving virtual columns in DataFrames
> ---
>
> Key: SPARK-8007
> URL: https://issues.apache.org/jira/browse/SPARK-8007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Joseph Batchik
>
> Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
> SparkPartitionID expression.
> A cool use case is to understand physical data skew:
> {code}
> df.groupBy("SPARK__PARTITION__ID").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-12-03 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718706#comment-15718706
 ] 

Hao Ren commented on SPARK-18581:
-

I checked several (mu, sigma) pairs in R.
The package I used is mvtnorm.
The numerical difference in the pdf between MLlib and R is negligible, no matter 
whether the sigma is invertible or (near-)singular.
Hence, there is no problem here.

Here is my code: https://gist.github.com/invkrh/2a5422c01a3c3a063f504f1f099cbdae
which can generate R code for cross-checking.
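
A small numeric sketch of why a density above 1 is not by itself a bug (hypothetical numbers): a continuous pdf can legitimately exceed 1 when the variance is tiny.

{code}
// 1-D Gaussian density at the mean: 1 / (sigma * sqrt(2 * pi)).
// With a very small sigma the value is far above 1, yet it is a valid density.
val sigma = 0.001
val pdfAtMean = 1.0 / (sigma * math.sqrt(2.0 * math.Pi))
println(f"pdf at mean with sigma=$sigma: $pdfAtMean%.1f")  // about 398.9
{code}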

> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probabilities much larger 
> than 1. That led me to the fact that the value returned by 
> MultivariateGaussian.pdf can be on the order of 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of 
> the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive 
> definite matrix. 
> In my case, I have a feature = 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance 
> matrix) = 0 => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which finally 
> makes logpdf very big, and pdf will be even bigger, > 1.
> As said in the comments of MultivariateGaussian.scala:
> """
> Singular values are considered to be non-zero only if they exceed a tolerance 
> based on machine precision.
> """
> But if a singular value is considered to be zero, that means the covariance 
> matrix is not invertible, which contradicts the assumption that it should 
> be invertible.
> So we should check whether any singular value is smaller than the tolerance 
> before computing the pseudo-determinant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18582) Whitelist LogicalPlan operators allowed in correlated subqueries

2016-12-03 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18582.
---
   Resolution: Fixed
 Assignee: Nattavut Sutyanyong
Fix Version/s: 2.1.0

> Whitelist LogicalPlan operators allowed in correlated subqueries
> 
>
> Key: SPARK-18582
> URL: https://issues.apache.org/jira/browse/SPARK-18582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Nattavut Sutyanyong
> Fix For: 2.1.0
>
>
> We want to tighten the code that handles correlated subquery to whitelist 
> operators that are allowed in it.
> The current code in {{def pullOutCorrelatedPredicates}} looks like
> {code}
>   // Simplify the predicates before pulling them out.
>   val transformed = BooleanSimplification(sub) transformUp {
> case f @ Filter(cond, child) => ...
> case p @ Project(expressions, child) => ...
> case a @ Aggregate(grouping, expressions, child) => ...
> case w : Window => ...
> case j @ Join(left, _, RightOuter, _) => ...
> case j @ Join(left, right, FullOuter, _) => ...
> case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ...
> case u: Union => ...
> case s: SetOperation => ...
> case e: Expand => ...
> case l : LocalLimit => ...
> case g : GlobalLimit => ...
> case s : Sample => ...
> case p =>
>   failOnOuterReference(p)
>   ...
>   }
> {code}
> The code disallows operators in a sub-plan of an operator hosting correlation 
> on a case-by-case basis. As it is today, it only blocks {{Union}}, 
> {{Intersect}}, {{Except}}, {{Expand}}, {{LocalLimit}}, {{GlobalLimit}}, 
> {{Sample}}, {{FullOuter}}, and the right table of {{LeftOuter}} (and the left 
> table of {{RightOuter}}). That means any {{LogicalPlan}} operators that are 
> not in the list above are permitted to be under a correlation point. Is this 
> risky? There are many (30+ at least from browsing the {{LogicalPlan}} type 
> hierarchy) operators derived from the {{LogicalPlan}} class.
> For the case of {{ScalarSubquery}}, it explicitly checks that only 
> {{SubqueryAlias}}, {{Project}}, {{Filter}}, and {{Aggregate}} are allowed 
> ({{CheckAnalysis.scala}} around lines 126-165, in and after {{def 
> cleanQuery}}). We should whitelist which operators are allowed in correlated 
> subqueries. At first glance, we should allow, in addition to the ones 
> allowed in {{ScalarSubquery}}: {{Join}}, {{Distinct}}, {{Sort}}.
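
As an aside, a hedged sketch of the whitelisting idea described above (checkCorrelatedPlan is a made-up helper, not the actual CheckAnalysis code):

{code}
import org.apache.spark.sql.catalyst.plans.logical._

// Enumerate the operators allowed under a correlation point and fail on the
// rest, instead of blocking disallowed operators case by case.
def checkCorrelatedPlan(plan: LogicalPlan): Unit = plan.foreach {
  case _: SubqueryAlias | _: Project | _: Filter | _: Aggregate |
       _: Join | _: Distinct | _: Sort => // allowed under a correlation point
  case other =>
    throw new IllegalArgumentException(
      s"Operator ${other.nodeName} is not allowed in a correlated subquery")
}
{code}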



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8007) Support resolving virtual columns in DataFrames

2016-12-03 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718615#comment-15718615
 ] 

Ruslan Dautkhanov edited comment on SPARK-8007 at 12/3/16 7:34 PM:
---

Is {noformat}spark__partition__id{noformat} available in PySpark too? Can't 
find a way to run the same code in PySpark.


was (Author: tagar):
Is spark__partition__id available in PySpark too? Can't find a way to run the 
same code in PySpark.

> Support resolving virtual columns in DataFrames
> ---
>
> Key: SPARK-8007
> URL: https://issues.apache.org/jira/browse/SPARK-8007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Joseph Batchik
>
> Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
> SparkPartitionID expression.
> A cool use case is to understand physical data skew:
> {code}
> df.groupBy("SPARK__PARTITION__ID").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2016-12-03 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718615#comment-15718615
 ] 

Ruslan Dautkhanov commented on SPARK-8007:
--

Is spark__partition__id available in PySpark too? Can't find a way to run the 
same code in PySpark.

> Support resolving virtual columns in DataFrames
> ---
>
> Key: SPARK-8007
> URL: https://issues.apache.org/jira/browse/SPARK-8007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Joseph Batchik
>
> Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
> SparkPartitionID expression.
> A cool use case is to understand physical data skew:
> {code}
> df.groupBy("SPARK__PARTITION__ID").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM

2016-12-03 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718580#comment-15718580
 ] 

Li Yuanjian commented on SPARK-18700:
-

I'll add a PR for this soon: add a ReadWriteLock for each table's relation in the 
cache, not for the whole cachedDataSourceTables.
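
To make the idea concrete, here is a minimal sketch of per-key locking (an illustration of the approach with made-up names, not the actual HiveMetastoreCatalog code):

{code}
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantReadWriteLock

// Each table gets its own ReadWriteLock, so queries of different tables never
// block each other, while concurrent queries of the same table compute its
// cached relation only once.
class PerTableCache[K, V <: AnyRef] {
  private val locks = new ConcurrentHashMap[K, ReentrantReadWriteLock]()
  private val cache = new ConcurrentHashMap[K, V]()

  private def lockFor(key: K): ReentrantReadWriteLock = {
    val existing = locks.get(key)
    if (existing != null) existing
    else {
      val created = new ReentrantReadWriteLock()
      val raced = locks.putIfAbsent(key, created)
      if (raced != null) raced else created
    }
  }

  def getOrLoad(key: K)(load: => V): V = {
    val lock = lockFor(key)
    lock.readLock().lock()
    val cached = try cache.get(key) finally lock.readLock().unlock()
    if (cached != null) cached
    else {
      lock.writeLock().lock()
      try {
        val again = cache.get(key)
        if (again != null) again
        else { val loaded = load; cache.put(key, loaded); loaded }
      } finally lock.writeLock().unlock()
    }
  }
}
{code}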


> getCached in HiveMetastoreCatalog not thread safe cause driver OOM
> --
>
> Key: SPARK-18700
> URL: https://issues.apache.org/jira/browse/SPARK-18700
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Li Yuanjian
>
> In our Spark SQL platform, each query uses the same HiveContext and an 
> independent thread, and new data is appended to tables as new partitions 
> every 30 min. After a new partition is added to table T, we should call 
> refreshTable to clear T’s cache in cachedDataSourceTables to make the new 
> partition searchable. 
> For a table with many partitions and files (far more than 
> spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table 
> T will start a job to fetch all FileStatus entries in the listLeafFiles 
> function. Because of the huge number of files, the job runs for several 
> seconds; during that time, new queries of table T will also start new jobs to 
> fetch FileStatus because getCached is not thread safe. This finally causes a 
> driver OOM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM

2016-12-03 Thread Li Yuanjian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian updated SPARK-18700:

Description: 
In our Spark SQL platform, each query uses the same HiveContext and an 
independent thread, and new data is appended to tables as new partitions every 
30 min. After a new partition is added to table T, we should call refreshTable 
to clear T’s cache in cachedDataSourceTables to make the new partition 
searchable. 
For a table with many partitions and files (far more than 
spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table T 
will start a job to fetch all FileStatus entries in the listLeafFiles function. 
Because of the huge number of files, the job runs for several seconds; during 
that time, new queries of table T will also start new jobs to fetch FileStatus 
because getCached is not thread safe. This finally causes a driver OOM.

  was:
In our spark sql platform, each query use same HiveContext and independent 
thread, new data will append to tables as new partitions every 30min. After a 
new partition added to table T, we should call refreshTable to clear T’s cache 
in cachedDataSourceTables
to make the new partition searchable. 
For the table have more partitions and files(much bigger than 
spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table T 
will start a job to fetch all FileStatus in listLeafFiles function. Because of 
the huge number of files, the job will run several seconds, during the time, 
new queries of table T will also start new jobs to fetch FileStatus because of 
the function of getCache is not thread safe. Final cause a driver OOM.


> getCached in HiveMetastoreCatalog not thread safe cause driver OOM
> --
>
> Key: SPARK-18700
> URL: https://issues.apache.org/jira/browse/SPARK-18700
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Li Yuanjian
>
> In our Spark SQL platform, each query uses the same HiveContext and an 
> independent thread, and new data is appended to tables as new partitions 
> every 30 min. After a new partition is added to table T, we should call 
> refreshTable to clear T’s cache in cachedDataSourceTables to make the new 
> partition searchable. 
> For a table with many partitions and files (far more than 
> spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table 
> T will start a job to fetch all FileStatus entries in the listLeafFiles 
> function. Because of the huge number of files, the job runs for several 
> seconds; during that time, new queries of table T will also start new jobs to 
> fetch FileStatus because getCached is not thread safe. This finally causes a 
> driver OOM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM

2016-12-03 Thread Li Yuanjian (JIRA)
Li Yuanjian created SPARK-18700:
---

 Summary: getCached in HiveMetastoreCatalog not thread safe cause 
driver OOM
 Key: SPARK-18700
 URL: https://issues.apache.org/jira/browse/SPARK-18700
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 1.6.1
Reporter: Li Yuanjian


In our Spark SQL platform, each query uses the same HiveContext and an 
independent thread, and new data is appended to tables as new partitions every 
30 min. After a new partition is added to table T, we should call refreshTable 
to clear T’s cache in cachedDataSourceTables to make the new partition 
searchable. 
For a table with many partitions and files (far more than 
spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table T 
will start a job to fetch all FileStatus entries in the listLeafFiles function. 
Because of the huge number of files, the job runs for several seconds; during 
that time, new queries of table T will also start new jobs to fetch FileStatus 
because getCached is not thread safe. This finally causes a driver OOM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18696) Upgrade sbt plugins

2016-12-03 Thread Weiqing Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718550#comment-15718550
 ] 

Weiqing Yang commented on SPARK-18696:
--

Oh, yes, thanks for closing this.

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18696
> URL: https://issues.apache.org/jira/browse/SPARK-18696
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Priority: Minor
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbt-assembly: 0.11.2 -> 0.14.3
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18697) Upgrade sbt plugins

2016-12-03 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang updated SPARK-18697:
-
Target Version/s:   (was: 2.2.0)

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Priority: Trivial
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbt-assembly: 0.11.2 -> 0.14.3
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-03 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718478#comment-15718478
 ] 

Anirudh Ramanathan commented on SPARK-18278:


There is a way to use a standard image that already exists (say ubuntu) and 
download the distribution and dependencies onto it prior to running drivers and 
executors. I explored this initially but even if this were allowed for, it's 
not likely to be used much. 

From talking to people looking to use Spark on Kubernetes, it appears that 
they'd prefer either an official image or building their own image containing 
the distribution and application-jars. 



> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executors lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-03 Thread Jakub Nowacki (JIRA)
Jakub Nowacki created SPARK-18699:
-

 Summary: Spark CSV parsing types other than String throws 
exception when malformed
 Key: SPARK-18699
 URL: https://issues.apache.org/jira/browse/SPARK-18699
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Jakub Nowacki


If a CSV file is read and the schema contains any type other than String, an 
exception is thrown when a string value in the CSV is malformed; e.g. if a 
timestamp does not match the defined format, an exception like the following is 
thrown:
{code}
Caused by: java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
at scala.util.Try.getOrElse(Try.scala:79)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
... 8 more
{code}

It behaves similarly with Integer and Long types, from what I've seen.

To my understanding, the PERMISSIVE and DROPMALFORMED modes should just null the 
value or drop the line, but instead they kill the job.
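
For reference, a minimal sketch of the kind of read that hits this (the path and schema below are assumptions, not taken from the report):

{code}
import org.apache.spark.sql.types._

// Explicit non-String schema plus PERMISSIVE mode; the expectation described
// above is that malformed values become null instead of failing the job.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("eventTime", TimestampType, nullable = true)))

val df = spark.read
  .schema(schema)
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .csv("/path/to/events.csv")  // hypothetical path
{code}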



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-03 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718249#comment-15718249
 ] 

Erik Erlandson commented on SPARK-18278:


Not publishing images puts users in the position of not being able to run this 
out-of-the-box.  First they would have to either build images themselves, or 
find somebody else's 3rd-party images, etc.  It doesn't seem like it would make 
for good UX.

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executors lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-03 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718240#comment-15718240
 ] 

Erik Erlandson commented on SPARK-18278:


A possible scheme might be to publish the docker-files, but not actually build 
the images.   It seems more standard to actually publish images for the 
community.   Is there some reason for not wanting to do that?

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executors lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18698) public constructor with uid for IndexToString-class

2016-12-03 Thread Bjoern Toldbod (JIRA)
Bjoern Toldbod created SPARK-18698:
--

 Summary: public constructor with uid for IndexToString-class
 Key: SPARK-18698
 URL: https://issues.apache.org/jira/browse/SPARK-18698
 Project: Spark
  Issue Type: Wish
  Components: ML
Affects Versions: 2.0.2
Reporter: Bjoern Toldbod
Priority: Minor


The IndexToString class in org.apache.spark.ml.feature does not provide a 
public constructor which takes a uid string.

It would be nice to have such a constructor.

(Generally, being able to name pipeline stages makes it much easier to work with 
complex models.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18020) Kinesis receiver does not snapshot when shard completes

2016-12-03 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717915#comment-15717915
 ] 

Takeshi Yamamuro commented on SPARK-18020:
--

I'm currently looking into this issue.
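
For reference, a hedged sketch against the KCL 1.x record-processor interface of the checkpoint-at-shard-end contract described in the issue below (an illustration only, not the actual KinesisRecordProcessor code):

{code}
import java.util.{List => JList}
import com.amazonaws.services.kinesis.clientlibrary.interfaces.{IRecordProcessor, IRecordProcessorCheckpointer}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason
import com.amazonaws.services.kinesis.model.Record

class CheckpointAtShardEnd extends IRecordProcessor {
  override def initialize(shardId: String): Unit = ()

  override def processRecords(records: JList[Record],
      checkpointer: IRecordProcessorCheckpointer): Unit = ()

  override def shutdown(checkpointer: IRecordProcessorCheckpointer,
      reason: ShutdownReason): Unit = {
    // TERMINATE means the shard has ended (split/merge); the KCL expects a
    // checkpoint here so it can record SHARD_END and start the child shards.
    if (reason == ShutdownReason.TERMINATE) {
      checkpointer.checkpoint()
    }
  }
}
{code}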

> Kinesis receiver does not snapshot when shard completes
> ---
>
> Key: SPARK-18020
> URL: https://issues.apache.org/jira/browse/SPARK-18020
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Yonathan Randolph
>Priority: Minor
>  Labels: kinesis
>
> When a kinesis shard is split or combined and the old shard ends, the Amazon 
> Kinesis Client library [calls 
> IRecordProcessor.shutdown|https://github.com/awslabs/amazon-kinesis-client/blob/v1.7.0/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/ShutdownTask.java#L100]
>  and expects that {{IRecordProcessor.shutdown}} must checkpoint the sequence 
> number {{ExtendedSequenceNumber.SHARD_END}} before returning. Unfortunately, 
> spark’s 
> [KinesisRecordProcessor|https://github.com/apache/spark/blob/v2.0.1/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala]
>  sometimes does not checkpoint SHARD_END. This results in an error message, 
> and spark is then blocked indefinitely from processing any items from the 
> child shards.
> This issue has also been raised on StackOverflow: [resharding while spark 
> running on kinesis 
> stream|http://stackoverflow.com/questions/38898691/resharding-while-spark-running-on-kinesis-stream]
> Exception that is logged:
> {code}
> 16/10/19 19:37:49 ERROR worker.ShutdownTask: Application exception. 
> java.lang.IllegalArgumentException: Application didn't checkpoint at end of 
> shard shardId-0030
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:106)
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Command used to split shard:
> {code}
> aws kinesis --region us-west-1 split-shard --stream-name my-stream 
> --shard-to-split shardId-0030 --new-starting-hash-key 
> 5316911983139663491615228241121378303
> {code}
> After the spark-streaming job has hung, examining the DynamoDB table 
> indicates that the parent shard processor has not reached 
> {{ExtendedSequenceNumber.SHARD_END}} and the child shards are still at 
> {{ExtendedSequenceNumber.TRIM_HORIZON}} waiting for the parent to finish:
> {code}
> aws kinesis --region us-west-1 describe-stream --stream-name my-stream
> {
> "StreamDescription": {
> "RetentionPeriodHours": 24, 
> "StreamName": "my-stream", 
> "Shards": [
> {
> "ShardId": "shardId-0030", 
> "HashKeyRange": {
> "EndingHashKey": 
> "10633823966279326983230456482242756606", 
> "StartingHashKey": "0"
> },
> ...
> }, 
> {
> "ShardId": "shardId-0062", 
> "HashKeyRange": {
> "EndingHashKey": "5316911983139663491615228241121378302", 
> "StartingHashKey": "0"
> }, 
> "ParentShardId": "shardId-0030", 
> "SequenceNumberRange": {
> "StartingSequenceNumber": 
> "49566806087883755242230188435465744452396445937434624994"
> }
> }, 
> {
> "ShardId": "shardId-0063", 
> "HashKeyRange": {
> "EndingHashKey": 
> "10633823966279326983230456482242756606", 
> "StartingHashKey": "5316911983139663491615228241121378303"
> }, 
> "ParentShardId": "shardId-0030", 
> "SequenceNumberRange": {
> "StartingSequenceNumber": 
> "49566806087906055987428719058607280170669094298940605426"
> }
> },
> ...
> ],
> "StreamStatus": "ACTIVE"
> }
> }
> aws dynamodb --region us-west-1 scan --table-name my-processor
> {
> "Items": [
> {
> "leaseOwner": {
> "S": 

[jira] [Commented] (SPARK-18697) Upgrade sbt plugins

2016-12-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717895#comment-15717895
 ] 

Sean Owen commented on SPARK-18697:
---

I merged SPARK-18696, but just to master. Let's do that to be more conservative.

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Priority: Trivial
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbt-assembly: 0.11.2 -> 0.14.3
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18638) Upgrade sbt, zinc and maven plugins

2016-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18638.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16069
[https://github.com/apache/spark/pull/16069]

> Upgrade sbt, zinc and maven plugins
> ---
>
> Key: SPARK-18638
> URL: https://issues.apache.org/jira/browse/SPARK-18638
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Priority: Minor
> Fix For: 2.2.0
>
>
> v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date and 
> upgrade it from 0.13.11 to 0.13.13. The release notes since the last version 
> we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and 
> https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some 
> regression fixes. This jira will also update Zinc and Maven plugins.
> {code}
>sbt: 0.13.11 -> 0.13.13,
>zinc: 0.3.9 -> 0.3.11,
>maven-assembly-plugin: 2.6 -> 3.0.0
>maven-compiler-plugin: 3.5.1 -> 3.6.
>maven-jar-plugin: 2.6 -> 3.0.2
>maven-javadoc-plugin: 2.10.3 -> 2.10.4
>maven-source-plugin: 2.4 -> 3.0.1
>org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12
>org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18638) Upgrade sbt, zinc and maven plugins

2016-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18638:
--
Assignee: Weiqing Yang

> Upgrade sbt, zinc and maven plugins
> ---
>
> Key: SPARK-18638
> URL: https://issues.apache.org/jira/browse/SPARK-18638
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Assignee: Weiqing Yang
>Priority: Minor
> Fix For: 2.2.0
>
>
> v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date and 
> upgrade it from 0.13.11 to 0.13.13. The release notes since the last version 
> we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and 
> https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some 
> regression fixes. This jira will also update Zinc and Maven plugins.
> {code}
>sbt: 0.13.11 -> 0.13.13,
>zinc: 0.3.9 -> 0.3.11,
>maven-assembly-plugin: 2.6 -> 3.0.0
>maven-compiler-plugin: 3.5.1 -> 3.6.
>maven-jar-plugin: 2.6 -> 3.0.2
>maven-javadoc-plugin: 2.10.3 -> 2.10.4
>maven-source-plugin: 2.4 -> 3.0.1
>org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12
>org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18697) Upgrade sbt plugins

2016-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18697:
--
Priority: Trivial  (was: Minor)

OK, it's a little arbitrary to update SBT, zinc, and Maven plugins, but then 
SBT plugins separately. I don't care much either way though. I also think it's 
fine to push this sort of update into 2.1.x

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Priority: Trivial
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbt-assembly: 0.11.2 -> 0.14.3
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18696) Upgrade sbt plugins

2016-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18696.
---
  Resolution: Duplicate
Target Version/s:   (was: 2.2.0)

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18696
> URL: https://issues.apache.org/jira/browse/SPARK-18696
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Priority: Minor
>
> For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt 
> plugins will be upgraded:
> {code}
> sbt-assembly: 0.11.2 -> 0.14.3
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18584) multiple Spark Thrift Servers running in the same machine throws org.apache.hadoop.security.AccessControlException

2016-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18584.
---
Resolution: Not A Problem

> multiple Spark Thrift Servers running in the same machine throws 
> org.apache.hadoop.security.AccessControlException
> --
>
> Key: SPARK-18584
> URL: https://issues.apache.org/jira/browse/SPARK-18584
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: hadoop-2.5.0-cdh5.2.1-och4.0.0
> spark2.0.2
>Reporter: tanxinz
>
> In Spark 2.0.2, I have two users (etl, dev) who start Spark Thrift Servers on 
> the same machine. I connected by beeline to the etl STS to execute a command, 
> and it threw org.apache.hadoop.security.AccessControlException. I don't know 
> why it is performed as the dev user, not etl.
> ```
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  Permission denied: user=dev, access=EXECUTE, 
> inode="/user/hive/warehouse/tb_spark_sts/etl_cycle_id=20161122":etl:supergroup:drwxr-x---,group:etl:rwx,group:oth_dev:rwx,default:user:data_mining:r-x,default:group::rwx,default:group:etl:rwx,default:group:oth_dev:rwx,default:mask::rwx,default:other::---
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkAccessAcl(DefaultAuthorizationProvider.java:335)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:231)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkTraverse(DefaultAuthorizationProvider.java:178)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:137)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6250)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3942)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:811)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getFileInfo(AuthorizationProviderProxyClientProtocol.java:502)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:815)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18685) Fix all tests in ExecutorClassLoaderSuite to pass on Windows

2016-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18685.
---
   Resolution: Fixed
Fix Version/s: 2.0.3
   2.1.1

Issue resolved by pull request 16116
[https://github.com/apache/spark/pull/16116]
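
For context on the failures quoted below, a minimal sketch (assumed path, not the actual test code) of the underlying URI problem: prefixing a Windows path with "file://" yields an invalid authority, while File.toURI builds a well-formed URI.

{code}
import java.io.File

val tempDir = "C:\\projects\\spark\\target\\tmp\\spark-00b66070"
// new java.net.URI("file://" + tempDir)  // URISyntaxException: Illegal character in authority
val uri = new File(tempDir).toURI         // file:/C:/projects/spark/target/tmp/spark-00b66070
println(uri)
{code}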

> Fix all tests in ExecutorClassLoaderSuite to pass on Windows
> 
>
> Key: SPARK-18685
> URL: https://issues.apache.org/jira/browse/SPARK-18685
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell, Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.1, 2.0.3
>
>
> There are two problems, as below: we should make the URI correct, and we 
> should close the {{BufferedSource}} from {{Source.fromInputStream}} after 
> opening it in the tests in {{ExecutorClassLoaderSuite}}. Currently, these are 
> leading to test failures on Windows.
> {code}
> ExecutorClassLoaderSuite:
> [info] - child first *** FAILED *** (78 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - parent first *** FAILED *** (15 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fall back *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fail *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resource from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resources from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> {code}
> {code}
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 
> 333 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
> [info]   at 
> org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-12-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717855#comment-15717855
 ] 

Sean Owen commented on SPARK-18581:
---

[~invkrh] do you think there's still a problem here?

> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training a GaussianMixtureModel, I found some probabilities much larger 
> than 1. That led me to the fact that the value returned by 
> MultivariateGaussian.pdf can be on the order of 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of 
> the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive 
> semi-definite matrix.
> In my case, one feature is 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance 
> matrix) = 0 => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which makes 
> logpdf very large and pdf even larger (> 1).
> As said in the comments of MultivariateGaussian.scala:
> """
> Singular values are considered to be non-zero only if they exceed a tolerance 
> based on machine precision.
> """
> But if a singular value is considered to be zero, the covariance matrix is 
> non-invertible, which contradicts the assumption that it should be invertible.
> So we should check whether any singular value is smaller than the tolerance 
> before computing the pseudo-determinant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18685) Fix all tests in ExecutorClassLoaderSuite to pass on Windows

2016-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18685:
--
Assignee: Hyukjin Kwon

> Fix all tests in ExecutorClassLoaderSuite to pass on Windows
> 
>
> Key: SPARK-18685
> URL: https://issues.apache.org/jira/browse/SPARK-18685
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell, Tests
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.0.3, 2.1.1
>
>
> There are two problems, as below:
> We should make the URI correct and close the {{BufferedSource}} returned by 
> {{Source.fromInputStream}} after opening it in the tests in 
> {{ExecutorClassLoaderSuite}}. Currently, these issues cause test failures 
> on Windows.
> {code}
> ExecutorClassLoaderSuite:
> [info] - child first *** FAILED *** (78 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - parent first *** FAILED *** (15 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fall back *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fail *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resource from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resources from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> {code}
> {code}
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 
> 333 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
> [info]   at 
> org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> ...
> {code}
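
For reference, a minimal sketch of the kind of fix described above, assuming 
illustrative names rather than the actual test code: build file URLs via 
java.io.File so Windows paths are escaped correctly, and close the 
{{BufferedSource}} after use.

{code}
import java.io.{File, InputStream}
import scala.io.Source

// "file://" + path produces an illegal authority for Windows paths such as
// C:\projects\...; File#toURI yields a well-formed file:/C:/... URL instead.
def fileUrl(dir: File) = dir.toURI.toURL

// Read a stream fully and make sure the BufferedSource gets closed.
def readAndClose(in: InputStream): String = {
  val source = Source.fromInputStream(in)
  try source.mkString finally source.close()
}
{code}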



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18586) netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 and CVE-2014-0193

2016-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18586:
--
Assignee: Sean Owen
Priority: Minor  (was: Major)

I don't think the CVEs actually affected Spark, as Netty 3 isn't directly used, 
but I updated it anyway.

> netty-3.8.0.Final.jar has vulnerability CVE-2014-3488  and CVE-2014-0193
> 
>
> Key: SPARK-18586
> URL: https://issues.apache.org/jira/browse/SPARK-18586
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: meiyoula
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18586) netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 and CVE-2014-0193

2016-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18586.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16102
[https://github.com/apache/spark/pull/16102]

> netty-3.8.0.Final.jar has vulnerability CVE-2014-3488  and CVE-2014-0193
> 
>
> Key: SPARK-18586
> URL: https://issues.apache.org/jira/browse/SPARK-18586
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: meiyoula
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18678) Skewed feature subsampling in Random forest

2016-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18678:


Assignee: (was: Apache Spark)

> Skewed feature subsampling in Random forest
> ---
>
> Key: SPARK-18678
> URL: https://issues.apache.org/jira/browse/SPARK-18678
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Bjoern Toldbod
>
> The feature subsampling in the RandomForest implementation in 
> org.apache.spark.ml.tree.impl.RandomForest
> is performed using SamplingUtils.reservoirSampleAndCount.
> The implementation of the sampling skews feature selection in favor of 
> features with a higher index.
> The skew is smaller for a large number of features, but it completely 
> dominates the feature selection for a small number of features. The extreme 
> case is when the number of features is 2 and the number of features to 
> select is 1.
> In this case the feature sampling will always pick feature 1 and ignore 
> feature 0.
> Of course this produces low-quality models for few features when 
> subsampling is used.
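
A self-contained way to check a sampler for this kind of skew (this is a plain 
Algorithm R reservoir sampler used for illustration, not Spark's SamplingUtils 
code): sampling 1 feature out of {0, 1} repeatedly should pick each feature 
about half the time.

{code}
import scala.util.Random

def reservoirSample(input: Iterator[Int], k: Int, rng: Random): Array[Int] = {
  val reservoir = new Array[Int](k)
  var i = 0
  while (input.hasNext) {
    val item = input.next()
    if (i < k) {
      reservoir(i) = item
    } else {
      // replace a kept element with probability k / (i + 1)
      val j = rng.nextInt(i + 1)
      if (j < k) reservoir(j) = item
    }
    i += 1
  }
  reservoir
}

val rng = new Random(42)
val counts = (1 to 10000)
  .map(_ => reservoirSample(Iterator(0, 1), 1, rng).head)
  .groupBy(identity)
  .mapValues(_.size)
println(counts)  // an unbiased sampler gives roughly Map(0 -> 5000, 1 -> 5000)
{code}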



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18678) Skewed feature subsampling in Random forest

2016-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717808#comment-15717808
 ] 

Apache Spark commented on SPARK-18678:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16129

> Skewed feature subsampling in Random forest
> ---
>
> Key: SPARK-18678
> URL: https://issues.apache.org/jira/browse/SPARK-18678
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Bjoern Toldbod
>
> The feature subsampling in the RandomForest implementation in 
> org.apache.spark.ml.tree.impl.RandomForest
> is performed using SamplingUtils.reservoirSampleAndCount.
> The implementation of the sampling skews feature selection in favor of 
> features with a higher index.
> The skew is smaller for a large number of features, but it completely 
> dominates the feature selection for a small number of features. The extreme 
> case is when the number of features is 2 and the number of features to 
> select is 1.
> In this case the feature sampling will always pick feature 1 and ignore 
> feature 0.
> Of course this produces low-quality models for few features when 
> subsampling is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18678) Skewed feature subsampling in Random forest

2016-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18678:


Assignee: Apache Spark

> Skewed feature subsampling in Random forest
> ---
>
> Key: SPARK-18678
> URL: https://issues.apache.org/jira/browse/SPARK-18678
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Bjoern Toldbod
>Assignee: Apache Spark
>
> The feature subsampling in the RandomForest implementation in 
> org.apache.spark.ml.tree.impl.RandomForest
> is performed using SamplingUtils.reservoirSampleAndCount.
> The implementation of the sampling skews feature selection in favor of 
> features with a higher index.
> The skew is smaller for a large number of features, but it completely 
> dominates the feature selection for a small number of features. The extreme 
> case is when the number of features is 2 and the number of features to 
> select is 1.
> In this case the feature sampling will always pick feature 1 and ignore 
> feature 0.
> Of course this produces low-quality models for few features when 
> subsampling is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18349) Update R API documentation on ml model summary

2016-12-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717726#comment-15717726
 ] 

Felix Cheung commented on SPARK-18349:
--

[~wangmiao1981] Please do, thanks! Since we have some questions, it would be 
great if you could propose the approach and we could discuss it a bit here.

> Update R API documentation on ml model summary
> --
>
> Key: SPARK-18349
> URL: https://issues.apache.org/jira/browse/SPARK-18349
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> It has been discovered that there is a fair bit of inconsistency in the 
> documentation of summary functions, e.g.
> {code}
> #' @return \code{summary} returns a summary object of the fitted model, a 
> list of components
> #' including formula, number of features, list of features, feature 
> importances, number of
> #' trees, and tree weights
> setMethod("summary", signature(object = "GBTRegressionModel")
> {code}
> For instance, what should be listed for the return value? Should it be a name 
> or a phrase, or should it be a list of items? And should there be a longer 
> description of what they mean, or a reference link to the Scala doc?
> We will need to review this for all model summary implementations in mllib.R.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-12-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717723#comment-15717723
 ] 

Felix Cheung commented on SPARK-17822:
--

From what [~josephkb] observed and described, I suspect this is a case of 
small pointers in R holding larger memory/classes in the JVM.

If the memory footprint of the pointer in R is very small, chances are that 
even after thousands of iterations the memory consumption in R is still not 
high enough to trigger a GC to reclaim them. If we have a repro, calling gc() 
or gcinfo(TRUE) should tell us about memory consumption as it grows.

I'm not sure about the previous attempt to mitigate this with WeakReference, 
though: since we don't know which of the R objects are still being referenced, 
once we remove the JVM object the R pointer could become a dangling pointer.

And perhaps this could be helped by increasing the aggressiveness of the R GC:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory.htm
http://adv-r.had.co.nz/memory.html#gc
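
For reference, a minimal sketch of the weak-reference approach discussed above 
(hypothetical names, not the actual JVMObjectTracker code). It also illustrates 
the dangling-handle concern: after the JVM object is collected, a lookup simply 
returns nothing, leaving the R-side handle dangling.

{code}
import java.lang.ref.WeakReference
import scala.collection.concurrent.TrieMap

class WeakObjectTracker {
  // Values are held through WeakReferences, so an otherwise-unreferenced
  // JVM object remains eligible for garbage collection.
  private val objMap = TrieMap.empty[String, WeakReference[Object]]

  def put(id: String, obj: Object): Unit = objMap.put(id, new WeakReference(obj))

  // Returns None both for unknown ids and for ids whose object was collected.
  def get(id: String): Option[Object] =
    objMap.get(id).flatMap(ref => Option(ref.get()))

  def remove(id: String): Unit = objMap.remove(id)
}
{code}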


> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>Assignee: Xiangrui Meng
> Attachments: screenshot-1.png
>
>
> JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we 
> observed that JVM objects that are no longer used are still trapped in this 
> map, which prevents those objects from being GCed. 
> It seems to make sense to use weak references (like persistentRdds in 
> SparkContext). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org