[jira] [Updated] (SPARK-18704) CrossValidator should preserve more tuning statistics
[ https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-18704: --- Description: Currently CrossValidator trains (k-fold * paramMaps) different models during the training process, yet it only passes the average metrics to CrossValidatorModel. As a result, important information such as the variance across folds for the same paramMap cannot be retrieved, and users cannot tell whether the chosen k is appropriate. Since CrossValidator is relatively expensive, we probably want to get the most out of the tuning process. Just want to see if this sounds good. In my opinion, this can be done either by passing a metrics matrix to the CrossValidatorModel, or by introducing a CrossValidatorSummary. I would vote for introducing a TuningSummary class, which can also be used by TrainValidationSplit. In the summary we can present better statistics for the tuning process. Something like a DataFrame:

+---------------+------------+--------+-----------------+
|elasticNetParam|fitIntercept|regParam|metrics          |
+---------------+------------+--------+-----------------+
|0.0            |true        |0.1     |9.747795248932505|
|0.0            |true        |0.01    |9.751942357398603|
|0.0            |false       |0.1     |9.71727627087487 |
|0.0            |false       |0.01    |9.721149803723822|
|0.5            |true        |0.1     |9.719358515436005|
|0.5            |true        |0.01    |9.748121645368501|
|0.5            |false       |0.1     |9.687771328829479|
|0.5            |false       |0.01    |9.717304811419261|
|1.0            |true        |0.1     |9.696769467196487|
|1.0            |true        |0.01    |9.744325276259957|
|1.0            |false       |0.1     |9.665822167122172|
|1.0            |false       |0.01    |9.713484065511892|
+---------------+------------+--------+-----------------+

Using the DataFrame, users can better understand the effect of different parameters. Another thing we should improve is to include the paramMaps in the CrossValidatorModel (or TrainValidationSplitModel) to allow meaningful serialization. Keeping only the metrics without the ParamMaps does not really help model reuse.

was: Currently CrossValidator trains (k-fold * paramMaps) different models during the training process, yet it only passes the average metrics to CrossValidatorModel. As a result, important information such as the variance across folds for the same paramMap cannot be retrieved, and users cannot tell whether the chosen k is appropriate. Since CrossValidator is relatively expensive, we probably want to get the most out of the tuning process. Just want to see if this sounds good. In my opinion, this can be done either by passing a metrics matrix to the CrossValidatorModel, or by introducing a CrossValidatorSummary. I would vote for introducing a TuningSummary class, which can also be used by TrainValidationSplit. In the summary we can present better statistics for the tuning process. Something like a DataFrame:

+---------------+------------+--------+-----------------+
|elasticNetParam|fitIntercept|regParam|metrics          |
+---------------+------------+--------+-----------------+
|0.0            |true        |0.1     |9.747795248932505|
|0.0            |true        |0.01    |9.751942357398603|
|0.0            |false       |0.1     |9.71727627087487 |
|0.0            |false       |0.01    |9.721149803723822|
|0.5            |true        |0.1     |9.719358515436005|
|0.5            |true        |0.01    |9.748121645368501|
|0.5            |false       |0.1     |9.687771328829479|
|0.5            |false       |0.01    |9.717304811419261|
|1.0            |true        |0.1     |9.696769467196487|
|1.0            |true        |0.01    |9.744325276259957|
|1.0            |false       |0.1     |9.665822167122172|
|1.0            |false       |0.01    |9.713484065511892|
+---------------+------------+--------+-----------------+

Using the DataFrame, users can better understand the effect of different parameters.

> CrossValidator should preserve more tuning statistics
> -----------------------------------------------------
>
> Key: SPARK-18704
> URL: https://issues.apache.org/jira/browse/SPARK-18704
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: yuhao yang
> Priority: Minor
>
> Currently CrossValidator trains (k-fold * paramMaps) different models during the training process, yet it only passes the average metrics to CrossValidatorModel. As a result, important information such as the variance across folds for the same paramMap cannot be retrieved, and users cannot tell whether the chosen k is appropriate. Since CrossValidator is relatively expensive, we probably want to get the most out of the tuning process.
> Just want to see if this sounds good. In my opinion, this can be done either by passing a metrics matrix to the CrossValidatorModel, or by introducing a CrossValidatorSummary. I would vote for introducing a TuningSummary class, which can also be used by TrainValidationSplit. In the summary we can present better statistics for the tuning process.
[jira] [Commented] (SPARK-18704) CrossValidator should preserve more tuning statistics
[ https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719499#comment-15719499 ] yuhao yang commented on SPARK-18704: One implementation of the tuning summary is available at https://github.com/hhbyyh/spark/tree/tuningsummary/mllib/src/main/scala/org/apache/spark/ml/tuning for anyone interested.

> CrossValidator should preserve more tuning statistics
> -----------------------------------------------------
>
> Key: SPARK-18704
> URL: https://issues.apache.org/jira/browse/SPARK-18704
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: yuhao yang
> Priority: Minor
>
> Currently CrossValidator trains (k-fold * paramMaps) different models during the training process, yet it only passes the average metrics to CrossValidatorModel. As a result, important information such as the variance across folds for the same paramMap cannot be retrieved, and users cannot tell whether the chosen k is appropriate. Since CrossValidator is relatively expensive, we probably want to get the most out of the tuning process.
> Just want to see if this sounds good. In my opinion, this can be done either by passing a metrics matrix to the CrossValidatorModel, or by introducing a CrossValidatorSummary. I would vote for introducing a TuningSummary class, which can also be used by TrainValidationSplit. In the summary we can present better statistics for the tuning process.
Something like a DataFrame:
> +---------------+------------+--------+-----------------+
> |elasticNetParam|fitIntercept|regParam|metrics          |
> +---------------+------------+--------+-----------------+
> |0.0            |true        |0.1     |9.747795248932505|
> |0.0            |true        |0.01    |9.751942357398603|
> |0.0            |false       |0.1     |9.71727627087487 |
> |0.0            |false       |0.01    |9.721149803723822|
> |0.5            |true        |0.1     |9.719358515436005|
> |0.5            |true        |0.01    |9.748121645368501|
> |0.5            |false       |0.1     |9.687771328829479|
> |0.5            |false       |0.01    |9.717304811419261|
> |1.0            |true        |0.1     |9.696769467196487|
> |1.0            |true        |0.01    |9.744325276259957|
> |1.0            |false       |0.1     |9.665822167122172|
> |1.0            |false       |0.01    |9.713484065511892|
> +---------------+------------+--------+-----------------+
> Using the DataFrame, users can better understand the effect of different parameters.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18704) CrossValidator should preserve more tuning statistics
yuhao yang created SPARK-18704: -- Summary: CrossValidator should preserve more tuning statistics Key: SPARK-18704 URL: https://issues.apache.org/jira/browse/SPARK-18704 Project: Spark Issue Type: Improvement Components: ML Reporter: yuhao yang Priority: Minor

Currently CrossValidator trains (k-fold * paramMaps) different models during the training process, yet it only passes the average metrics to CrossValidatorModel. As a result, important information such as the variance across folds for the same paramMap cannot be retrieved, and users cannot tell whether the chosen k is appropriate. Since CrossValidator is relatively expensive, we probably want to get the most out of the tuning process. Just want to see if this sounds good. In my opinion, this can be done either by passing a metrics matrix to the CrossValidatorModel, or by introducing a CrossValidatorSummary. I would vote for introducing a TuningSummary class, which can also be used by TrainValidationSplit. In the summary we can present better statistics for the tuning process. Something like a DataFrame:

+---------------+------------+--------+-----------------+
|elasticNetParam|fitIntercept|regParam|metrics          |
+---------------+------------+--------+-----------------+
|0.0            |true        |0.1     |9.747795248932505|
|0.0            |true        |0.01    |9.751942357398603|
|0.0            |false       |0.1     |9.71727627087487 |
|0.0            |false       |0.01    |9.721149803723822|
|0.5            |true        |0.1     |9.719358515436005|
|0.5            |true        |0.01    |9.748121645368501|
|0.5            |false       |0.1     |9.687771328829479|
|0.5            |false       |0.01    |9.717304811419261|
|1.0            |true        |0.1     |9.696769467196487|
|1.0            |true        |0.01    |9.744325276259957|
|1.0            |false       |0.1     |9.665822167122172|
|1.0            |false       |0.01    |9.713484065511892|
+---------------+------------+--------+-----------------+

Using the DataFrame, users can better understand the effect of different parameters.
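The per-fold spread the reporter wants to preserve is easy to illustrate outside Spark. Below is a minimal, hypothetical Python sketch (the metric values and the tuning_summary helper are invented for illustration; this is not Spark API): keeping the full k-fold metric matrix, rather than only its average, lets a summary report the variance across folds, which is exactly what tells users whether k is large enough.

```python
from statistics import mean, stdev

# Hypothetical k-fold metric matrix (k = 3): one list of per-fold metrics per
# paramMap. CrossValidator currently keeps only the per-paramMap average.
fold_metrics = {
    ("elasticNetParam=0.0", "regParam=0.1"): [9.71, 9.78, 9.75],
    ("elasticNetParam=0.5", "regParam=0.1"): [9.70, 9.72, 9.74],
}

def tuning_summary(fold_metrics):
    """Summarize the metric matrix as one row (params, mean, stdev) per paramMap."""
    return [
        {
            "params": params,
            "mean": mean(metrics),
            # Spread across folds: a large value suggests k is too small or the
            # metric is unstable for this paramMap.
            "stdev": stdev(metrics),
        }
        for params, metrics in fold_metrics.items()
    ]

for row in tuning_summary(fold_metrics):
    print(row)
```

The averages alone (about 9.747 vs. 9.720 here) make the two settings look cleanly ordered; the stdev column shows that the first setting also varies noticeably more across folds, information that is lost when only avgMetrics survives.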
[jira] [Assigned] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM
[ https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18700: Assignee: (was: Apache Spark)

> getCached in HiveMetastoreCatalog not thread safe cause driver OOM
> ------------------------------------------------------------------
>
> Key: SPARK-18700
> URL: https://issues.apache.org/jira/browse/SPARK-18700
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1, 2.0.0
> Reporter: Li Yuanjian
>
> In our Spark SQL platform, each query uses the same HiveContext in an independent thread, and new data is appended to tables as new partitions every 30 min. After a new partition is added to table T, we should call refreshTable to clear T's cache in cachedDataSourceTables to make the new partition searchable.
> For tables with many partitions and files (far more than spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table T will start a job to fetch all FileStatus entries in the listLeafFiles function. Because of the huge number of files, the job runs for several seconds; during that time, new queries of table T also start new jobs to fetch FileStatus, because getCached is not thread safe. This finally causes a driver OOM.
[jira] [Commented] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM
[ https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719456#comment-15719456 ] Apache Spark commented on SPARK-18700: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/16135

> getCached in HiveMetastoreCatalog not thread safe cause driver OOM
> ------------------------------------------------------------------
>
> Key: SPARK-18700
> URL: https://issues.apache.org/jira/browse/SPARK-18700
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1, 2.0.0
> Reporter: Li Yuanjian
[jira] [Assigned] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM
[ https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18700: Assignee: Apache Spark

> getCached in HiveMetastoreCatalog not thread safe cause driver OOM
> ------------------------------------------------------------------
>
> Key: SPARK-18700
> URL: https://issues.apache.org/jira/browse/SPARK-18700
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1, 2.0.0
> Reporter: Li Yuanjian
> Assignee: Apache Spark
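The race described in SPARK-18700 can be reproduced in miniature. The sketch below is hypothetical Python, not Spark's actual fix (for that, see the pull request linked above): a cache whose get-or-load path is guarded by a lock, so concurrent queries for the same table trigger the expensive load only once.

```python
import threading

class GuardedCache:
    """Toy cache with a lock-guarded get-or-load path.

    Without the lock, several threads that miss the cache at the same time all
    run the expensive loader (in Spark's case, the job listing every FileStatus
    of the table), which is the behavior the report blames for the driver OOM.
    """

    def __init__(self, loader):
        self._loader = loader
        self._cache = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:  # serialize check-then-load so it runs at most once per key
            if key not in self._cache:
                self._cache[key] = self._loader(key)
            return self._cache[key]

load_count = 0

def expensive_load(key):
    # Stand-in for the costly listLeafFiles job.
    global load_count
    load_count += 1
    return f"relation-for-{key}"

cache = GuardedCache(expensive_load)
threads = [threading.Thread(target=cache.get, args=("table_T",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(load_count)  # 1: eight concurrent queries, one expensive listing
```

Note that a single lock also serializes loads of unrelated tables; a finer-grained fix would typically cache a per-key future (compare Java's ConcurrentHashMap.computeIfAbsent), but the failure mode and the cure are the same.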
[jira] [Assigned] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM
[ https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18703: Assignee: Apache Spark

> Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not
> Dropped Until Normal Termination of JVM
> --------------------------------------------------------------------------
>
> Key: SPARK-18703
> URL: https://issues.apache.org/jira/browse/SPARK-18703
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Xiao Li
> Assignee: Apache Spark
> Priority: Critical
>
> Below are the files/directories generated for three inserts against a Hive table:
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> The first 18 files are temporary. We do not drop them until JVM termination. If the JVM does not terminate normally, these temporary files/directories will not be dropped.
> Only the last two files are needed, as shown below.
> {noformat}
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
> /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
> {noformat}
> Ideally, we should drop the created staging files and temporary data files after each insert/CTAS.
[jira] [Commented] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM
[ https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719433#comment-15719433 ] Apache Spark commented on SPARK-18703: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/16134

> Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not
> Dropped Until Normal Termination of JVM
> --------------------------------------------------------------------------
>
> Key: SPARK-18703
> URL: https://issues.apache.org/jira/browse/SPARK-18703
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Xiao Li
> Priority: Critical
[jira] [Assigned] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM
[ https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18703: Assignee: (was: Apache Spark)

> Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not
> Dropped Until Normal Termination of JVM
> --------------------------------------------------------------------------
>
> Key: SPARK-18703
> URL: https://issues.apache.org/jira/browse/SPARK-18703
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Xiao Li
> Priority: Critical
[jira] [Updated] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM
[ https://issues.apache.org/jira/browse/SPARK-18703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-18703: Description: Below are the files/directories generated for three inserts against a Hive table:
{noformat}
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
{noformat}
The first 18 files are temporary. We do not drop them until JVM termination. If the JVM does not terminate normally, these temporary files/directories will not be dropped. Only the last two files are needed, as shown below.
{noformat}
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0
{noformat}
Ideally, we should drop the created staging files and temporary data files after each insert/CTAS. The temporary files/directories can accumulate quickly when we issue many inserts, since each insert generates at least six files. This can consume a lot of space and slow down JVM termination. was: Below are the files/directories for three inserts against a Hive table: {noformat} /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
[jira] [Created] (SPARK-18703) Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM
Xiao Li created SPARK-18703: --- Summary: Insertion/CTAS against Hive Tables: Staging Directories and Data Files Not Dropped Until Normal Termination of JVM Key: SPARK-18703 URL: https://issues.apache.org/jira/browse/SPARK-18703 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Xiao Li Priority: Critical Below are the files/directories for three inserts againsts a Hive table: {noformat} /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1 /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1 /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/._SUCCESS.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/.part-0.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/_SUCCESS /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-1/part-0 /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1 /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1 /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/._SUCCESS.crc 
/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/.part-0.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/_SUCCESS /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-1/part-0 /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1 /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1 /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/._SUCCESS.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/.part-0.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/_SUCCESS /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-1/part-0 /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-0.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-0 {noformat} Ideally, we should drop the created staging files and temporary data files after each insert. 
The temporary files/directories can accumulate quickly when we issue many inserts, since each insert generates at least six files. This can consume a lot of disk space and slow down JVM termination. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
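The accumulation is easy to reproduce. A minimal sketch, assuming a Hive-enabled SparkSession and an illustrative table name (nothing below is from the report itself except the behavior it describes):

```scala
// Sketch: each INSERT against a Hive table leaves a .hive-staging_hive_* tree
// under the session scratch directory until the JVM terminates normally.
// Assumes spark-hive is on the classpath; the table name is illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS t (id INT)")
(1 to 3).foreach { i =>
  // Each statement creates a fresh staging directory plus data/CRC files.
  spark.sql(s"INSERT INTO TABLE t VALUES ($i)")
}
// Listing the scratch directory at this point shows one .hive-staging_hive_*
// tree per insert; they are only removed by a shutdown hook, not per statement.
```

This matches the listing above: three inserts, three staging trees, each with `_SUCCESS`, `part-*`, and the corresponding `.crc` files.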
[jira] [Assigned] (SPARK-18702) input_file_block_start and input_file_block_length function
[ https://issues.apache.org/jira/browse/SPARK-18702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18702: Assignee: Reynold Xin (was: Apache Spark) > input_file_block_start and input_file_block_length function > --- > > Key: SPARK-18702 > URL: https://issues.apache.org/jira/browse/SPARK-18702 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > We currently have function input_file_name to get the path of the input file, > but don't have functions to get the block start offset and length. This patch > introduces two functions: > 1. input_file_block_start: returns the file block start offset, or -1 if not > available. > 2. input_file_block_length: returns the file block length, or -1 if not > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18702) input_file_block_start and input_file_block_length function
[ https://issues.apache.org/jira/browse/SPARK-18702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719375#comment-15719375 ] Apache Spark commented on SPARK-18702: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/16133 > input_file_block_start and input_file_block_length function > --- > > Key: SPARK-18702 > URL: https://issues.apache.org/jira/browse/SPARK-18702 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > We currently have function input_file_name to get the path of the input file, > but don't have functions to get the block start offset and length. This patch > introduces two functions: > 1. input_file_block_start: returns the file block start offset, or -1 if not > available. > 2. input_file_block_length: returns the file block length, or -1 if not > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18702) input_file_block_start and input_file_block_length function
[ https://issues.apache.org/jira/browse/SPARK-18702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18702: Assignee: Apache Spark (was: Reynold Xin) > input_file_block_start and input_file_block_length function > --- > > Key: SPARK-18702 > URL: https://issues.apache.org/jira/browse/SPARK-18702 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > We currently have function input_file_name to get the path of the input file, > but don't have functions to get the block start offset and length. This patch > introduces two functions: > 1. input_file_block_start: returns the file block start offset, or -1 if not > available. > 2. input_file_block_length: returns the file block length, or -1 if not > available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18702) input_file_block_start and input_file_block_length function
Reynold Xin created SPARK-18702: --- Summary: input_file_block_start and input_file_block_length function Key: SPARK-18702 URL: https://issues.apache.org/jira/browse/SPARK-18702 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We currently have function input_file_name to get the path of the input file, but don't have functions to get the block start offset and length. This patch introduces two functions: 1. input_file_block_start: returns the file block start offset, or -1 if not available. 2. input_file_block_length: returns the file block length, or -1 if not available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
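A hedged usage sketch of the two proposed functions, alongside the existing `input_file_name` (assuming they are exposed as SQL functions once the patch lands; the path is illustrative):

```scala
// Sketch: query block metadata for each input row of a file-based source.
// input_file_name() already exists; the other two are the functions this
// issue proposes, so their availability depends on the patch being merged.
spark.read.parquet("/path/to/logs")
  .selectExpr(
    "input_file_name()",          // path of the input file
    "input_file_block_start()",   // block start offset, or -1 if unavailable
    "input_file_block_length()")  // block length, or -1 if unavailable
  .show(truncate = false)
```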
[jira] [Updated] (SPARK-18681) Throw Filtering is supported only on partition keys of type string exception
[ https://issues.apache.org/jira/browse/SPARK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-18681: Description: Cloudera put {{/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml}} as the configuration file for the Hive Metastore Server, where {{hive.metastore.try.direct.sql=false}}. But Spark isn't reading this configuration file and gets the default value {{hive.metastore.try.direct.sql=true}}. We should use the {{getMetaConf}} or {{getMSC.getConfigValue}} method to obtain the original configuration from the Hive Metastore Server. {noformat} spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT); Time taken: 0.221 seconds spark-sql> select * from test where part=1 limit 10; 16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from test where part=1 limit 10] java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. 
Please report a bug: https://issues.apache.org/jira/browse/SPARK at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91) at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938) at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156) at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151) at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150) at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435) at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295) at org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134) at org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:133) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at
[jira] [Commented] (SPARK-18701) Poisson GLM fails due to wrong initialization
[ https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719212#comment-15719212 ] Apache Spark commented on SPARK-18701: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/16131 > Poisson GLM fails due to wrong initialization > - > > Key: SPARK-18701 > URL: https://issues.apache.org/jira/browse/SPARK-18701 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Priority: Critical > Fix For: 2.2.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Poisson GLM fails for many standard data sets. The issue is incorrect > initialization leading to almost zero probability and weights. The following > simple example reproduces the error. > {code:borderStyle=solid} > val datasetPoissonLogWithZero = Seq( > LabeledPoint(0.0, Vectors.dense(18, 1.0)), > LabeledPoint(1.0, Vectors.dense(12, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(13, 2.0)), > LabeledPoint(0.0, Vectors.dense(15, 1.0)), > LabeledPoint(1.0, Vectors.dense(16, 1.0)), > LabeledPoint(0.0, Vectors.dense(10, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(0.0, Vectors.dense(13, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(1.0, Vectors.dense(12, 2.0)) > ).toDF() > > val glr = new GeneralizedLinearRegression() > .setFamily("poisson") > .setLink("log") > .setMaxIter(20) > .setRegParam(0) > val model = glr.fit(datasetPoissonLogWithZero) > {code} > The issue is in the initialization: the mean is initialized as the response, > which could be zero. 
Applying the log link results in very negative numbers > (protected against -Inf), which again leads to close to zero probability and > weights in the weighted least squares. The fix is easy: just add a small > constant, highlighted in red below. > > override def initialize(y: Double, weight: Double): Double = { > require(y >= 0.0, "The response variable of Poisson family " + > s"should be non-negative, but got $y") > y {color:red}+ 0.1 {color} > } > I already have a fix and test code. Will create a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18701) Poisson GLM fails due to wrong initialization
[ https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18701: Assignee: Apache Spark > Poisson GLM fails due to wrong initialization > - > > Key: SPARK-18701 > URL: https://issues.apache.org/jira/browse/SPARK-18701 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Assignee: Apache Spark >Priority: Critical > Fix For: 2.2.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Poisson GLM fails for many standard data sets. The issue is incorrect > initialization leading to almost zero probability and weights. The following > simple example reproduces the error. > {code:borderStyle=solid} > val datasetPoissonLogWithZero = Seq( > LabeledPoint(0.0, Vectors.dense(18, 1.0)), > LabeledPoint(1.0, Vectors.dense(12, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(13, 2.0)), > LabeledPoint(0.0, Vectors.dense(15, 1.0)), > LabeledPoint(1.0, Vectors.dense(16, 1.0)), > LabeledPoint(0.0, Vectors.dense(10, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(0.0, Vectors.dense(13, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(1.0, Vectors.dense(12, 2.0)) > ).toDF() > > val glr = new GeneralizedLinearRegression() > .setFamily("poisson") > .setLink("log") > .setMaxIter(20) > .setRegParam(0) > val model = glr.fit(datasetPoissonLogWithZero) > {code} > The issue is in the initialization: the mean is initialized as the response, > which could be zero. Applying the log link results in very negative numbers > (protected against -Inf), which again leads to close to zero probability and > weights in the weighted least squares. 
The fix is easy: just add a small > constant, highlighted in red below. > > override def initialize(y: Double, weight: Double): Double = { > require(y >= 0.0, "The response variable of Poisson family " + > s"should be non-negative, but got $y") > y {color:red}+ 0.1 {color} > } > I already have a fix and test code. Will create a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18701) Poisson GLM fails due to wrong initialization
[ https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18701: Assignee: (was: Apache Spark) > Poisson GLM fails due to wrong initialization > - > > Key: SPARK-18701 > URL: https://issues.apache.org/jira/browse/SPARK-18701 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Priority: Critical > Fix For: 2.2.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Poisson GLM fails for many standard data sets. The issue is incorrect > initialization leading to almost zero probability and weights. The following > simple example reproduces the error. > {code:borderStyle=solid} > val datasetPoissonLogWithZero = Seq( > LabeledPoint(0.0, Vectors.dense(18, 1.0)), > LabeledPoint(1.0, Vectors.dense(12, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(13, 2.0)), > LabeledPoint(0.0, Vectors.dense(15, 1.0)), > LabeledPoint(1.0, Vectors.dense(16, 1.0)), > LabeledPoint(0.0, Vectors.dense(10, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(0.0, Vectors.dense(13, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(1.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(15, 0.0)), > LabeledPoint(0.0, Vectors.dense(12, 2.0)), > LabeledPoint(1.0, Vectors.dense(12, 2.0)) > ).toDF() > > val glr = new GeneralizedLinearRegression() > .setFamily("poisson") > .setLink("log") > .setMaxIter(20) > .setRegParam(0) > val model = glr.fit(datasetPoissonLogWithZero) > {code} > The issue is in the initialization: the mean is initialized as the response, > which could be zero. Applying the log link results in very negative numbers > (protected against -Inf), which again leads to close to zero probability and > weights in the weighted least squares. The fix is easy: just add a small > constant, highlighted in red below. 
> > override def initialize(y: Double, weight: Double): Double = { > require(y >= 0.0, "The response variable of Poisson family " + > s"should be non-negative, but got $y") > y {color:red}+ 0.1 {color} > } > I already have a fix and test code. Will create a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18701) Poisson GLM fails due to wrong initialization
Wayne Zhang created SPARK-18701: --- Summary: Poisson GLM fails due to wrong initialization Key: SPARK-18701 URL: https://issues.apache.org/jira/browse/SPARK-18701 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.0.2 Reporter: Wayne Zhang Priority: Critical Fix For: 2.2.0 Poisson GLM fails for many standard data sets. The issue is incorrect initialization leading to almost zero probability and weights. The following simple example reproduces the error. {code:borderStyle=solid} val datasetPoissonLogWithZero = Seq( LabeledPoint(0.0, Vectors.dense(18, 1.0)), LabeledPoint(1.0, Vectors.dense(12, 0.0)), LabeledPoint(0.0, Vectors.dense(15, 0.0)), LabeledPoint(0.0, Vectors.dense(13, 2.0)), LabeledPoint(0.0, Vectors.dense(15, 1.0)), LabeledPoint(1.0, Vectors.dense(16, 1.0)), LabeledPoint(0.0, Vectors.dense(10, 0.0)), LabeledPoint(0.0, Vectors.dense(15, 0.0)), LabeledPoint(0.0, Vectors.dense(12, 2.0)), LabeledPoint(0.0, Vectors.dense(13, 0.0)), LabeledPoint(1.0, Vectors.dense(15, 0.0)), LabeledPoint(1.0, Vectors.dense(15, 0.0)), LabeledPoint(0.0, Vectors.dense(15, 0.0)), LabeledPoint(0.0, Vectors.dense(12, 2.0)), LabeledPoint(1.0, Vectors.dense(12, 2.0)) ).toDF() val glr = new GeneralizedLinearRegression() .setFamily("poisson") .setLink("log") .setMaxIter(20) .setRegParam(0) val model = glr.fit(datasetPoissonLogWithZero) {code} The issue is in the initialization: the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close to zero probability and weights in the weighted least squares. The fix is easy: just add a small constant, highlighted in red below. override def initialize(y: Double, weight: Double): Double = { require(y >= 0.0, "The response variable of Poisson family " + s"should be non-negative, but got $y") y {color:red}+ 0.1 {color} } I already have a fix and test code. Will create a PR. 
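The numeric failure mode is visible without Spark at all: with a zero response, initializing the mean as the response and applying the log link produces negative infinity, which drives the IRLS working weights toward zero. A minimal sketch of the problem and of the `+ 0.1` offset proposed above (plain Scala, for illustration only):

```scala
// Sketch of the Poisson initialization issue, independent of Spark.
// Broken: mu0 = y, which may be exactly 0, so log(mu0) = -Infinity.
// Fixed:  mu0 = y + 0.1, which keeps the log link finite for y >= 0.
def initializeBroken(y: Double): Double = y
def initializeFixed(y: Double): Double = y + 0.1

val y = 0.0
println(math.log(initializeBroken(y)))  // -Infinity: breaks the working response
println(math.log(initializeFixed(y)))   // finite, so IRLS weights stay usable
```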
[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount
[ https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719170#comment-15719170 ] Sumesh Kumar commented on SPARK-18200: -- Thanks much [~dongjoon] > GraphX Invalid initial capacity when running triangleCount > -- > > Key: SPARK-18200 > URL: https://issues.apache.org/jira/browse/SPARK-18200 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: Databricks, Ubuntu 16.04, macOS Sierra >Reporter: Denny Lee >Assignee: Dongjoon Hyun > Labels: graph, graphx > Fix For: 2.0.3, 2.1.0 > > > Running GraphX triangle count on large-ish file results in the "Invalid > initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, > 2.0.1, and 2.0.2). You can see the results at: http://bit.ly/2eQKWDN > Running the same code on Spark 1.6 and the query completes without any > problems: http://bit.ly/2fATO1M > As well, running the GraphFrames version of this code runs as well (Spark > 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8 > Reference Stackoverflow question: > Spark GraphX: requirement failed: Invalid initial capacity > (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount
[ https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719124#comment-15719124 ] Dongjoon Hyun edited comment on SPARK-18200 at 12/4/16 2:15 AM: Hi, Yes, the bugs are there in 2.0.1. The fix will be in upcoming Apache Spark 2.0.3 and 2.1.0. We cannot backport into 2.0.1 because it's already released. was (Author: dongjoon): Hi, It will be in upcoming Apache Spark 2.0.3 and 2.1.0. We cannot backport into 2.0.1 because it's already released. > GraphX Invalid initial capacity when running triangleCount > -- > > Key: SPARK-18200 > URL: https://issues.apache.org/jira/browse/SPARK-18200 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: Databricks, Ubuntu 16.04, macOS Sierra >Reporter: Denny Lee >Assignee: Dongjoon Hyun > Labels: graph, graphx > Fix For: 2.0.3, 2.1.0 > > > Running GraphX triangle count on large-ish file results in the "Invalid > initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, > 2.0.1, and 2.0.2). You can see the results at: http://bit.ly/2eQKWDN > Running the same code on Spark 1.6 and the query completes without any > problems: http://bit.ly/2fATO1M > As well, running the GraphFrames version of this code runs as well (Spark > 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8 > Reference Stackoverflow question: > Spark GraphX: requirement failed: Invalid initial capacity > (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount
[ https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719124#comment-15719124 ] Dongjoon Hyun commented on SPARK-18200: --- Hi, It will be in upcoming Apache Spark 2.0.3 and 2.1.0. We cannot backport into 2.0.1 because it's already released. > GraphX Invalid initial capacity when running triangleCount > -- > > Key: SPARK-18200 > URL: https://issues.apache.org/jira/browse/SPARK-18200 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: Databricks, Ubuntu 16.04, macOS Sierra >Reporter: Denny Lee >Assignee: Dongjoon Hyun > Labels: graph, graphx > Fix For: 2.0.3, 2.1.0 > > > Running GraphX triangle count on large-ish file results in the "Invalid > initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, > 2.0.1, and 2.0.2). You can see the results at: http://bit.ly/2eQKWDN > Running the same code on Spark 1.6 and the query completes without any > problems: http://bit.ly/2fATO1M > As well, running the GraphFrames version of this code runs as well (Spark > 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8 > Reference Stackoverflow question: > Spark GraphX: requirement failed: Invalid initial capacity > (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount
[ https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719081#comment-15719081 ] Sumesh Kumar commented on SPARK-18200: -- Does this issue exist currently in version 2.0.1?. I just ran a test and it's throwing the following exception. User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 10.0 failed 4 times, most recent failure: Lost task 3.3 in stage 10.0 (TID 196, BD-S2F13): java.lang.IllegalArgumentException: requirement failed: Invalid initial capacity at scala.Predef$.require(Predef.scala:224) at org.apache.spark.util.collection.OpenHashSet$mcJ$sp.(OpenHashSet.scala:51) at org.apache.spark.util.collection.OpenHashSet$mcJ$sp.(OpenHashSet.scala:57) at org.apache.spark.graphx.lib.TriangleCount$$anonfun$5.apply(TriangleCount.scala:70) at org.apache.spark.graphx.lib.TriangleCount$$anonfun$5.apply(TriangleCount.scala:69) at org.apache.spark.graphx.impl.VertexPartitionBaseOps.map(VertexPartitionBaseOps.scala:61) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$mapValues$2.apply(VertexRDDImpl.scala:102) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$mapValues$2.apply(VertexRDDImpl.scala:102) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:154) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332) at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) at 
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) > GraphX Invalid initial capacity when running triangleCount > -- > > Key: SPARK-18200 > URL: https://issues.apache.org/jira/browse/SPARK-18200 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: Databricks, Ubuntu 16.04, macOS Sierra >Reporter: Denny Lee >Assignee: Dongjoon Hyun > Labels: graph, graphx > Fix For: 2.0.3, 2.1.0 > > > Running GraphX triangle count on large-ish file results in the "Invalid > initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, > 2.0.1, and 2.0.2). 
You can see the results at: http://bit.ly/2eQKWDN > Running the same code on Spark 1.6 and the query completes without any > problems: http://bit.ly/2fATO1M > As well, running the GraphFrames version of this code runs as well (Spark > 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8 > Reference Stackoverflow question: > Spark GraphX: requirement failed: Invalid initial capacity > (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
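A minimal sketch of the call path that hits the failing `OpenHashSet` precondition on the affected 2.0.x versions (assuming a SparkContext `sc`; the edge-list path is illustrative):

```scala
// Sketch: triangle counting over an edge-list file. On Spark 2.0.0-2.0.2 this
// could fail with "requirement failed: Invalid initial capacity" because
// TriangleCount constructed an OpenHashSet with an invalid capacity for some
// vertices; the fix ships in 2.0.3 / 2.1.0 per the comments above.
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "/path/to/edges.txt",
  canonicalOrientation = true)  // triangleCount requires canonical orientation
val triangles = graph.triangleCount()  // throws on affected 2.0.x versions
triangles.vertices.take(5).foreach(println)
```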
[jira] [Resolved] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18081. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 15795 [https://github.com/apache/spark/pull/15795] > Locality Sensitive Hashing (LSH) User Guide > --- > > Key: SPARK-18081 > URL: https://issues.apache.org/jira/browse/SPARK-18081 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yun Ni > Fix For: 2.1.1, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18581. --- Resolution: Not A Problem > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probability much larger than > 1. That leads me to that fact that, the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that problem lies in the computation of > determinant of the covariance matrix. > The computation is simplified by using pseudo-determinant of a positive > defined matrix. > In my case, I have a feature = 0 for all data point. > As a result, covariance matrix is not invertible <=> det(covariance matrix) = > 0 => pseudo-determinant will be very close to zero, > Thus, log(pseudo-determinant) will be a large negative number which finally > make logpdf very biger, pdf will be even bigger > 1. > As said in comments of MultivariateGaussian.scala, > """ > Singular values are considered to be non-zero only if they exceed a tolerance > based on machine precision. > """ > But if a singular value is considered to be zero, means the covariance matrix > is non invertible which is a contradiction to the assumption that it should > be invertible. > So we should check if there a single value is smaller than the tolerance > before computing the pseudo determinant -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718790#comment-15718790 ] Reynold Xin commented on SPARK-8007: spark_partition_id() is available in PySpark starting 1.6. It's in pyspark.functions.spark_partition_id. > Support resolving virtual columns in DataFrames > --- > > Key: SPARK-8007 > URL: https://issues.apache.org/jira/browse/SPARK-8007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Joseph Batchik > > Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to > SparkPartitionID expression. > A cool use case is to understand physical data skew: > {code} > df.groupBy("SPARK__PARTITION__ID").count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718706#comment-15718706 ] Hao Ren commented on SPARK-18581: - I checked several (mu, sigma) pairs in R. The package I used is mvtnorm. The numerical difference of the pdf between MLlib and R is negligible, no matter whether sigma is invertible or (near-)singular. Hence, there is no problem here. Here is my code: https://gist.github.com/invkrh/2a5422c01a3c3a063f504f1f099cbdae which can generate R code for cross-checking > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib > Affects Versions: 1.6.2, 2.0.2 > Reporter: Hao Ren > > When training a GaussianMixtureModel, I found some probabilities much larger than 1. That led me to the fact that the value returned by > MultivariateGaussian.pdf can be on the order of 10^5, etc. > After reviewing the code, I found that the problem lies in the computation of the > determinant of the covariance matrix. > The computation is simplified by using the pseudo-determinant of a positive > definite matrix. > In my case, I have a feature = 0 for all data points. > As a result, the covariance matrix is not invertible <=> det(covariance matrix) = > 0 => the pseudo-determinant will be very close to zero. > Thus, log(pseudo-determinant) will be a large negative number, which finally > makes logpdf very large, so pdf will be even greater than 1. > As said in the comments of MultivariateGaussian.scala: > """ > Singular values are considered to be non-zero only if they exceed a tolerance > based on machine precision. > """ > But if a singular value is considered to be zero, that means the covariance matrix > is non-invertible, which contradicts the assumption that it should > be invertible.
> So we should check whether any singular value is smaller than the tolerance > before computing the pseudo-determinant -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
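The tolerance check proposed in this report can be sketched outside Spark. Below is a minimal, hypothetical Python sketch (not Spark's actual MLlib code): given the singular values of the covariance matrix, it refuses to compute a log pseudo-determinant when any singular value falls below the machine-precision tolerance, instead of silently producing a huge density.

```python
import math

def log_pseudo_determinant(singular_values, tol):
    """Sum of log(s) over the singular values s of a covariance matrix.

    Following the proposal above: if any singular value is at or below
    the tolerance, the matrix is effectively singular and the density is
    not well defined, so fail fast rather than return a large negative
    log-determinant (which would make pdf values exceed 1).
    """
    if any(s <= tol for s in singular_values):
        raise ValueError("covariance matrix is singular: "
                         "a singular value is below the tolerance")
    return sum(math.log(s) for s in singular_values)
```

With `singular_values = [2.0, 1.0]` this returns `log(2)`; with a value like `1e-12` against `tol = 1e-9` it raises instead of feeding a near-zero determinant into `logpdf`.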
[jira] [Resolved] (SPARK-18582) Whitelist LogicalPlan operators allowed in correlated subqueries
[ https://issues.apache.org/jira/browse/SPARK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18582. --- Resolution: Fixed Assignee: Nattavut Sutyanyong Fix Version/s: 2.1.0 > Whitelist LogicalPlan operators allowed in correlated subqueries > > > Key: SPARK-18582 > URL: https://issues.apache.org/jira/browse/SPARK-18582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Assignee: Nattavut Sutyanyong > Fix For: 2.1.0 > > > We want to tighten the code that handles correlated subquery to whitelist > operators that are allowed in it. > The current code in {{def pullOutCorrelatedPredicates}} looks like > {code} > // Simplify the predicates before pulling them out. > val transformed = BooleanSimplification(sub) transformUp { > case f @ Filter(cond, child) => ... > case p @ Project(expressions, child) => ... > case a @ Aggregate(grouping, expressions, child) => ... > case w : Window => ... > case j @ Join(left, _, RightOuter, _) => ... > case j @ Join(left, right, FullOuter, _) => ... > case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ... > case u: Union => ... > case s: SetOperation => ... > case e: Expand => ... > case l : LocalLimit => ... > case g : GlobalLimit => ... > case s : Sample => ... > case p => > failOnOuterReference(p) > ... > } > {code} > The code disallows operators in a sub plan of an operator hosting correlation > on a case by case basis. As it is today, it only blocks {{Union}}, > {{Intersect}}, {{Except}}, {{Expand}} {{LocalLimit}} {{GlobalLimit}} > {{Sample}} {{FullOuter}} and right table of {{LeftOuter}} (and left table of > {{RightOuter}}). That means any {{LogicalPlan}} operators that are not in the > list above are permitted to be under a correlation point. Is this risky? > There are many (30+ at least from browsing the {{LogicalPlan}} type > hierarchy) operators derived from {{LogicalPlan}} class. 
> For the case of {{ScalarSubquery}}, it explicitly checks that only > {{SubqueryAlias}}, {{Project}}, {{Filter}}, and {{Aggregate}} are allowed > ({{CheckAnalysis.scala}} around lines 126-165, in and after {{def > cleanQuery}}). We should whitelist which operators are allowed in correlated > subqueries. At first glance, we should allow, in addition to the ones > allowed in {{ScalarSubquery}}: {{Join}}, {{Distinct}}, {{Sort}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
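The whitelist idea in this ticket, as opposed to the current case-by-case blacklist, can be illustrated with a small tree walk. This is a hypothetical Python sketch, not Spark's analyzer: operator names mirror the `LogicalPlan` classes mentioned above, and any node not explicitly allowed under a correlation point is rejected.

```python
# Operators the ticket proposes to allow in correlated subqueries:
# the ScalarSubquery set plus Join, Distinct, Sort.
ALLOWED = {"SubqueryAlias", "Project", "Filter", "Aggregate",
           "Join", "Distinct", "Sort"}

class Node:
    """Toy stand-in for a LogicalPlan operator."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def check_correlated_subplan(node):
    """True iff every operator in the subtree is on the whitelist.

    A whitelist fails closed: a newly added LogicalPlan subclass is
    rejected by default, unlike the current blacklist, where any of the
    30+ unlisted operators is silently permitted.
    """
    if node.name not in ALLOWED:
        return False
    return all(check_correlated_subplan(c) for c in node.children)
```

For example, `Project > Filter > SubqueryAlias` passes, while a `Union` anywhere in the subtree is rejected even though the current blacklist-style code would have to name it explicitly.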
[jira] [Comment Edited] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718615#comment-15718615 ] Ruslan Dautkhanov edited comment on SPARK-8007 at 12/3/16 7:34 PM: --- Is {noformat}spark__partition__id{noformat} available in PySpark too? Can't find a way to run the same code in PySpark. was (Author: tagar): Is spark__partition__id available in PySpark too? Can't find a way to run the same code in PySpark. > Support resolving virtual columns in DataFrames > --- > > Key: SPARK-8007 > URL: https://issues.apache.org/jira/browse/SPARK-8007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Joseph Batchik > > Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to > SparkPartitionID expression. > A cool use case is to understand physical data skew: > {code} > df.groupBy("SPARK__PARTITION__ID").count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718615#comment-15718615 ] Ruslan Dautkhanov commented on SPARK-8007: -- Is spark__partition__id available in PySpark too? Can't find a way to run the same code in PySpark. > Support resolving virtual columns in DataFrames > --- > > Key: SPARK-8007 > URL: https://issues.apache.org/jira/browse/SPARK-8007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Joseph Batchik > > Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to > SparkPartitionID expression. > A cool use case is to understand physical data skew: > {code} > df.groupBy("SPARK__PARTITION__ID").count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM
[ https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718580#comment-15718580 ] Li Yuanjian commented on SPARK-18700: - I'll open a PR for this soon: add a ReadWriteLock for each table's relation in the cache, not for the whole cachedDataSourceTables. > getCached in HiveMetastoreCatalog not thread safe cause driver OOM > -- > > Key: SPARK-18700 > URL: https://issues.apache.org/jira/browse/SPARK-18700 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.6.1, 2.0.0 > Reporter: Li Yuanjian > > In our Spark SQL platform, each query uses the same HiveContext in an > independent thread, and new data is appended to tables as new partitions every > 30 minutes. After a new partition is added to table T, we should call refreshTable to > clear T’s cache in cachedDataSourceTables to make the new partition > searchable. > For a table with more partitions and files (many more than > spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table > T will start a job to fetch all FileStatus entries in the listLeafFiles function. Because > of the huge number of files, the job will run for several seconds; during that > time, new queries of table T will also start new jobs to fetch FileStatus > because getCached is not thread safe. This finally causes a driver > OOM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
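The per-table locking fix sketched in the comment above can be illustrated outside Spark. Below is a minimal Python sketch with hypothetical names (Python's standard library has no ReadWriteLock, so a plain per-key `Lock` stands in): concurrent lookups for the same table serialize on that table's lock, so the expensive file listing runs once instead of once per concurrent query.

```python
import threading

class PerKeyCache:
    """Sketch of the proposed fix: one lock per table key rather than a
    single lock (or no lock) over the whole cachedDataSourceTables map.
    Concurrent callers for the same key wait for one expensive load; the
    second caller then hits the cache instead of relaunching the job."""

    def __init__(self, loader):
        self._loader = loader          # expensive load, e.g. listLeafFiles
        self._cache = {}               # key -> cached relation
        self._locks = {}               # key -> its dedicated lock
        self._meta = threading.Lock()  # guards the lock map itself

    def _lock_for(self, key):
        with self._meta:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        with self._lock_for(key):
            if key not in self._cache:
                self._cache[key] = self._loader(key)
            return self._cache[key]
```

The design point mirrors the comment: locking per key keeps queries against *other* tables fully concurrent, which a single coarse lock over the whole cache would not.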
[jira] [Updated] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM
[ https://issues.apache.org/jira/browse/SPARK-18700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian updated SPARK-18700: Description: In our Spark SQL platform, each query uses the same HiveContext in an independent thread, and new data is appended to tables as new partitions every 30 minutes. After a new partition is added to table T, we should call refreshTable to clear T’s cache in cachedDataSourceTables to make the new partition searchable. For a table with more partitions and files (many more than spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table T will start a job to fetch all FileStatus entries in the listLeafFiles function. Because of the huge number of files, the job will run for several seconds; during that time, new queries of table T will also start new jobs to fetch FileStatus, because getCached is not thread safe. This finally causes a driver OOM. was: In our spark sql platform, each query use same HiveContext and independent thread, new data will append to tables as new partitions every 30min. After a new partition added to table T, we should call refreshTable to clear T’s cache in cachedDataSourceTables to make the new partition searchable. For the table have more partitions and files(much bigger than spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table T will start a job to fetch all FileStatus in listLeafFiles function. Because of the huge number of files, the job will run several seconds, during the time, new queries of table T will also start new jobs to fetch FileStatus because of the function of getCache is not thread safe. Final cause a driver OOM.
> getCached in HiveMetastoreCatalog not thread safe cause driver OOM > -- > > Key: SPARK-18700 > URL: https://issues.apache.org/jira/browse/SPARK-18700 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.6.1, 2.0.0 > Reporter: Li Yuanjian > > In our Spark SQL platform, each query uses the same HiveContext in an > independent thread, and new data is appended to tables as new partitions every > 30 minutes. After a new partition is added to table T, we should call refreshTable to > clear T’s cache in cachedDataSourceTables to make the new partition > searchable. > For a table with more partitions and files (many more than > spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table > T will start a job to fetch all FileStatus entries in the listLeafFiles function. Because > of the huge number of files, the job will run for several seconds; during that > time, new queries of table T will also start new jobs to fetch FileStatus > because getCached is not thread safe. This finally causes a driver > OOM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18700) getCached in HiveMetastoreCatalog not thread safe cause driver OOM
Li Yuanjian created SPARK-18700: --- Summary: getCached in HiveMetastoreCatalog not thread safe cause driver OOM Key: SPARK-18700 URL: https://issues.apache.org/jira/browse/SPARK-18700 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0, 1.6.1 Reporter: Li Yuanjian In our Spark SQL platform, each query uses the same HiveContext in an independent thread, and new data is appended to tables as new partitions every 30 minutes. After a new partition is added to table T, we should call refreshTable to clear T’s cache in cachedDataSourceTables to make the new partition searchable. For a table with more partitions and files (many more than spark.sql.sources.parallelPartitionDiscovery.threshold), a new query of table T will start a job to fetch all FileStatus entries in the listLeafFiles function. Because of the huge number of files, the job will run for several seconds; during that time, new queries of table T will also start new jobs to fetch FileStatus, because getCached is not thread safe. This finally causes a driver OOM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18696) Upgrade sbt plugins
[ https://issues.apache.org/jira/browse/SPARK-18696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718550#comment-15718550 ] Weiqing Yang commented on SPARK-18696: -- Oh, yes, thanks for closing this. > Upgrade sbt plugins > --- > > Key: SPARK-18696 > URL: https://issues.apache.org/jira/browse/SPARK-18696 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Priority: Minor > > For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt > plugins will be upgraded: > {code} > sbt-assembly: 0.11.2 -> 0.14.3 > sbteclipse-plugin: 4.0.0 -> 5.0.1 > sbt-mima-plugin: 0.1.11 -> 0.1.12 > org.ow2.asm/asm: 5.0.3 -> 5.1 > org.ow2.asm/asm-commons: 5.0.3 -> 5.1 > {code} > All other plugins are up-to-date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18697) Upgrade sbt plugins
[ https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiqing Yang updated SPARK-18697: - Target Version/s: (was: 2.2.0) > Upgrade sbt plugins > --- > > Key: SPARK-18697 > URL: https://issues.apache.org/jira/browse/SPARK-18697 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Priority: Trivial > > For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt > plugins will be upgraded: > {code} > sbt-assembly: 0.11.2 -> 0.14.3 > sbteclipse-plugin: 4.0.0 -> 5.0.1 > sbt-mima-plugin: 0.1.11 -> 0.1.12 > org.ow2.asm/asm: 5.0.3 -> 5.1 > org.ow2.asm/asm-commons: 5.0.3 -> 5.1 > {code} > All other plugins are up-to-date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718478#comment-15718478 ] Anirudh Ramanathan commented on SPARK-18278: There is a way to use a standard image that already exists (say ubuntu) and download the distribution and dependencies onto it prior to running drivers and executors. I explored this initially, but even if this were allowed, it's not likely to be used much. From talking to people looking to use Spark on Kubernetes, it appears that they'd prefer either an official image or to build their own image containing the distribution and application jars. > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core > Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed
Jakub Nowacki created SPARK-18699: - Summary: Spark CSV parsing types other than String throws exception when malformed Key: SPARK-18699 URL: https://issues.apache.org/jira/browse/SPARK-18699 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Jakub Nowacki If a CSV file is read and the schema contains any type other than String, an exception is thrown when a string value in the CSV is malformed; e.g. if a timestamp does not match the defined format: {code} Caused by: java.lang.IllegalArgumentException at java.sql.Date.valueOf(Date.java:143) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) at scala.util.Try.getOrElse(Try.scala:79) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269) at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116) at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) ... 8 more {code} It behaves similarly with Integer and Long types, from what I've seen. To my understanding modes PERMISSIVE and DROPMALFORMED should just null the value or drop the line, but instead they kill the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
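The parse-mode semantics the reporter expects can be sketched without Spark. This is a hypothetical Python sketch of the *intended* behavior (the helper names are made up, not Spark's `CSVTypeCast` API): on a malformed value, PERMISSIVE should null the field, DROPMALFORMED should drop the row, and only FAILFAST should propagate the exception that, per the report, all modes currently throw.

```python
from datetime import datetime

class RowDropped(Exception):
    """Signal that the current row should be skipped (DROPMALFORMED)."""

def cast_field(value, caster, mode="PERMISSIVE"):
    """Cast one CSV field with the expected mode semantics.

    caster is any str -> value function (int, a strptime wrapper, ...).
    PERMISSIVE nulls the bad value, DROPMALFORMED asks for the row to
    be dropped, FAILFAST (and any other mode) re-raises.
    """
    try:
        return caster(value)
    except ValueError:
        if mode == "PERMISSIVE":
            return None
        if mode == "DROPMALFORMED":
            raise RowDropped()
        raise
```

For instance, `cast_field("oops", int)` yields `None` under PERMISSIVE instead of killing the job, while a well-formed `"2016-12-03"` parsed with a `strptime` caster casts normally.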
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718249#comment-15718249 ] Erik Erlandson commented on SPARK-18278: Not publishing images puts users in the position of not being able to run this out of the box. First they would have to either build images themselves or find somebody else's third-party images, etc. It doesn't seem like it would make for good UX. > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core > Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718240#comment-15718240 ] Erik Erlandson commented on SPARK-18278: A possible scheme might be to publish the Dockerfiles, but not actually build the images. It seems more standard to actually publish images for the community. Is there some reason for not wanting to do that? > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core > Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18698) public constructor with uid for IndexToString-class
Bjoern Toldbod created SPARK-18698: -- Summary: public constructor with uid for IndexToString-class Key: SPARK-18698 URL: https://issues.apache.org/jira/browse/SPARK-18698 Project: Spark Issue Type: Wish Components: ML Affects Versions: 2.0.2 Reporter: Bjoern Toldbod Priority: Minor The IndexToString class in org.apache.spark.ml.feature does not provide a public constructor that takes a uid string. It would be nice to have such a constructor. (Generally, being able to name pipeline stages makes it much easier to work with complex models.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18020) Kinesis receiver does not snapshot when shard completes
[ https://issues.apache.org/jira/browse/SPARK-18020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717915#comment-15717915 ] Takeshi Yamamuro commented on SPARK-18020: -- I'm currently looking into this issue. > Kinesis receiver does not snapshot when shard completes > --- > > Key: SPARK-18020 > URL: https://issues.apache.org/jira/browse/SPARK-18020 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: Yonathan Randolph >Priority: Minor > Labels: kinesis > > When a kinesis shard is split or combined and the old shard ends, the Amazon > Kinesis Client library [calls > IRecordProcessor.shutdown|https://github.com/awslabs/amazon-kinesis-client/blob/v1.7.0/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/ShutdownTask.java#L100] > and expects that {{IRecordProcessor.shutdown}} must checkpoint the sequence > number {{ExtendedSequenceNumber.SHARD_END}} before returning. Unfortunately, > spark’s > [KinesisRecordProcessor|https://github.com/apache/spark/blob/v2.0.1/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala] > sometimes does not checkpoint SHARD_END. This results in an error message, > and spark is then blocked indefinitely from processing any items from the > child shards. > This issue has also been raised on StackOverflow: [resharding while spark > running on kinesis > stream|http://stackoverflow.com/questions/38898691/resharding-while-spark-running-on-kinesis-stream] > Exception that is logged: > {code} > 16/10/19 19:37:49 ERROR worker.ShutdownTask: Application exception. 
> java.lang.IllegalArgumentException: Application didn't checkpoint at end of > shard shardId-0030 > at > com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:106) > at > com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49) > at > com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Command used to split shard: > {code} > aws kinesis --region us-west-1 split-shard --stream-name my-stream > --shard-to-split shardId-0030 --new-starting-hash-key > 5316911983139663491615228241121378303 > {code} > After the spark-streaming job has hung, examining the DynamoDB table > indicates that the parent shard processor has not reached > {{ExtendedSequenceNumber.SHARD_END}} and the child shards are still at > {{ExtendedSequenceNumber.TRIM_HORIZON}} waiting for the parent to finish: > {code} > aws kinesis --region us-west-1 describe-stream --stream-name my-stream > { > "StreamDescription": { > "RetentionPeriodHours": 24, > "StreamName": "my-stream", > "Shards": [ > { > "ShardId": "shardId-0030", > "HashKeyRange": { > "EndingHashKey": > "10633823966279326983230456482242756606", > "StartingHashKey": "0" > }, > ... 
> }, > { > "ShardId": "shardId-0062", > "HashKeyRange": { > "EndingHashKey": "5316911983139663491615228241121378302", > "StartingHashKey": "0" > }, > "ParentShardId": "shardId-0030", > "SequenceNumberRange": { > "StartingSequenceNumber": > "49566806087883755242230188435465744452396445937434624994" > } > }, > { > "ShardId": "shardId-0063", > "HashKeyRange": { > "EndingHashKey": > "10633823966279326983230456482242756606", > "StartingHashKey": "5316911983139663491615228241121378303" > }, > "ParentShardId": "shardId-0030", > "SequenceNumberRange": { > "StartingSequenceNumber": > "49566806087906055987428719058607280170669094298940605426" > } > }, > ... > ], > "StreamStatus": "ACTIVE" > } > } > aws dynamodb --region us-west-1 scan --table-name my-processor > { > "Items": [ > { > "leaseOwner": { > "S":
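The KCL contract described in this report, that the record processor must checkpoint `SHARD_END` when its shard terminates after a split or merge, can be simulated without AWS. The names below are a hypothetical Python sketch, not the actual amazon-kinesis-client API; the point is the behavior Spark's `KinesisRecordProcessor` sometimes misses.

```python
SHARD_END = "SHARD_END"

class Checkpointer:
    """Stand-in for the KCL checkpointer; just records the last checkpoint."""
    def __init__(self):
        self.last = None

    def checkpoint(self, sequence_number):
        self.last = sequence_number

class RecordProcessor:
    """Sketch of the required shutdown behavior: when the shard ends
    (shutdown reason TERMINATE, e.g. after a resharding), the processor
    must checkpoint SHARD_END so the KCL can start the child shards."""
    def shutdown(self, checkpointer, reason):
        if reason == "TERMINATE":
            # Skipping this checkpoint is what produces the reported
            # "Application didn't checkpoint at end of shard" error and
            # leaves the child shards stuck at TRIM_HORIZON.
            checkpointer.checkpoint(SHARD_END)
        # For other reasons (e.g. lease lost / ZOMBIE), checkpointing
        # is not allowed, so do nothing.
```

A shutdown with reason TERMINATE records `SHARD_END`; any other reason leaves the checkpoint untouched, matching the asymmetry the KCL expects.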
[jira] [Commented] (SPARK-18697) Upgrade sbt plugins
[ https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717895#comment-15717895 ] Sean Owen commented on SPARK-18697: --- I merged SPARK-18696, but just to master. Let's do that to be more conservative. > Upgrade sbt plugins > --- > > Key: SPARK-18697 > URL: https://issues.apache.org/jira/browse/SPARK-18697 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Priority: Trivial > > For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt > plugins will be upgraded: > {code} > sbt-assembly: 0.11.2 -> 0.14.3 > sbteclipse-plugin: 4.0.0 -> 5.0.1 > sbt-mima-plugin: 0.1.11 -> 0.1.12 > org.ow2.asm/asm: 5.0.3 -> 5.1 > org.ow2.asm/asm-commons: 5.0.3 -> 5.1 > {code} > All other plugins are up-to-date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18638) Upgrade sbt, zinc and maven plugins
[ https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18638. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16069 [https://github.com/apache/spark/pull/16069] > Upgrade sbt, zinc and maven plugins > --- > > Key: SPARK-18638 > URL: https://issues.apache.org/jira/browse/SPARK-18638 > Project: Spark > Issue Type: Improvement > Components: Build > Reporter: Weiqing Yang > Priority: Minor > Fix For: 2.2.0 > > > v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date, and upgrade it from 0.13.11 to 0.13.13. The release notes since the last version > we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and > https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some > regression fixes. This jira will also update Zinc and Maven plugins. > {code} >sbt: 0.13.11 -> 0.13.13, >zinc: 0.3.9 -> 0.3.11, >maven-assembly-plugin: 2.6 -> 3.0.0 >maven-compiler-plugin: 3.5.1 -> 3.6. >maven-jar-plugin: 2.6 -> 3.0.2 >maven-javadoc-plugin: 2.10.3 -> 2.10.4 >maven-source-plugin: 2.4 -> 3.0.1 >org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12 >org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18638) Upgrade sbt, zinc and maven plugins
[ https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18638: -- Assignee: Weiqing Yang > Upgrade sbt, zinc and maven plugins > --- > > Key: SPARK-18638 > URL: https://issues.apache.org/jira/browse/SPARK-18638 > Project: Spark > Issue Type: Improvement > Components: Build > Reporter: Weiqing Yang > Assignee: Weiqing Yang > Priority: Minor > Fix For: 2.2.0 > > > v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date, and upgrade it from 0.13.11 to 0.13.13. The release notes since the last version > we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and > https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some > regression fixes. This jira will also update Zinc and Maven plugins. > {code} >sbt: 0.13.11 -> 0.13.13, >zinc: 0.3.9 -> 0.3.11, >maven-assembly-plugin: 2.6 -> 3.0.0 >maven-compiler-plugin: 3.5.1 -> 3.6. >maven-jar-plugin: 2.6 -> 3.0.2 >maven-javadoc-plugin: 2.10.3 -> 2.10.4 >maven-source-plugin: 2.4 -> 3.0.1 >org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12 >org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18697) Upgrade sbt plugins
[ https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18697: -- Priority: Trivial (was: Minor) OK, it's a little arbitrary to update SBT, zinc, and Maven plugins, but then SBT plugins separately. I don't care much either way though. I also think it's fine to push this sort of update into 2.1.x > Upgrade sbt plugins > --- > > Key: SPARK-18697 > URL: https://issues.apache.org/jira/browse/SPARK-18697 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Priority: Trivial > > For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt > plugins will be upgraded: > {code} > sbt-assembly: 0.11.2 -> 0.14.3 > sbteclipse-plugin: 4.0.0 -> 5.0.1 > sbt-mima-plugin: 0.1.11 -> 0.1.12 > org.ow2.asm/asm: 5.0.3 -> 5.1 > org.ow2.asm/asm-commons: 5.0.3 -> 5.1 > {code} > All other plugins are up-to-date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18696) Upgrade sbt plugins
[ https://issues.apache.org/jira/browse/SPARK-18696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18696. --- Resolution: Duplicate Target Version/s: (was: 2.2.0) > Upgrade sbt plugins > --- > > Key: SPARK-18696 > URL: https://issues.apache.org/jira/browse/SPARK-18696 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Priority: Minor > > For 2.2.x, it's better to make sbt plugins up-to-date. The following sbt > plugins will be upgraded: > {code} > sbt-assembly: 0.11.2 -> 0.14.3 > sbteclipse-plugin: 4.0.0 -> 5.0.1 > sbt-mima-plugin: 0.1.11 -> 0.1.12 > org.ow2.asm/asm: 5.0.3 -> 5.1 > org.ow2.asm/asm-commons: 5.0.3 -> 5.1 > {code} > All other plugins are up-to-date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18584) multiple Spark Thrift Servers running in the same machine throws org.apache.hadoop.security.AccessControlException
[ https://issues.apache.org/jira/browse/SPARK-18584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18584. --- Resolution: Not A Problem > multiple Spark Thrift Servers running in the same machine throws > org.apache.hadoop.security.AccessControlException > -- > > Key: SPARK-18584 > URL: https://issues.apache.org/jira/browse/SPARK-18584 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: hadoop-2.5.0-cdh5.2.1-och4.0.0 > spark2.0.2 >Reporter: tanxinz > > In Spark 2.0.2, I have two users (etl, dev) who each start a Spark Thrift Server on the > same machine. I connected via beeline to the etl STS to execute a command, and it > threw org.apache.hadoop.security.AccessControlException. I don't know why the command was > performed as the dev user, not etl. > ``` > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): > Permission denied: user=dev, access=EXECUTE, > inode="/user/hive/warehouse/tb_spark_sts/etl_cycle_id=20161122":etl:supergroup:drwxr-x---,group:etl:rwx,group:oth_dev:rwx,default:user:data_mining:r-x,default:group::rwx,default:group:etl:rwx,default:group:oth_dev:rwx,default:mask::rwx,default:other::--- > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkAccessAcl(DefaultAuthorizationProvider.java:335) > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:231) > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkTraverse(DefaultAuthorizationProvider.java:178) > at > org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:137) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6250) > at > 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3942) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:811) > at > org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getFileInfo(AuthorizationProviderProxyClientProtocol.java:502) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:815) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > ```
[jira] [Resolved] (SPARK-18685) Fix all tests in ExecutorClassLoaderSuite to pass on Windows
[ https://issues.apache.org/jira/browse/SPARK-18685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18685. --- Resolution: Fixed Fix Version/s: 2.0.3 2.1.1 Issue resolved by pull request 16116 [https://github.com/apache/spark/pull/16116] > Fix all tests in ExecutorClassLoaderSuite to pass on Windows > > > Key: SPARK-18685 > URL: https://issues.apache.org/jira/browse/SPARK-18685 > Project: Spark > Issue Type: Sub-task > Components: Spark Shell, Tests >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.1, 2.0.3 > > > There are two problems as below: > We should make the URI correct and {{BufferedSource}} from > {{Source.fromInputStream}} closed after opening them in the tests in > {{ExecutorClassLoaderSuite}}. Currently, these are leading to test failures > on Windows. > {code} > ExecutorClassLoaderSuite: > [info] - child first *** FAILED *** (78 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > ... > [info] - parent first *** FAILED *** (15 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > ... > [info] - child first can fall back *** FAILED *** (0 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > ... 
> [info] - child first can fail *** FAILED *** (0 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > ... > [info] - resource from parent *** FAILED *** (0 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > ... > [info] - resources from parent *** FAILED *** (0 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > {code} > {code} > [info] Exception encountered when attempting to run a suite with class name: > org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, > 333 milliseconds) > [info] java.io.IOException: Failed to delete: > C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2 > [info] at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010) > [info] at > org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76) > [info] at > org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
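The {{URISyntaxException}} failures above come from building a {{file://}} URI by string concatenation with a Windows path, so the drive letter and backslashes land in the URI authority. As an illustrative sketch (in Python rather than the suite's Scala; the path below is just the example from the logs), converting through a path API yields the well-formed URI that {{java.net.URI}} would accept:

```python
from pathlib import PureWindowsPath

# Naive concatenation: "C:" and backslashes end up in the authority
# component, which is exactly what java.net.URI rejects as
# "Illegal character in authority at index 7".
raw = "file://" + r"C:\projects\spark\target\tmp\spark-00b66070"
print(raw)   # file://C:\projects\spark\target\tmp\spark-00b66070

# Going through a path API instead produces an empty authority and
# forward slashes -- a well-formed file URI.
uri = PureWindowsPath(r"C:\projects\spark\target\tmp\spark-00b66070").as_uri()
print(uri)   # file:///C:/projects/spark/target/tmp/spark-00b66070
```

The analogous fix on the JVM side is to build the URI from a {{File}}/{{Path}} object rather than by prefixing the string, which is presumably what the pull request does.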
[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717855#comment-15717855 ] Sean Owen commented on SPARK-18581: --- [~invkrh] do you think there's still a problem here? > MultivariateGaussian does not check if covariance matrix is invertible > -- > > Key: SPARK-18581 > URL: https://issues.apache.org/jira/browse/SPARK-18581 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.2, 2.0.2 >Reporter: Hao Ren > > When training GaussianMixtureModel, I found some probabilities much larger than > 1. That led me to the fact that the value returned by > MultivariateGaussian.pdf can be 10^5, etc. > After reviewing the code, I found that the problem lies in the computation of the > determinant of the covariance matrix. > The computation is simplified by using the pseudo-determinant of a positive > definite matrix. > In my case, I have a feature = 0 for all data points. > As a result, the covariance matrix is not invertible <=> det(covariance matrix) = > 0 => the pseudo-determinant will be very close to zero. > Thus, log(pseudo-determinant) will be a large negative number, which finally > makes logpdf very large, and pdf even larger, > 1. > As said in the comments of MultivariateGaussian.scala, > """ > Singular values are considered to be non-zero only if they exceed a tolerance > based on machine precision. > """ > But if a singular value is considered to be zero, that means the covariance matrix > is non-invertible, which contradicts the assumption that it should > be invertible. > So we should check whether any singular value is smaller than the tolerance > before computing the pseudo-determinant
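The blow-up described above is easy to reproduce in one dimension, where the variance plays the role of the covariance determinant (a minimal sketch, not Spark's MultivariateGaussian code): as the variance approaches zero, the normalizing constant 1/sqrt(2*pi*var) exceeds 1 and the density at the mean explodes.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) -- the 1-D analogue of the multivariate pdf,
    with `var` standing in for the covariance (pseudo-)determinant."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

# With a healthy variance, the density at the mean stays below 1.
print(gaussian_pdf(0.0, 0.0, 1.0))    # ~0.3989

# As the variance (determinant) approaches zero -- the near-singular
# covariance case described in the report -- the density blows up past 1.
print(gaussian_pdf(0.0, 0.0, 1e-4))   # ~39.89
```

A density above 1 is not itself a bug (densities are not probabilities), but it signals that the covariance is effectively singular and the pseudo-determinant path should be guarded by the tolerance check the reporter proposes.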
[jira] [Updated] (SPARK-18685) Fix all tests in ExecutorClassLoaderSuite to pass on Windows
[ https://issues.apache.org/jira/browse/SPARK-18685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18685: -- Assignee: Hyukjin Kwon > Fix all tests in ExecutorClassLoaderSuite to pass on Windows > > > Key: SPARK-18685 > URL: https://issues.apache.org/jira/browse/SPARK-18685 > Project: Spark > Issue Type: Sub-task > Components: Spark Shell, Tests >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.0.3, 2.1.1 > > > There are two problems as below: > We should make the URI correct and {{BufferedSource}} from > {{Source.fromInputStream}} closed after opening them in the tests in > {{ExecutorClassLoaderSuite}}. Currently, these are leading to test failures > on Windows. > {code} > ExecutorClassLoaderSuite: > [info] - child first *** FAILED *** (78 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > ... > [info] - parent first *** FAILED *** (15 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > ... > [info] - child first can fall back *** FAILED *** (0 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > ... 
> [info] - child first can fail *** FAILED *** (0 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > ... > [info] - resource from parent *** FAILED *** (0 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > ... > [info] - resources from parent *** FAILED *** (0 milliseconds) > [info] java.net.URISyntaxException: Illegal character in authority at index > 7: > file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b > [info] at java.net.URI$Parser.fail(URI.java:2848) > [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) > {code} > {code} > [info] Exception encountered when attempting to run a suite with class name: > org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, > 333 milliseconds) > [info] java.io.IOException: Failed to delete: > C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2 > [info] at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010) > [info] at > org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76) > [info] at > org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18586) netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 and CVE-2014-0193
[ https://issues.apache.org/jira/browse/SPARK-18586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18586: -- Assignee: Sean Owen Priority: Minor (was: Major) I don't think the CVE actually affected Spark, as Netty 3 isn't directly used, but I updated it anyway. > netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 and CVE-2014-0193 > > > Key: SPARK-18586 > URL: https://issues.apache.org/jira/browse/SPARK-18586 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: meiyoula >Assignee: Sean Owen >Priority: Minor > Fix For: 2.2.0 > >
[jira] [Resolved] (SPARK-18586) netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 and CVE-2014-0193
[ https://issues.apache.org/jira/browse/SPARK-18586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18586. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16102 [https://github.com/apache/spark/pull/16102] > netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 and CVE-2014-0193 > > > Key: SPARK-18586 > URL: https://issues.apache.org/jira/browse/SPARK-18586 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: meiyoula > Fix For: 2.2.0 > >
[jira] [Assigned] (SPARK-18678) Skewed feature subsampling in Random forest
[ https://issues.apache.org/jira/browse/SPARK-18678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18678: Assignee: (was: Apache Spark) > Skewed feature subsampling in Random forest > --- > > Key: SPARK-18678 > URL: https://issues.apache.org/jira/browse/SPARK-18678 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.2 >Reporter: Bjoern Toldbod > > The feature subsampling performed in the RandomForest-implementation from > org.apache.spark.ml.tree.impl.RandomForest > is performed using SamplingUtils.reservoirSampleAndCount > The implementation of the sampling skews feature selection in favor of > features with a higher index. > The skewness is smaller for a large number of features, but completely > dominates the feature selection for a small number of features. The extreme > case is when the number of features is 2 and number of features to select is > 1. > In this case the feature sampling will always pick feature 1 and ignore > feature 0. > Of course this produces low quality models for few features when using > subsampling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18678) Skewed feature subsampling in Random forest
[ https://issues.apache.org/jira/browse/SPARK-18678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717808#comment-15717808 ] Apache Spark commented on SPARK-18678: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/16129 > Skewed feature subsampling in Random forest > --- > > Key: SPARK-18678 > URL: https://issues.apache.org/jira/browse/SPARK-18678 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.2 >Reporter: Bjoern Toldbod > > The feature subsampling performed in the RandomForest-implementation from > org.apache.spark.ml.tree.impl.RandomForest > is performed using SamplingUtils.reservoirSampleAndCount > The implementation of the sampling skews feature selection in favor of > features with a higher index. > The skewness is smaller for a large number of features, but completely > dominates the feature selection for a small number of features. The extreme > case is when the number of features is 2 and number of features to select is > 1. > In this case the feature sampling will always pick feature 1 and ignore > feature 0. > Of course this produces low quality models for few features when using > subsampling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18678) Skewed feature subsampling in Random forest
[ https://issues.apache.org/jira/browse/SPARK-18678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18678: Assignee: Apache Spark > Skewed feature subsampling in Random forest > --- > > Key: SPARK-18678 > URL: https://issues.apache.org/jira/browse/SPARK-18678 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.2 >Reporter: Bjoern Toldbod >Assignee: Apache Spark > > The feature subsampling performed in the RandomForest-implementation from > org.apache.spark.ml.tree.impl.RandomForest > is performed using SamplingUtils.reservoirSampleAndCount > The implementation of the sampling skews feature selection in favor of > features with a higher index. > The skewness is smaller for a large number of features, but completely > dominates the feature selection for a small number of features. The extreme > case is when the number of features is 2 and number of features to select is > 1. > In this case the feature sampling will always pick feature 1 and ignore > feature 0. > Of course this produces low quality models for few features when using > subsampling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
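For reference, a textbook Algorithm R reservoir sampler (a hypothetical Python stand-in for SamplingUtils.reservoirSampleAndCount, not Spark's actual implementation) is unbiased even in the extreme 2-features/select-1 case the report describes: each feature should be chosen about half the time, so any sampler that always picks feature 1 is demonstrably broken.

```python
import random

def reservoir_sample(items, k, rng):
    """Textbook Algorithm R: returns an unbiased size-k sample of `items`."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # uniform over 0..i inclusive
            if j < k:
                reservoir[j] = item      # replace with probability k/(i+1)
    return reservoir

rng = random.Random(42)
# Select 1 of 2 "feature indices" many times: a correct sampler picks
# feature 0 roughly half the time, never zero times as in the reported skew.
picks = [reservoir_sample([0, 1], 1, rng)[0] for _ in range(10_000)]
print(picks.count(0) / len(picks))  # close to 0.5
```

Comparing the selection frequencies from the actual Scala implementation against this baseline would confirm (and quantify) the skew toward higher feature indices.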
[jira] [Commented] (SPARK-18349) Update R API documentation on ml model summary
[ https://issues.apache.org/jira/browse/SPARK-18349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717726#comment-15717726 ] Felix Cheung commented on SPARK-18349: -- [~wangmiao1981] Please do, thanks! Since we have some questions, it would be great if you could propose the approach and we could discuss a bit here. > Update R API documentation on ml model summary > -- > > Key: SPARK-18349 > URL: https://issues.apache.org/jira/browse/SPARK-18349 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > It has been discovered that there is a fair bit of inconsistency in the > documentation of summary functions, e.g. > {code} > #' @return \code{summary} returns a summary object of the fitted model, a > list of components > #' including formula, number of features, list of features, feature > importances, number of > #' trees, and tree weights > setMethod("summary", signature(object = "GBTRegressionModel") > {code} > For instance, what should be listed for the return value? Should it be a name > or a phrase, or should it be a list of items; and should there be a longer > description of what they mean, or a reference link to the Scala doc? > We will need to review this for all model summary implementations in mllib.R
[jira] [Commented] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717723#comment-15717723 ] Felix Cheung commented on SPARK-17822: -- From what [~josephkb] observed and described, I suspect this is a case of small pointers in R holding much larger objects in the JVM. If the memory footprint of the pointer in R is very small, chances are that even after thousands of iterations the memory consumption in R is still not high enough to trigger a GC. If we have a repro, calling gc() or gcinfo(TRUE) should tell us about memory consumption as it grows. I'm not sure about the previous attempt to mitigate this with WeakReference though - since we don't know which of the R objects are still being referenced, once we remove the JVM object the R pointer could become a dangling pointer. Perhaps this could also be helped by increasing the aggressiveness of the R GC: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory.htm http://adv-r.had.co.nz/memory.html#gc > JVMObjectTracker.objMap may leak JVM objects > > > Key: SPARK-17822 > URL: https://issues.apache.org/jira/browse/SPARK-17822 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai >Assignee: Xiangrui Meng > Attachments: screenshot-1.png > > > JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we > observed that JVM objects that are no longer used are still trapped in this > map, which prevents those objects from being GCed. > It seems to make sense to use weak references (like persistentRdds in > SparkContext).
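The persistentRdds-style fix can be sketched with Python's weakref module (a hedged analogue of the JVMObjectTracker idea, not Spark's actual Scala code): a weak-valued map drops an entry as soon as nothing else references the object, while a plain dict pins it forever, which is the leak described above.

```python
import gc
import weakref

class Tracked:
    """Stand-in for a JVM-side object held on behalf of an R client."""

strong_map = {}                            # like objMap today: pins its objects
weak_map = weakref.WeakValueDictionary()   # like persistentRdds: lets GC reclaim

a, b = Tracked(), Tracked()
strong_map["a"] = a
weak_map["b"] = b

del a, b       # the "R side" drops its last references
gc.collect()   # force a collection pass

print("a" in strong_map)  # True  -- the strong map keeps its object alive
print("b" in weak_map)    # False -- reclaimed once nothing else references it
```

This also makes the dangling-pointer concern concrete: after collection, a client that still holds the id "b" would get a lookup miss, so the tracker needs a story for handles that outlive their objects.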