[jira] [Comment Edited] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572976#comment-15572976
 ] 

Alessio edited comment on SPARK-15565 at 10/13/16 7:49 PM:
---

Yes Sean, indeed in my latest issue SPARK-17918 I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". However, I have noticed that SPARK-17918 is a duplicate of 
SPARK-17810, and I'm glad this will be fixed.


was (Author: purple):
Yes Sean, indeed in my latest issue SPARK-17918 I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". However, I have noticed that my issue is a duplicate of 
SPARK-17810, and I'm glad this will be fixed.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is 
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to the local filesystem.
> This should be a one-line change (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508
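
For illustration only, a minimal sketch (not the actual Spark patch; the object and method names are made up) of how a default derived from {{System.getProperty("user.dir")}} can be given an explicit local-filesystem scheme instead of being left to the Hadoop default filesystem:

{code:scala}
// Illustrative sketch only - not the real SQLConf change. It converts a plain
// local path into a URI with an explicit "file:" scheme, so the path is not
// resolved against the Hadoop default filesystem (which may be HDFS).
import java.io.File

object WarehouseDefaultSketch {
  def defaultWarehouseDir: String = {
    val local = new File(System.getProperty("user.dir"), "warehouse")
    // toURI yields e.g. "file:/home/me/project/warehouse/"
    local.toURI.toString
  }

  def main(args: Array[String]): Unit =
    println(defaultWarehouseDir)
}
{code}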



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572976#comment-15572976
 ] 

Alessio commented on SPARK-15565:
-

Yes Sean, indeed in my latest issue SPARK-17918 I was referring to this 
specific issue with the sentence "This was fixed in 2.0.0, as previous issues 
have reported". However, I have noticed that my issue is a duplicate of 
SPARK-17810, and I'm glad this will be fixed.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is 
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to the local filesystem.
> This should be a one-line change (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues have reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:{color}
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*

{color:red}Update #2:{color}
In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
Everything's default.
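
For reference, a minimal workaround sketch (assuming the Spark 2.0.x SparkSession builder API; the application name and warehouse folder below are made-up examples), which points {{spark.sql.warehouse.dir}} at an explicit local {{file:}} URI so the default database is not created under hdfs://localhost:9000:

{code:scala}
// Possible workaround sketch, not an official fix: pass an explicit "file:"
// URI for spark.sql.warehouse.dir when building the session.
import org.apache.spark.sql.SparkSession

object LocalWarehouseSketch {
  def main(args: Array[String]): Unit = {
    // Made-up folder name; any writable local directory should work.
    val warehouse = new java.io.File("spark-warehouse").getAbsoluteFile.toURI.toString

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("local-warehouse-sketch")
      .config("spark.sql.warehouse.dir", warehouse) // explicit file: scheme
      .getOrCreate()

    println(spark.conf.get("spark.sql.warehouse.dir"))
    spark.stop()
  }
}
{code}

The same key can also be set from PySpark via SparkSession.builder.config(...) or on the command line with --conf spark.sql.warehouse.dir=file:/..., since it is an ordinary Spark SQL configuration.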

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:{color}
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*

{color:red}Update #2:{color}
In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
Everything's default.


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues have reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:{color}
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
> {color:red}Update #2:{color}
> In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
> Everything's default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:{color}
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*

{color:red}Update #2:{color}
In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
Everything's default.

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
{color}


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:{color}
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
> {color:red}Update #2:{color}
> In both Spark 2.0.0 and 2.0.1 I didn't edit any config file and the like. 
> Everything's default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 
'file:/Users/Purple/Documents/YARNprojects/Spark_K-MEANS/version_postgreSQL/spark-warehouse'.*


  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*




> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> Update #1:
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 
> 'file:/Users/Purple/Documents/YARNprojects/Spark_K-MEANS/version_postgreSQL/spark-warehouse'.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*


  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 
'file:/Users/Purple/Documents/YARNprojects/Spark_K-MEANS/version_postgreSQL/spark-warehouse'.*



> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> Update #1:
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

{color:red}Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
{color}

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*

Update #1:
I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
that 
*16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*



> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*
> {color:red}Update #1:
> I was able to reinstall Spark 2.0.0 and the first INFO message clearly states 
> that 
> *16/10/13 21:06:59 INFO internal.SharedState: Warehouse path is 'file:/<local FS folder>/spark-warehouse'.*
> {color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse*



  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
*: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory* *hdfs://localhost:9000/user/hive/warehouse*



  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


*16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.*

*py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse*




> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but just locally.
> *16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.*
> *py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.*
> *: org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory* *hdfs://localhost:9000/user/hive/warehouse*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Environment: Mac OS X 10.11.6  (was: Macintosh)

> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Mac OS X 10.11.6
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but just locally.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
Spark 2.0.0 used to create the spark-warehouse folder within the current 
directory (which was good) and didn't complain about such weird paths, 
especially since I'm not using Spark through HDFS, but just locally.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors: 
> Spark 2.0.0 used to create the spark-warehouse folder within the current 
> directory (which was good) and didn't complain about such weird paths, 
> especially since I'm not using Spark through HDFS, but just locally.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehouse location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Summary: Default Warehouse location apparently in HDFS   (was: Default 
Warehause location apparently in HDFS )

> Default Warehouse location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehause location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

`16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.`

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehause location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1.
> `16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.`
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17918) Default Warehause location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-17918:

Description: 
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.


16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse

  was:
It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

`16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.`

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse


> Default Warehause location apparently in HDFS 
> --
>
> Key: SPARK-17918
> URL: https://issues.apache.org/jira/browse/SPARK-17918
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Macintosh
>Reporter: Alessio
>
> It seems that the default warehouse location in Spark 2.0.1 not only points 
> at a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
> first INFO message - but that folder is then also resolved against HDFS - see 
> the error.
> This was fixed in 2.0.0, as previous issues reported, but appears again in 
> 2.0.1. Indeed some scripts I was able to run in 2.0.0 now throw such errors.
> 16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
> '/user/hive/warehouse'.
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
> : org.apache.spark.SparkException: Unable to create database default as 
> failed to create its directory hdfs://localhost:9000/user/hive/warehouse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17918) Default Warehause location apparently in HDFS

2016-10-13 Thread Alessio (JIRA)
Alessio created SPARK-17918:
---

 Summary: Default Warehause location apparently in HDFS 
 Key: SPARK-17918
 URL: https://issues.apache.org/jira/browse/SPARK-17918
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1
 Environment: Macintosh
Reporter: Alessio


It seems that the default warehouse location in Spark 2.0.1 not only points at 
a nonexistent folder on Macintosh systems (/user/hive/warehouse) - see the 
first INFO message - but that folder is then also resolved against HDFS - see 
the error.

This was fixed in 2.0.0, as previous issues reported, but appears again in 
2.0.1.

16/10/13 20:47:36 INFO internal.SharedState: Warehouse path is 
'/user/hive/warehouse'.

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: org.apache.spark.SparkException: Unable to create database default as failed 
to create its directory hdfs://localhost:9000/user/hive/warehouse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem

2016-10-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572759#comment-15572759
 ] 

Alessio commented on SPARK-15565:
-

Same problem happened again in Spark 2.0.1.

> The default value of spark.sql.warehouse.dir needs to explicitly point to 
> local filesystem
> --
>
> Key: SPARK-15565
> URL: https://issues.apache.org/jira/browse/SPARK-15565
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The default value of {{spark.sql.warehouse.dir}} is 
> {{System.getProperty("user.dir")/warehouse}}. Since 
> {{System.getProperty("user.dir")}} is a local dir, we should explicitly set 
> the scheme to the local filesystem.
> This should be a one-line change (at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see 
> https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark

2016-09-16 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496312#comment-15496312
 ] 

Alessio commented on SPARK-2352:


Dear Sean, I am well aware of the Multilayer Perceptron classifier.
But you must agree with me that the MLP is just a small branch of the ANN 
world. I reckon that's what the OP wanted to stress: not just MLPs or 
feedforward NNs, but also recursive networks, Boltzmann Machines and so on...

> [MLLIB] Add Artificial Neural Network (ANN) to Spark
> 
>
> Key: SPARK-2352
> URL: https://issues.apache.org/jira/browse/SPARK-2352
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
> Environment: MLLIB code
>Reporter: Bert Greevenbosch
>Assignee: Bert Greevenbosch
>
> It would be good if the Machine Learning Library contained Artificial Neural 
> Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark

2016-09-12 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485130#comment-15485130
 ] 

Alessio commented on SPARK-2352:


Pretty strange that an issue with so much interest is still "In progress" 
after 1 year.
If Apache Spark does not (or does not want to?) include your ANNs, could you 
consider releasing them as an independent toolbox?

> [MLLIB] Add Artificial Neural Network (ANN) to Spark
> 
>
> Key: SPARK-2352
> URL: https://issues.apache.org/jira/browse/SPARK-2352
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
> Environment: MLLIB code
>Reporter: Bert Greevenbosch
>Assignee: Bert Greevenbosch
>
> It would be good if the Machine Learning Library contained Artificial Neural 
> Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-09-12 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485131#comment-15485131
 ] 

Alessio commented on SPARK-5575:


Pretty strange that an issue with so much interest is still "In progress" 
after 1 year.
If Apache Spark does not (or does not want to?) include your ANNs, could you 
consider releasing them as an independent toolbox?

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> *Goal:* Implement various types of artificial neural networks
> *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
> Having deep learning within Spark's ML library is a question of convenience. 
> Spark has broad analytic capabilities and it is useful to have deep learning 
> as one of these tools at hand. Deep learning is a model of choice for several 
> important modern use-cases, and Spark ML might want to cover them. 
> In the end, it is hard to explain why we have PCA in ML but don't provide an 
> Autoencoder. To summarize this, Spark should have at least the most widely 
> used deep learning models, such as fully connected artificial neural network, 
> convolutional network and autoencoder. Advanced and experimental deep 
> learning features might reside within packages or as pluggable external 
> tools. These 3 will provide a comprehensive deep learning set for Spark ML. 
> We might also include recurrent networks as well.
> *Requirements:*
> # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
> Layer, Error, Regularization, Forward and Backpropagation etc. should be 
> implemented as traits or interfaces, so they can be easily extended or 
> reused. Define the Spark ML API for deep learning. This interface is similar 
> to the other analytics tools in Spark and supports ML pipelines. This makes 
> deep learning easy to use and plug in into analytics workloads for Spark 
> users. 
> # Efficiency. The current implementation of multilayer perceptron in Spark is 
> less than 2x slower than Caffe, both measured on CPU. The main overhead 
> sources are JVM and Spark's communication layer. For more details, please 
> refer to https://github.com/avulanov/ann-benchmark. Having said that, the 
> efficient implementation of deep learning in Spark should be only a few times 
> slower than a specialized tool. This is very reasonable for a platform 
> that does much more than deep learning and I believe it is understood by the 
> community.
> # Scalability. Implement efficient distributed training. It relies heavily on 
> the efficient communication and scheduling mechanisms. The default 
> implementation is based on Spark. More efficient implementations might 
> include some external libraries but use the same interface defined.
> *Main features:* 
> # Multilayer perceptron classifier (MLP)
> # Autoencoder
> # Convolutional neural networks for computer vision. The interface has to 
> provide a few architectures for deep learning that are widely used in practice, 
> such as AlexNet
> *Additional features:*
> # Other architectures, such as Recurrent neural network (RNN), Long short-term 
> memory (LSTM), Restricted Boltzmann machine (RBM), deep belief network 
> (DBN), MLP multivariate regression
> # Regularizers, such as L1, L2, drop-out
> # Normalizers
> # Network customization. The internal API of Spark ANN is designed to be 
> flexible and can handle different types of layers. However, only a part of 
> the API is made public. We have to limit the number of public classes in 
> order to make it simpler to support other languages. This forces us to use 
> (String or Number) parameters instead of introducing of new public classes. 
> One of the options to specify the architecture of ANN is to use text 
> configuration with layer-wise description. We have considered using Caffe 
> format for this. It gives the benefit of compatibility with well known deep 
> learning tool and simplifies the support of other languages in Spark. 
> Implementation of a parser for the subset of Caffe format might be the first 
> step towards the support of general ANN architectures in Spark. 
> # Hardware specific optimization. One can wrap other deep learning 
> implementations with this interface allowing users to pick a particular 
> back-end, e.g. Caffe or TensorFlow, along with the default one. The interface 
> has to provide a few architectures for deep learning that are widely used in 
> practice, such as AlexNet. The main motivation for using specialized 
> libraries for deep learning would be to fully take advantage of the hardware 
> where Spark runs, in particular GPUs. Having the default 

[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-21 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means - after the number of 
iterations, the cost function value and the running time are printed - there's 
a nice "Removing RDD <n> from persistence list" stage. However, during this 
stage there's high memory pressure. Weird, since the RDDs are about to be 
removed. Full log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with the Spark context as local[*]. 
My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application through spark-submit with --driver-memory 9G.

_Further test #1:_ the problem also appears without persisting/caching in 
memory (i.e. persist on disk only, or no caching/persisting at all).
_Further test #2:_ changing "spark.storage.memoryFraction" doesn't help either.
_Further test #3:_ lowering the driver memory results in an out-of-memory 
error. The out-of-memory error pops up during the KMeans.train call.
_Further test #4:_ tried running on a standalone cluster in order to balance 
the memory load; out-of-memory error again.
_Further test #5:_ this is most likely to happen when the number of features is 
large: I was indeed able to run something like K=13000 on a dataset with 129 
features.
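
For reference, a minimal sketch of the setup described above (Spark 1.6-era MLlib API; the input path, k and iteration count are made-up placeholders), submitted with something like spark-submit --driver-memory 9G --master local[*]:

{code:scala}
// Minimal sketch of the reported setup (placeholder path and parameters):
// MLlib K-means on an RDD persisted to memory and disk. The master and the
// driver memory are expected to come from spark-submit.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

object KMeansMemorySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-memory-sketch"))

    // Placeholder input: one whitespace-separated feature vector per line,
    // read into 12 partitions as in the report.
    val data = sc.textFile("data/features.txt", 12)
      .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Large k together with many features is the regime where the memory
    // pressure was observed.
    val model = KMeans.train(data, 2000, 20)
    println(s"Cost: ${model.computeCost(data)}")

    // Unpersisting triggers the "Removing RDD ... from persistence list" messages.
    data.unpersist()
    sc.stop()
  }
}
{code}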

  was:
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means - after the number of 
iterations, the cost function value and the running time are printed - there's 
a nice "Removing RDD <n> from persistence list" stage. However, during this 
stage there's high memory pressure. Weird, since the RDDs are about to be 
removed. Full log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further test #1:_ the problem appears also without persisting/caching on 
memory (i.e. persist on disk only or no caching/persisting at all).
_Further test #2:_ changing "spark.storage.memoryFraction" doesn't help as well.
_Further test #3:_ lowering the driver memory will result in an Out-of-memory 
error. The out-of-memory error pops out during the Kmeanstrain call.
_Further test #4:_ tried running on a standalone cluster in order to balance 
the memory load, out-of-memory error.


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 

[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-21 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further test #1:_ the problem appears also without persisting/caching on 
memory (i.e. persist on disk only or no caching/persisting at all).
_Further test #2:_ changing "spark.storage.memoryFraction" doesn't help as well.
_Further test #3:_ lowering the driver memory will result in an Out-of-memory 
error. The out-of-memory error pops out during the Kmeanstrain call.
_Further test #4:_ tried running on a standalone cluster in order to balance 
the memory load, out-of-memory error.
_Further test #5:_ this is most likely to happen when the number of features is 
large: I was indeed able to run something like K=13000 on a dataset with 129 
features.

  was:
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further test #1:_ the problem appears also without persisting/caching on 
memory (i.e. persist on disk only or no caching/persisting at all).
_Further test #2:_ changing "spark.storage.memoryFraction" doesn't help as well.
_Further test #3:_ lowering the driver memory will result in an Out-of-memory 
error. The out-of-memory error pops out during the Kmeanstrain call.
_Further test #4:_ tried running on a standalone cluster in order to balance 
the memory load, out-of-memory error.
_Further test #4:_ this is most likely to happen when the number of features is 
large: I was indeed able to run something like K=13000 on a dataset with 129 
features.


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: 

[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-21 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further test #1:_ the problem appears also without persisting/caching on 
memory (i.e. persist on disk only or no caching/persisting at all).
_Further test #2:_ changing "spark.storage.memoryFraction" doesn't help as well.
_Further test #3:_ lowering the driver memory will result in an Out-of-memory 
error. The out-of-memory error pops out during the Kmeanstrain call.
_Further test #4:_ tried running on a standalone cluster in order to balance 
the memory load, out-of-memory error.

  was:
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further test #1:_ the problem appears also without persisting/caching on 
memory (i.e. persist on disk only or no caching/persisting at all).
_Further test #2:_ changing "spark.storage.memoryFraction" doesn't help as well.
_Further test #3:_ lowering the driver memory will result in an Out-of-memory 
error.
_Further test #4:_ tried running on a standalone cluster in order to balance 
the memory load, out-of-memory error.


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> *Please Note*: even though the issue has been marked as "not a problem" and 
> "resolved", this is actually a 

[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-21 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further test #1:_ the problem appears also without persisting/caching on 
memory (i.e. persist on disk only or no caching/persisting at all).
_Further test #2:_ changing "spark.storage.memoryFraction" doesn't help as well.
_Further test #3:_ lowering the driver memory will result in an Out-of-memory 
error.
_Further test #4:_ tried running on a standalone cluster in order to balance 
the memory load, out-of-memory error.

  was:
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further test #1:_ the problem appears also without persisting/caching on 
memory (i.e. persist on disk only or no caching/persisting at all).
_Further test #2:_ changing "spark.storage.memoryFraction" doesn't help as well.
_Further test #3:_ lowering the driver memory will result in an Out-of-memory 
error.


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> *Please Note*: even though the issue has been marked as "not a problem" and 
> "resolved", this is actually a problem and wasn't resolved at all. Several 
> people encountered memory issues using MLlib for large and complex problems 
> (see 
> 

[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-16 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333753#comment-15333753
 ] 

Alessio commented on SPARK-15904:
-

If you're so absolutely sure that I'm missing something, I'm ready to hear your 
expert opinion.
You've got 16GB of RAM, a 400MB dataset and a driver with 4 cores. K=9120. I've 
already told you that 4GB and 8GB result in an out-of-memory error, and 9GB 
results in this unexpected behaviour. How would you tune the driver memory?
Answers like "your code might be the problem" (which it is not, of course) and 
"there's something wrong with the memory setup" (how enlightening!) are way too 
easy.
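
Purely as a sizing illustration (the dimensionality d below is an assumption, 
not a figure stated in this thread): the driver holds the K dense cluster 
centers, so their footprint alone is K * d * 8 bytes, before counting whatever 
the k-means|| initialization collects.

K = 9120
d = 5000                       # assumed number of features, for illustration only
bytes_per_double = 8
centers_mb = K * d * bytes_per_double / 1024.0 / 1024.0
print("%.0f MB just for the K cluster centers" % centers_mb)  # ~348 MB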

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> *Please Note*: even though the issue has been marked as "not a problem" and 
> "resolved", this is actually a problem and wasn't resolved at all. Several 
> people encountered memory issues using MLlib for large and complex problems 
> (see 
> http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
>  and 
> http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G.
> _Further test #1:_ the problem appears also without persisting/caching on 
> memory (i.e. persist on disk only or no caching/persisting at all).
> _Further test #2:_ changing "spark.storage.memoryFraction" doesn't help as 
> well.
> _Further test #3:_ lowering the driver memory will result in an Out-of-memory 
> error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-16 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further test #1:_ the problem appears also without persisting/caching on 
memory (i.e. persist on disk only or no caching/persisting at all).
_Further test #2:_ changing "spark.storage.memoryFraction" doesn't help as well.
_Further test #3:_ lowering the driver memory will result in an Out-of-memory 
error.

  was:
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further test #1:_ the problem appears also without persisting/caching on 
memory (i.e. persist on disk only or no caching/persisting at all).
_Further test #2:_ changing "spark.storage.memoryFraction" doesn't help as well.


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> *Please Note*: even though the issue has been marked as "not a problem" and 
> "resolved", this is actually a problem and wasn't resolved at all. Several 
> people encountered memory issues using MLlib for large and complex problems 
> (see 
> http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
>  and 
> http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 

[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-16 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further test #1:_ the problem appears also without persisting/caching on 
memory (i.e. persist on disk only or no caching/persisting at all).
_Further test #2:_ changing "spark.storage.memoryFraction" doesn't help as well.

  was:
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Though I 
reopened it. Several people encountered memory issues using MLlib for large and 
complex problems (see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further tests:_ the problem appears also without persisting/caching on memory 
(i.e. persist on disk only or no caching/persisting at all)


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> *Please Note*: even though the issue has been marked as "not a problem" and 
> "resolved", this is actually a problem and wasn't resolved at all. Several 
> people encountered memory issues using MLlib for large and complex problems 
> (see 
> http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
>  and 
> http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running 

[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-16 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Though I 
reopened it. Several people encountered memory issues using MLlib for large and 
complex problems (see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G.

_Further tests:_ the problem appears also without persisting/caching on memory 
(i.e. persist on disk only or no caching/persisting at all)

  was:
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Though I 
reopened it. Several people encountered memory issues using MLlib for large and 
complex problems (see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> *Please Note*: even though the issue has been marked as "not a problem" and 
> "resolved", this is actually a problem and wasn't resolved at all. Though I 
> reopened it. Several people encountered memory issues using MLlib for large 
> and complex problems (see 
> http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
>  and 
> http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full 

[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-16 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Though I 
reopened it. Several people encountered memory issues using MLlib for large and 
complex problems (see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G

  was:
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> *Please Note*: even though the issue has been marked as "not a problem" and 
> "resolved", this is actually a problem and wasn't resolved at all. Though I 
> reopened it. Several people encountered memory issues using MLlib for large 
> and complex problems (see 
> http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
>  and 
> http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 

[jira] [Reopened] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-16 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio reopened SPARK-15904:
-

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> *Please Note*: even though the issue has been marked as "not a problem" and 
> "resolved", this is actually a problem and wasn't resolved at all. Several 
> people encountered memory issues using MLlib for large and complex problems 
> (see 
> http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
>  and 
> http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-16 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see 
http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
 and 
http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G

  was:
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see )

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> *Please Note*: even though the issue has been marked as "not a problem" and 
> "resolved", this is actually a problem and wasn't resolved at all. Several 
> people encountered memory issues using MLlib for large and complex problems 
> (see 
> http://stackoverflow.com/questions/32621267/spark-1-4-0-hangs-running-randomforest
>  and 
> http://stackoverflow.com/questions/27367804/how-do-i-get-spark-submit-to-close)
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO 

[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-16 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
*Please Note*: even though the issue has been marked as "not a problem" and 
"resolved", this is actually a problem and wasn't resolved at all. Several 
people encountered memory issues using MLlib for large and complex problems 
(see )

Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G

  was:
Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> *Please Note*: even though the issue has been marked as "not a problem" and 
> "resolved", this is actually a problem and wasn't resolved at all. Several 
> people encountered memory issues using MLlib for large and complex problems 
> (see )
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-14 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330054#comment-15330054
 ] 

Alessio commented on SPARK-15904:
-

For the record, and just to confirm that the problem is not my code: I've also 
run K-means interactively, with its default centroid initialization 
(k-means||). Same dataset, K=9120: out-of-memory error.
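
For context, a minimal PySpark sketch of such an interactive run relying on the 
default k-means|| initialization; the dataset path is a placeholder, and 
maxIterations/epsilon are carried over from the script posted later in this 
thread, so treat this as an assumption rather than the exact code that was run:

from numpy import array
from pyspark import SparkContext, StorageLevel
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "Spark K-Means interactive")
data = sc.textFile("<dataset path>")  # placeholder, not the original path
# drop the leading ID column, as in the original script
parsed = data.map(lambda line: array([float(x) for x in line.split(',')])[1:])
parsed.persist(StorageLevel.MEMORY_AND_DISK)

# initializationMode defaults to "k-means||" in pyspark.mllib
model = KMeans.train(parsed, 9120, maxIterations=2000, runs=1, epsilon=0.0)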

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327542#comment-15327542
 ] 

Alessio commented on SPARK-15904:
-

With the --driver-memory 4G switch I've tried both, with no luck. First I 
changed the storage level to serialized, then I also increased the number of 
partitions (from the default 12 to 20). Still "out of memory". I guess I'll 
wait for 2.0.
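
A minimal sketch of the two changes described above (serialized storage level 
and more partitions), assuming the persisted RDD is the parsedDataNOID from the 
script posted elsewhere in this thread:

from pyspark import StorageLevel

# go from the default 12 partitions up to 20
repartitioned = parsedDataNOID.repartition(20)
# serialized storage level instead of plain MEMORY_AND_DISK
repartitioned.persist(StorageLevel.MEMORY_AND_DISK_SER)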

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327476#comment-15327476
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 2:48 PM:
--

If anyone's interested, the dataset I'm working on is freely available from the 
UCI ML Repository 
(http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities).

I've just tried running the above K-Means for K=9120 with --driver-memory 4G. 
The full traceback can be found here (https://ghostbin.com/paste/9pu9k).

The code is absolutely simple; I don't think there's anything wrong with it:

import numpy
import scipy.io
from numpy import array
from pyspark import SparkContext, StorageLevel
from pyspark.mllib.clustering import KMeans, KMeansModel

sc = SparkContext("local[*]", "Spark K-Means")
data = sc.textFile()  # dataset path omitted in the original report
parsedData = data.map(lambda line: array([float(x) for x in line.split(',')]))
parsedDataNOID = parsedData.map(lambda pattern: pattern[1:])  # drop the leading ID column
parsedDataNOID.persist(StorageLevel.MEMORY_AND_DISK)

K_CANDIDATES =  # list of candidate K values omitted in the original report

initCentroids = scipy.io.loadmat(<.mat file with initial seeds>)
datatmp = numpy.genfromtxt(, delimiter=",")  # data file path omitted in the original report

for k_tmp, K in enumerate(K_CANDIDATES):  # k_tmp indexes the corresponding seed set
    clusters = KMeans.train(parsedDataNOID, K, maxIterations=2000, runs=1,
                            epsilon=0.0,
                            initialModel=KMeansModel(datatmp[initCentroids['initSeedsA'][0][k_tmp][0] - 1, :]))


was (Author: purple):
If anyone's interested, the dataset I'm working on is freely available from UCI 
ML Repository 
(http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities).

I tried just now running the above K-Means for K=9120, with --driver-memory 4G. 
The full traceback can be found here (https://ghostbin.com/paste/9pu9k).

The code is absolutely simple, I don't think there's nothing wrong with it:

sc = SparkContext("local[*]", "Spark K-Means")
data = sc.textFile()
parsedData = data.map(lambda line: array([float(x) for x in line.split(',')]))
parsedDataNOID=parsedData.map(lambda pattern: pattern[1:])
parsedDataNOID.persist(StorageLevel.MEMORY_AND_DISK)

K_CANDIDATES=

initCentroids=scipy.io.loadmat(<.mat file with initial seeds>)
datatmp=numpy.genfromtxt(,delimiter=",")

for K in K_CANDIDATES:
 clusters = KMeans.train(parsedDataNOID, K, maxIterations=2000, runs=1, 
epsilon=0.0, initialModel = 
KMeansModel(datatmp[initCentroids['initSeedsA'][0][k_tmp][0]-1,:]))

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327476#comment-15327476
 ] 

Alessio commented on SPARK-15904:
-

If anyone's interested, the dataset I'm working on is freely available from UCI 
ML Repository 
(http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities).

I tried just now running the above K-Means for K=9120, with --driver-memory 4G. 
The full traceback can be found here (https://ghostbin.com/paste/9pu9k).

The code is absolutely simple, I don't think there's anything wrong with it:

import numpy
import scipy.io
from numpy import array
from pyspark import SparkContext, StorageLevel
from pyspark.mllib.clustering import KMeans, KMeansModel

sc = SparkContext("local[*]", "Spark K-Means")
data = sc.textFile()  # dataset path omitted in the original report
parsedData = data.map(lambda line: array([float(x) for x in line.split(',')]))
parsedDataNOID = parsedData.map(lambda pattern: pattern[1:])  # drop the leading ID column
parsedDataNOID.persist(StorageLevel.MEMORY_AND_DISK)

K_CANDIDATES =  # list of candidate K values omitted in the original report

initCentroids = scipy.io.loadmat(<.mat file with initial seeds>)
datatmp = numpy.genfromtxt(, delimiter=",")  # data file path omitted in the original report

for k_tmp, K in enumerate(K_CANDIDATES):  # k_tmp indexes the corresponding seed set
    clusters = KMeans.train(parsedDataNOID, K, maxIterations=2000, runs=1,
                            epsilon=0.0,
                            initialModel=KMeansModel(datatmp[initCentroids['initSeedsA'][0][k_tmp][0] - 1, :]))

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327443#comment-15327443
 ] 

Alessio commented on SPARK-15904:
-

Correct, Memory and Disk gives priority to memory... but my dataset is ~400MB, so 
it shouldn't be a problem. If I give Spark less RAM (I tried with 4GB and 8GB), 
Java throws an out-of-memory error for K>3000.
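
As an illustration of the storage-level trade-off being discussed, here is a 
minimal sketch, again assuming the parsedDataNOID RDD from the posted snippet 
(DISK_ONLY is only an experiment here, not what the reported runs used):

from pyspark import StorageLevel

# MEMORY_AND_DISK fills storage memory first and spills partitions to disk
# only when memory runs out; DISK_ONLY skips the memory-first behaviour at
# the cost of re-reading the data on every iteration.
parsedDataNOID.unpersist()
parsedDataNOID.persist(StorageLevel.DISK_ONLY)
print(parsedDataNOID.getStorageLevel())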

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327438#comment-15327438
 ] 

Alessio commented on SPARK-15904:
-

My machine has 16GB of RAM. I also tried closing all the other apps, leaving 
just the Terminal with Spark running. Still no luck.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327411#comment-15327411
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 1:55 PM:
--

This is absolutely weird to me. I gave Spark 9GB, and during the K-Means 
execution, if I monitor the memory stats I can see that Spark/Java has 9GB 
(nice) and no swap whatsoever. After K-Means has reached convergence, during 
this last cleaning stage, everything goes wild. Also, for the sake of 
scalability, RDDs are persisted on memory *and disk*, so I can't really 
understand this pressure blowup.


was (Author: purple):
This is absolutely weird to me. I gave Spark 9GB and during the K-Means 
execution, if I monitor the memory stat I can see that Spark/Java has 9GB 
(nice) and no Swap whatsoever. After K-means has reached convergence, during 
this last, cleaning stage everything goes wild.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327411#comment-15327411
 ] 

Alessio commented on SPARK-15904:
-

This is absolutely weird to me. I gave Spark 9GB, and during the K-Means 
execution, if I monitor the memory stats I can see that Spark/Java has 9GB 
(nice) and no swap whatsoever. After K-Means has reached convergence, during 
this last cleaning stage, everything goes wild.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327397#comment-15327397
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 1:48 PM:
--

Dear [~srowen], 
at the beginning I noticed that the "Cleaning RDD" phase (as in the original post) 
took a lot of time (10~15 minutes).
So I was curious and I opened the Activity Monitor on Mac OS X. That's when I 
noticed the Memory Pressure indicator going crazy. The swap memory increases up 
to 10GB (when K=9120). And after this Cleaning RDD stage everything's back to 
normal: swap memory is reduced to 1GB or 2GB, there's no more memory pressure, and 
it's ready for the next K.
Moreover, Spark does not stop the execution. I do not receive any 
"Out-of-memory" errors from Java, Python or Spark.

Have a look at the screenshot here (http://postimg.org/image/l4pc0vlzr/). 
K-means just finished another run for K=6000. See the memory stats: all of these 
peaks under the Last 24 Hours section are from Spark, after every K-Means run.


was (Author: purple):
Dear [~srowen], 
at the beginning I noticed that "Cleaning RDD” phase (as in the original post) 
took a lot of time (10~15 minutes).
So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I 
noticed the Memory Pressure indicator going crazy. The swap memory increases up 
to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to 
normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and 
ready for the next K.
Moreover, Spark does not stop the execution. I do not receive any 
“Out-of-memory” errors from either Java, Python or Spark.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327397#comment-15327397
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 1:49 PM:
--

Dear [~srowen], 
at the beginning I noticed that the "Cleaning RDD" phase (as in the original post) 
took a lot of time (10~15 minutes).
So I was curious and I opened the Activity Monitor on Mac OS X. That's when I 
noticed the Memory Pressure indicator going crazy. The swap memory increases up 
to 10GB (when K=9120). And after this Cleaning RDD stage everything's back to 
normal: swap memory is reduced to 1GB or 2GB, there's no more memory pressure, and 
it's ready for the next K.
Moreover, Spark does not stop the execution. I do not receive any 
"Out-of-memory" errors from Java, Python or Spark.

Have a look at the screenshot here (http://postimg.org/image/l4pc0vlzr/). 
K-means just finished another run for K=6000. See the memory stats: all of these 
peaks under the Last 24 Hours section are from Spark, after every K-Means run.
After a couple of minutes, here's the screenshot 
(http://postimg.org/image/qc7re8clt/): the memory pressure indicator is going 
down, but the swap size is 10GB. If I wait a few more minutes, everything will be 
back to normal.


was (Author: purple):
Dear [~srowen], 
at the beginning I noticed that "Cleaning RDD” phase (as in the original post) 
took a lot of time (10~15 minutes).
So I was curious and I opened the Activity Monitor on Mac OS X. That’s when I 
noticed the Memory Pressure indicator going crazy. The swap memory increases up 
to 10GB (when K=9120). And after this Cleaning RDD stage…everything’s back to 
normal. Swap memory will be reduced to 1GB or 2GBs. No more memory pressure and 
ready for the next K.
Moreover, Spark does not stop the execution. I do not receive any 
“Out-of-memory” errors from either Java, Python or Spark.

Have a look at the screenshot here (http://postimg.org/image/l4pc0vlzr/). 
K-means just finished another run for K=6000. See the memory stat, all of these 
peaks under the Last 24 Hours sections are from Spark, after every K-Means run.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327397#comment-15327397
 ] 

Alessio commented on SPARK-15904:
-

Dear [~srowen], 
at the beginning I noticed that the "Cleaning RDD" phase (as in the original post) 
took a lot of time (10~15 minutes).
So I was curious and I opened the Activity Monitor on Mac OS X. That's when I 
noticed the Memory Pressure indicator going crazy. The swap memory increases up 
to 10GB (when K=9120). And after this Cleaning RDD stage everything's back to 
normal: swap memory is reduced to 1GB or 2GB, there's no more memory pressure, and 
it's ready for the next K.
Moreover, Spark does not stop the execution. I do not receive any 
"Out-of-memory" errors from Java, Python or Spark.

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327272#comment-15327272
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 12:45 PM:
---

Dear Sean,
I must certainly agree with you on k<

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327272#comment-15327272
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 12:44 PM:
---

Dear Sean,
I must certainly agree with you on k<

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327272#comment-15327272
 ] 

Alessio edited comment on SPARK-15904 at 6/13/16 12:41 PM:
---

Dear Sean,
I must certainly agree with you on k<

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327272#comment-15327272
 ] 

Alessio commented on SPARK-15904:
-

Dear Sean,
I must certainly agree with you on k<

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 9G

  was:
Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 10G


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327234#comment-15327234
 ] 

Alessio commented on SPARK-15904:
-

My dataset has 9000+ patterns, each of which has 2000+ attributes. Thus it's 
perfectly legal to search for K>3000 and, of course, smaller than or equal to 
the number of patterns (9120).

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 10G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Priority: Minor  (was: Major)

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 10G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Issue Type: Improvement  (was: Bug)

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 10G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Alessio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327090#comment-15327090
 ] 

Alessio commented on SPARK-15904:
-

Hi [~yuhaoyan], the dataset size is 9120 rows and 2125 columns.
This problem appears when K>3000.
What do you suggest as the priority label? I'm sorry if "major" is not appropriate; 
this is my first post on JIRA.
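
For a sense of why large K is heavy on this dataset, a rough back-of-envelope 
sketch (the figures are taken from this thread; the per-task accumulator follows 
from MLlib's K-Means, which keeps the centers on the driver and has each task 
accumulate k running sums of length dims):

rows, dims, k, runs = 9120, 2125, 9120, 1
bytes_per_double = 8
local_tasks = 4  # local[*] on a hyperthreaded dual-core

centers_bytes = runs * k * dims * bytes_per_double   # cluster centers held on the driver
per_task_sums = runs * k * dims * bytes_per_double   # per-task accumulator of k centroid sums

print(centers_bytes / 1e6)                 # ~155 MB of centers
print(local_tasks * per_task_sums / 1e9)   # ~0.62 GB of accumulators across 4 concurrent tasks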

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 10G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-12 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. My 
machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 10G

  was:
Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this cluster analysis on a 16GB machine, with Spark Context as 
local[*]. My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 10G


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 10G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-12 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed. Full 
log of this stage:

16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
49784.87126751288.
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from persistence 
list
16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780

I'm running this cluster analysis on a 16GB machine, with Spark Context as 
local[*]. My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 10G

  was:
Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on Memory 
and Disk.
Everything's fine, although at the end of K-Means, after the number of 
iterations, the cost function value and the running time there's a nice 
"Removing RDD  from persistent list" stage. However, during this stage 
there's a high memory pressure. Weird, since RDDs are about to be removed.

I'm running this cluster analysis on a 16GB machine, with Spark Context as 
local[*]. My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
I'm launching this application though spark-submit with --driver-memory 10G


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time there's a nice 
> "Removing RDD  from persistent list" stage. However, during this stage 
> there's a high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this cluster analysis on a 16GB machine, with Spark Context as 
> local[*]. My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 10G






[jira] [Updated] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-12 Thread Alessio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessio updated SPARK-15904:

Description: 
I'm running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
Memory and Disk.
Everything runs fine until the end of K-Means: after the number of iterations, 
the cost function value and the running time are reported, a 
"Removing RDD  from persistence list" stage follows. During this stage there is 
high memory pressure, which is odd since the RDDs are about to be removed.

I'm running this cluster analysis on a 16GB machine, with the Spark context set 
to local[*]. My machine has a hyperthreaded dual-core i5, so [*] resolves to 4. 
I'm launching the application through spark-submit with --driver-memory 10G.

  was:
I'm running MLlib K-Means on a ~400MB dataset, persisted on Memory and Disk.
Everything runs fine until the end of K-Means: after the number of iterations, 
the cost function value and the running time are reported, a 
"Removing RDD  from persistence list" stage follows. During this stage there is 
high memory pressure, which is odd since the RDDs are about to be removed.

I'm running this cluster analysis on a 16GB machine, with the Spark context set 
to local[*]. My machine has a hyperthreaded dual-core i5, so [*] resolves to 4. 
I'm launching the application through spark-submit with --driver-memory 10G.


> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6 beta on MacBook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>
> I'm running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything runs fine until the end of K-Means: after the number of iterations, 
> the cost function value and the running time are reported, a 
> "Removing RDD  from persistence list" stage follows. During this stage there 
> is high memory pressure, which is odd since the RDDs are about to be removed.
> I'm running this cluster analysis on a 16GB machine, with the Spark context 
> set to local[*]. My machine has a hyperthreaded dual-core i5, so [*] resolves 
> to 4. I'm launching the application through spark-submit with --driver-memory 10G.






[jira] [Created] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-12 Thread Alessio (JIRA)
Alessio created SPARK-15904:
---

 Summary: High Memory Pressure using MLlib K-means
 Key: SPARK-15904
 URL: https://issues.apache.org/jira/browse/SPARK-15904
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.6.1
 Environment: Mac OS X 10.11.6 beta on MacBook Pro 13" mid-2012. 16GB of 
RAM.
Reporter: Alessio


I'm running MLlib K-Means on a ~400MB dataset, persisted on Memory and Disk.
Everything runs fine until the end of K-Means: after the number of iterations, 
the cost function value and the running time are reported, a 
"Removing RDD  from persistence list" stage follows. During this stage there is 
high memory pressure, which is odd since the RDDs are about to be removed.

I'm running this cluster analysis on a 16GB machine, with the Spark context set 
to local[*]. My machine has a hyperthreaded dual-core i5, so [*] resolves to 4. 
I'm launching the application through spark-submit with --driver-memory 10G.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org