[jira] [Created] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

2015-01-02 Thread Tomislav Milinovic (JIRA)
Tomislav Milinovic created SPARK-5056:
-

 Summary: Implementing Clara k-medoids clustering algorithm for 
large datasets
 Key: SPARK-5056
 URL: https://issues.apache.org/jira/browse/SPARK-5056
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Tomislav Milinovic
Priority: Minor


There is a specific k-medoids clustering algorithm for large datasets. The 
algorithm is called Clara in R, and is fully described in Chapter 3 of Finding 
Groups in Data: An Introduction to Cluster Analysis by Kaufman, L. and 
Rousseeuw, P.J. (1990). 
The algorithm considers sub-datasets of fixed size (sampsize) such that the 
time and storage requirements become linear in n rather than quadratic. Each 
sub-dataset is partitioned into k clusters using the same algorithm as in 
Partitioning Around Medoids (PAM).
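A minimal sketch of CLARA's outer loop, assuming hypothetical helpers: pam(sample, k) 
runs PAM on a small sample and returns k medoids, and cost(data, medoids) sums the 
distance from each point to its nearest medoid. Names and parameters are illustrative only.

{code}
import scala.util.Random

// Sketch of CLARA: draw several sub-datasets of fixed size (sampSize), run PAM
// on each, and keep the medoids that are cheapest when scored on the FULL data.
// pam() and cost() are hypothetical stand-ins for a real implementation.
def clara[T](data: IndexedSeq[T], k: Int, sampSize: Int, samples: Int = 5)
            (pam: (IndexedSeq[T], Int) => IndexedSeq[T],
             cost: (IndexedSeq[T], IndexedSeq[T]) => Double): IndexedSeq[T] = {
  (1 to samples).map { _ =>
    val sample = Random.shuffle(data).take(sampSize)   // sub-dataset of fixed size
    val medoids = pam(sample, k)                       // cluster the sample with PAM
    (medoids, cost(data, medoids))                     // score candidates on all of the data
  }.minBy(_._2)._1
}
{code}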







[jira] [Commented] (SPARK-2165) spark on yarn: add support for setting maxAppAttempts in the ApplicationSubmissionContext

2015-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262881#comment-14262881
 ] 

Apache Spark commented on SPARK-2165:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/3878

 spark on yarn: add support for setting maxAppAttempts in the 
 ApplicationSubmissionContext
 -

 Key: SPARK-2165
 URL: https://issues.apache.org/jira/browse/SPARK-2165
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves

 Hadoop 2.x adds support for allowing the application to specify the maximum 
 application attempts. We should add support for it by setting it in the 
 ApplicationSubmissionContext.
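 A hedged sketch of how this could be wired up when building the YARN submission 
 context; the Spark configuration key shown is an assumption for illustration, not 
 a settled name:

{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext
import org.apache.spark.SparkConf

// If the (hypothetical) setting is present, pass it through to YARN; otherwise the
// cluster default (yarn.resourcemanager.am.max-attempts) stays in effect.
def applyMaxAppAttempts(conf: SparkConf, appContext: ApplicationSubmissionContext): Unit = {
  conf.getOption("spark.yarn.maxAppAttempts").map(_.toInt).foreach { attempts =>
    appContext.setMaxAppAttempts(attempts)  // ApplicationSubmissionContext setter added in Hadoop 2.x
  }
}
{code}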






[jira] [Commented] (SPARK-5057) Add more details in log when using actor to get infos

2015-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262827#comment-14262827
 ] 

Apache Spark commented on SPARK-5057:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/3875

 Add more details in log when using actor to get infos
 -

 Key: SPARK-5057
 URL: https://issues.apache.org/jira/browse/SPARK-5057
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: WangTaoTheTonic
Priority: Minor

 As actors are used to fetch information in many cases, it is better for 
 analysis to print the contents of the message after an attempt fails.






[jira] [Commented] (SPARK-5058) Typos and broken URL

2015-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262843#comment-14262843
 ] 

Apache Spark commented on SPARK-5058:
-

User 'sigmoidanalytics' has created a pull request for this issue:
https://github.com/apache/spark/pull/3877

 Typos and broken URL
 

 Key: SPARK-5058
 URL: https://issues.apache.org/jira/browse/SPARK-5058
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Affects Versions: 1.2.0
Reporter: AkhlD
Priority: Minor
 Fix For: 1.2.1


 The Spark Streaming + Kafka Integration Guide has a broken Examples link. Also, 
 project is misspelled as projrect.






[jira] [Commented] (SPARK-5058) Typos and broken URL

2015-01-02 Thread AkhlD (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262842#comment-14262842
 ] 

AkhlD commented on SPARK-5058:
--

Created a PR https://github.com/apache/spark/pull/3877

 Typos and broken URL
 

 Key: SPARK-5058
 URL: https://issues.apache.org/jira/browse/SPARK-5058
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Affects Versions: 1.2.0
Reporter: AkhlD
Priority: Minor
 Fix For: 1.2.1


 The Spark Streaming + Kafka Integration Guide has a broken Examples link. Also, 
 project is misspelled as projrect.






[jira] [Created] (SPARK-5057) Add more details in log when using actor to get infos

2015-01-02 Thread WangTaoTheTonic (JIRA)
WangTaoTheTonic created SPARK-5057:
--

 Summary: Add more details in log when using actor to get infos
 Key: SPARK-5057
 URL: https://issues.apache.org/jira/browse/SPARK-5057
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: WangTaoTheTonic
Priority: Minor


As actors are used to fetch information in many cases, it is better for 
analysis to print the contents of the message after an attempt fails.
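A hedged sketch (not Spark's actual helper) of the kind of logging this asks for: 
when an ask to an actor fails, include the message contents in the warning so the 
failure can be traced back to a concrete request. Names and the retry policy are 
assumptions.

{code}
import scala.concurrent.Await
import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout

// Illustrative retry loop: the message itself is printed on every failed attempt.
def askWithLogging[T](actor: ActorRef, message: Any, attempts: Int)
                     (implicit timeout: Timeout): T = {
  var lastError: Throwable = null
  var attempt = 1
  while (attempt <= attempts) {
    try {
      return Await.result(actor.ask(message), timeout.duration).asInstanceOf[T]
    } catch {
      case e: Exception =>
        lastError = e
        println(s"Attempt $attempt: ask of message [$message] to $actor failed: ${e.getMessage}")
    }
    attempt += 1
  }
  throw new RuntimeException(s"No reply for message [$message] after $attempts attempts", lastError)
}
{code}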






[jira] [Created] (SPARK-5055) Minor typos on the downloads page

2015-01-02 Thread Marko Bonaci (JIRA)
Marko Bonaci created SPARK-5055:
---

 Summary: Minor typos on the downloads page
 Key: SPARK-5055
 URL: https://issues.apache.org/jira/browse/SPARK-5055
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Marko Bonaci
Priority: Trivial


The _Downloads_ page uses the word Chose where the present tense is intended; it should say Choose.
http://spark.apache.org/downloads.html






[jira] [Created] (SPARK-5058) Typos and broken URL

2015-01-02 Thread AkhlD (JIRA)
AkhlD created SPARK-5058:


 Summary: Typos and broken URL
 Key: SPARK-5058
 URL: https://issues.apache.org/jira/browse/SPARK-5058
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Affects Versions: 1.2.0
Reporter: AkhlD
Priority: Minor
 Fix For: 1.2.1


The Spark Streaming + Kafka Integration Guide has a broken Examples link. Also, 
project is misspelled as projrect.






[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot

2015-01-02 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263230#comment-14263230
 ] 

Alex Liu commented on SPARK-4943:
-

Should we also change the signatures of the Catalog methods to use 
{code}tableIdentifier: Seq[String]{code} instead of {code}db: Option[String], 
tableName: String{code}?

{code}

  def tableExists(db: Option[String], tableName: String): Boolean

  def lookupRelation(
databaseName: Option[String],
tableName: String,
alias: Option[String] = None): LogicalPlan

  def registerTable(databaseName: Option[String], tableName: String, plan: 
LogicalPlan): Unit

  def unregisterTable(databaseName: Option[String], tableName: String): Unit

  def unregisterAllTables(): Unit

  protected def processDatabaseAndTableName(
  databaseName: Option[String],
  tableName: String): (Option[String], String)
{code}
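For illustration, a sketch of what the suggested Seq[String]-based signatures might 
look like; this mirrors the commenter's proposal and is not a committed API:

{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical shape of the proposal: a single multi-part identifier such as
// Seq("sql_test", "test1") replaces the (db: Option[String], tableName: String) pair.
trait Catalog {
  def tableExists(tableIdentifier: Seq[String]): Boolean

  def lookupRelation(
      tableIdentifier: Seq[String],
      alias: Option[String] = None): LogicalPlan

  def registerTable(tableIdentifier: Seq[String], plan: LogicalPlan): Unit

  def unregisterTable(tableIdentifier: Seq[String]): Unit

  def unregisterAllTables(): Unit
}
{code}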

 Parsing error for query with table name having dot
 --

 Key: SPARK-4943
 URL: https://issues.apache.org/jira/browse/SPARK-4943
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Alex Liu

 When integrating Spark 1.2.0 with Cassandra SQL, the following query is 
 broken. It was working in Spark 1.1.0. Basically we use a table name 
 containing a dot to include the database name 
 {code}
 [info]   java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but 
 `.' found
 [info] 
 [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT 
 test2.a FROM sql_test.test2 AS test2
 [info] ^
 [info]   at scala.sys.package$.error(package.scala:27)
 [info]   at 
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
 [info]   at 
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
 [info]   at 
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 [info]   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
 [info]   at 
 scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
 [info]   at 
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
 [info]   at scala.Option.getOrElse(Option.scala:120)
 [info]   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
 [info]   at 
 org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53)
 [info]   at 
 org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56)
 [info]   at 
 com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169)
 [info]   at 
 com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
 [info]   at 
 com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
 [info]   at 
 

[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263187#comment-14263187
 ] 

Nicholas Chammas commented on SPARK-3821:
-

I need to brush up on my statistics, but I think the difference between base 
AMI and Packer AMI is not statistically significant.

The benchmark just tested time from instance launch to SSH availability. 
Nothing was installed or done with the instances after SSH became available. 
(i.e. I wasn't creating Spark clusters.) I still have to post updated 
benchmarks for full cluster launches.

Is there anything else you wanted to see before reviewing this proposal in more 
detail?

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263206#comment-14263206
 ] 

Nicholas Chammas commented on SPARK-3821:
-

I have Packer configured to run {{create_image.sh}}, as well as other scripts I 
added (e.g. to install Python 2.7), to generate the AMIs I am using. So testing 
Packer-generated AMIs against manually-generated ones (by running 
{{create_image.sh}} by hand) should show little difference.

Packer is just tooling to automate the application of existing scripts like 
{{create_image.sh}} towards creating AMIs and other image types like GCE images 
and Docker images. The goal is to make it easy to generate and update Spark 
AMIs (and eventually Docker images too) in an automated fashion.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263193#comment-14263193
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

Yeah, you are right that the times are pretty close for the Packer and base AMIs. I was 
just curious if I was missing something. I don't think there is much else I 
had in mind -- having the full cluster launch times for the existing AMI vs. Packer 
would be good, and it would also be good to see how Packer compares to images 
created using 
[create_image.sh|https://github.com/mesos/spark-ec2/blob/v4/create_image.sh].

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263181#comment-14263181
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

[~nchammas] Thanks for the benchmark. One thing I am curious about is why the 
Packer AMI is faster than launching just the base Amazon AMI. Is this because 
we spend some time installing things on the base AMI that we avoid with 
Packer?

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.






[jira] [Commented] (SPARK-5061) SQLContext: overload createParquetFile

2015-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263278#comment-14263278
 ] 

Apache Spark commented on SPARK-5061:
-

User 'alexbaretta' has created a pull request for this issue:
https://github.com/apache/spark/pull/3882

 SQLContext: overload createParquetFile
 --

 Key: SPARK-5061
 URL: https://issues.apache.org/jira/browse/SPARK-5061
 Project: Spark
  Issue Type: New Feature
Reporter: Alex Baretta

 Overload createParquetFile to support an explicit schema in the form of a 
 StructType object, as follows:
 def createParquetFile(schema: StructType, path: String, allowExisting: 
 Boolean, conf: org.apache.hadoop.conf.Configuration): SchemaRDD






[jira] [Created] (SPARK-5061) SQLContext: overload createParquetFile

2015-01-02 Thread Alex Baretta (JIRA)
Alex Baretta created SPARK-5061:
---

 Summary: SQLContext: overload createParquetFile
 Key: SPARK-5061
 URL: https://issues.apache.org/jira/browse/SPARK-5061
 Project: Spark
  Issue Type: New Feature
Reporter: Alex Baretta


Overload createParquetFile to support an explicit schema in the form of a 
StructType object, as follows:

def createParquetFile(schema: StructType, path: String, allowExisting: Boolean, 
    conf: org.apache.hadoop.conf.Configuration): SchemaRDD
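A hedged sketch of how the proposed overload might be used; the overload itself is 
the feature being requested, not an existing method, and the import path for the 
schema types varies across Spark versions:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql._   // StructType, StructField, etc. (location differs by Spark version)

// Build an explicit schema without defining a case class...
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// ...and hand it to the proposed overload (hypothetical call shown commented out):
// val people: SchemaRDD =
//   sqlContext.createParquetFile(schema, "/tmp/people.parquet",
//     allowExisting = true, new Configuration())
{code}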






[jira] [Created] (SPARK-5059) list of user's objects in Spark REPL

2015-01-02 Thread Tomas Hudik (JIRA)
Tomas Hudik created SPARK-5059:
--

 Summary: list of user's objects in Spark REPL
 Key: SPARK-5059
 URL: https://issues.apache.org/jira/browse/SPARK-5059
 Project: Spark
  Issue Type: New Feature
  Components: Spark Shell
Reporter: Tomas Hudik
Priority: Minor


Often users do not remember all the objects they have created in the Spark REPL (shell). 
It would be helpful to have a command that lists all such objects, e.g. R 
uses *ls()* to list all objects.
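Purely illustrative sketch of the requested behaviour; the ls() command shown here 
does not exist today, and the output format is an assumption:

{code}
scala> val lines = sc.textFile("README.md")
scala> val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

scala> ls()   // hypothetical command, analogous to R's ls()
res2: List[String] = List(lines, counts)
{code}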






[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream

2015-01-02 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263328#comment-14263328
 ] 

Tathagata Das commented on SPARK-4905:
--

[~hshreedharan] Can you take a look at this please!! I have been seeing this 
once in a while. I had seen this when the test sent one message at a time, and 
to increase the chances of success, I modified the test to send a whole bunch 
at a time, repeatedly, until all got through or nothing got through. But it 
still seems to be failing. I have no idea why empty strings are being sent when I am 
trying to send 1, 2, 3, etc. Please take a look.

 Flaky FlumeStreamSuite test: 
 org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
 -

 Key: SPARK-4905
 URL: https://issues.apache.org/jira/browse/SPARK-4905
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Josh Rosen
  Labels: flaky-test

 It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume 
 input stream test might be flaky 
 ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]):
 {code}
 Error Message
 The code passed to eventually never returned normally. Attempted 106 times 
 over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , 
 , , , , , , , , , , , , , , , , , , , 
 , , , , , , , , , , , , , , , , , , , 
 , , , , , , , , , , , , , , , , , , , 
 , , , , , , , , , , , , , , , , , , , 
 , , , , , , , , , , , , , , , , , , , 
 ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 
 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 
 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 
 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 
 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 
 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 
 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 
 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 
 100).
 Stacktrace
 sbt.ForkMain$ForkError: The code passed to eventually never returned 
 normally. Attempted 106 times over 10.045097243 seconds. Last failure 
 message: ArrayBuffer(, , , , , , , , , , , , , , 
 , , , , , , , , , , , , , , , , , , , 
 , , , , , , , , , , , , , , , , , , , 
 , , , , , , , , , , , , , , , , , , , 
 , , , , , , , , , , , , , , , , , , , 
 , , , , , , , , , ) was not equal to Vector(1, 2, 
 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 
 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 
 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 
 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 
 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 
 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 
 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 
 95, 96, 97, 98, 99, 100).
   at 
 org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
   at 
 org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
   at 
 org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142)
   at 
 org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74)
   at 
 org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62)
   at 
 org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62)
   at 
 org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   

[jira] [Updated] (SPARK-5036) Better support sending partial messages in Pregel API

2015-01-02 Thread sjk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sjk updated SPARK-5036:
---
Description: 
Better support sending partial messages in Pregel API

1. the requirement

In many iterative graph algorithms, only a part of the vertexes (we call them 
ActiveVertexes) need to send messages to their neighbours in each iteration. In 
many cases, ActiveVertexes are the vertexes whose attributes changed between 
the previous and current iteration. To implement this requirement, we can use 
the Pregel API + a flag (e.g., `bool isAttrChanged`) in each vertex's 
attribute. 

However, after `aggregateMessage` or `mapReduceTriplets` of each iteration, we 
need to reset this flag to the init value in every vertex, which needs a heavy 
`joinVertices`. 

We find a more efficient way to meet this requirement and want to discuss it 
here.


Look at a simple example as follows:

In the i-th iteration, the previous attribute of each vertex is `Attr` and the 
newly computed attribute is `NewAttr`:

| VID | Attr | NewAttr | Neighbours |
|:----|:-----|:--------|:-----------|
| 1   | 4    | 5       | 2, 3       |
| 2   | 3    | 2       | 1, 4       |
| 3   | 2    | 2       | 1, 4       |
| 4   | 3    | 4       | 1, 2, 3    |

Our requirement is that: 

1.  Set each vertex's `Attr` to be `NewAttr` in i-th iteration
2.  For each vertex whose `Attr!=NewAttr`, send message to its neighbours 
in the next iteration's `aggregateMessage`.


We found it is hard to implement this requirement efficiently using the current 
Pregel API. The reason is that we not only need to perform `pregel()` to 
compute the `NewAttr` (2) but also need to perform `outJoin()` to satisfy (1).

A simple idea is to keep a `isAttrChanged:Boolean` (solution 1)  or `flag:Int` 
(solution 2) in each vertex's attribute.

 2. two solutions
---

2.1 solution 1: label and reset `isAttrChanged:Boolean` of Vertex Attr

![alt text](s1.jpeg "Title")

1. init messages by `aggregateMessage`; it returns a messageRDD
2. `innerJoin`
   compute the messages on the receiving vertices and return a new VertexRDD 
   holding the values computed by the custom logic function `vprog`, setting 
   `isAttrChanged = true`
3. `outerJoinVertices`
   update the changed vertices into the whole graph; now the graph is new
4. `aggregateMessage`; it returns a messageRDD
5. `joinVertices`: reset every `isAttrChanged` of the vertex attr to false

```
// here reset the isAttrChanged to false
g = updateG.joinVertices(updateG.vertices) {
  (vid, oriVertex, updateGVertex) => updateGVertex.reset()
}
```
   Here we need to reset the vertex attribute object's variable to false.

If we don't reset `isAttrChanged`, the vertex will send a message in the next 
iteration directly.

**result:**

*   Edge: 890041895
*   Vertex: 181640208
*   Iterate: 150 times
*   Cost total: 8.4h
*   cannot run down to the point where 0 messages are sent


2.2 solution 2: color vertex

![alt text](s2.jpeg "Title")

iterate process (see the sketch after this list):

1. innerJoin
   `vprog` is used as a partial function and looks like 
   `vprog(curIter, _: VertexId, _: VD, _: A)`;
   `i = i + 1; val curIter = i`.
   In `vprog`, the user can fetch `curIter` and assign it to the `flag`.
2. outerJoinVertices
   `graph = graph.outerJoinVertices(changedVerts) { (vid, old, newOpt) => 
   newOpt.getOrElse(old) }.cache()`
3. aggregateMessages
   sendMsg is a partial function and looks like 
   `sendMsg(curIter, _: EdgeContext[VD, ED, A])`.
   **In `sendMsg`, compare `curIter` with `flag` to determine whether to 
   send a message.**
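
A minimal sketch of the solution 2 idea, assuming an illustrative vertex attribute 
that carries the iteration in which it last changed (the flag); only vertices that 
changed in the previous iteration emit messages:

```
import org.apache.spark.graphx._

// Hypothetical attribute: a value plus the iteration in which it last changed.
case class Attr(value: Double, changedAt: Int)

// sendMsg partially applied over the current iteration: an edge sends a message
// only if its source vertex changed in the previous iteration.
def sendMsg(curIter: Int, ctx: EdgeContext[Attr, Double, Double]): Unit = {
  if (ctx.srcAttr.changedAt == curIter - 1) {
    ctx.sendToDst(ctx.srcAttr.value * ctx.attr)
  }
}

// inside the iteration loop, e.g.:
// val messages = graph.aggregateMessages[Double](sendMsg(curIter, _), _ + _)
```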

result

raw data   from

*   vertex: 181640208
*   edge: 890041895


|            | iteration average cost | 150 iteration cost | 420 iteration cost |
| ---------- | ---------------------- | ------------------ | ------------------ |
| solution 1 | 188m                   | 7.8h               | cannot finish      |
| solution 2 | 24                     | 1.2h               | 3.1h               |
| compare    | 7x                     | 6.5x               | finished in 3.1h   |


##  the end

I think the second solution (Pregel + a flag) is better.
It can really support iterative graph algorithms in which only part of the 
vertexes send messages to their neighbours in each iteration.

We shall use it in a production environment.

pr: https://github.com/apache/spark/pull/3866

EOF


  was:
Better support sending partial messages in Pregel API

1. the reqirement

In many iterative graph algorithms, only a part of the vertexes (we call them 
ActiveVertexes) need to send messages to their neighbours in each iteration. In 
many cases, ActiveVertexes are the vertexes that their attributes do not change 
between the previous and current iteration. To implement this requirement, we 
can use Pregel API + a flag (e.g., `bool isAttrChanged`) in each vertex's 
attribute. 

However, after `aggregateMessage` or `mapReduceTriplets` of each iteration, we 
need to reset this flag to the init value in every vertex, which needs a heavy 
`joinVertices`. 

We find a more efficient way to meet this requirement and want to discuss 

[jira] [Updated] (SPARK-5063) Raise more helpful errors when RDD actions or transformations are called inside of transformations

2015-01-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5063:
--
Summary: Raise more helpful errors when RDD actions or transformations are 
called inside of transformations  (was: Raise more helpful errors when 
SparkContext methods are called inside of transformations)

 Raise more helpful errors when RDD actions or transformations are called 
 inside of transformations
 --

 Key: SPARK-5063
 URL: https://issues.apache.org/jira/browse/SPARK-5063
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 Spark does not support nested RDDs or performing Spark actions inside of 
 transformations; this usually leads to NullPointerExceptions (see SPARK-718 
 as one example).  The confusing NPE is one of the most common sources of 
 Spark questions on StackOverflow:
 - 
 https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534
 - 
 https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399
 - 
 https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674
 (those are just a sample of the ones that I've answered personally; there are 
 many others).
 I think that we should add some logic to attempt to detect these sorts of 
 errors: we can use a DynamicVariable to check whether we're inside a task and 
 throw more useful errors when the RDD constructor is called from inside a 
 task or when the SparkContext job submission methods are called.
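 A minimal sketch of the DynamicVariable idea described above; the names are 
 illustrative, not the actual Spark internals: the task runner sets a flag around 
 user code, and driver-only entry points check it to fail with a clear message.

{code}
import scala.util.DynamicVariable

// Illustrative guard: true while executing user code inside a task.
object TaskGuard {
  private val insideTask = new DynamicVariable[Boolean](false)

  // Wrapped around the user function by the task runner.
  def runningTask[T](body: => T): T = insideTask.withValue(true)(body)

  // Called from driver-only entry points such as the RDD constructor
  // or job submission methods.
  def assertOnDriver(op: String): Unit =
    if (insideTask.value) {
      throw new UnsupportedOperationException(
        s"$op cannot be used inside a transformation; " +
        "RDD transformations and actions may only be invoked by the driver.")
    }
}
{code}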






[jira] [Comment Edited] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-02 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263287#comment-14263287
 ] 

Kannan Rajah edited comment on SPARK-1529 at 1/2/15 10:53 PM:
--

[~lian cheng] [~pwendell] I want to work on this JIRA. It's been a while since 
there has been any update. So can you please share what the current status is? 
Has there been a consensus on replacing the file API with an HDFS-like 
interface and plugging in the right implementation? I will be looking at the 
code base in the meantime.


was (Author: rkannan82):
[~lian cheng] [~pwendell]] I want to work on this JIRA. It's been a while since 
there has been any update. So can you please share what the current status is? 
Has there been a consensus on replacing the file API with a HDFS kind of 
interface and plugging in the right implementation? I will be looking at the 
code base in the mean time.

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Cheng Lian

 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 
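 A hedged illustration of the desired configuration (not something Spark supports 
 today; the URI is hypothetical):

{code}
import org.apache.spark.SparkConf

// Today spark.local.dir must point at local disk paths; the request is to also
// accept a Hadoop-filesystem location such as a MapR volume.
val conf = new SparkConf()
  .setAppName("scratch-on-dfs")
  .set("spark.local.dir", "maprfs:///spark/scratch")   // hypothetical, illustrative URI
{code}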






[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-02 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263287#comment-14263287
 ] 

Kannan Rajah commented on SPARK-1529:
-

[~lian cheng] [~pwendell]] I want to work on this JIRA. It's been a while since 
there has been any update. So can you please share what the current status is? 
Has there been a consensus on replacing the file API with a HDFS kind of 
interface and plugging in the right implementation? I will be looking at the 
code base in the mean time.

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Cheng Lian

 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 






[jira] [Commented] (SPARK-5062) Pregel use aggregateMessage instead of mapReduceTriplets function

2015-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263369#comment-14263369
 ] 

Apache Spark commented on SPARK-5062:
-

User 'shijinkui' has created a pull request for this issue:
https://github.com/apache/spark/pull/3883

 Pregel use aggregateMessage instead of mapReduceTriplets function
 -

 Key: SPARK-5062
 URL: https://issues.apache.org/jira/browse/SPARK-5062
 Project: Spark
  Issue Type: Wish
  Components: GraphX
Reporter: sjk
 Attachments: graphx_aggreate_msg.jpg


 Since Spark 1.2 introduced aggregateMessage instead of mapReduceTriplets, it 
 indeed improves performance.
 It's time to replace mapReduceTriplets with aggregateMessage in Pregel.
 We can discuss it.
 I have drawn a graph of aggregateMessage to show why it can improve 
 performance.






[jira] [Updated] (SPARK-5062) Pregel use aggregateMessage instead of mapReduceTriplets function

2015-01-02 Thread sjk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sjk updated SPARK-5062:
---
Attachment: graphx_aggreate_msg.jpg

 Pregel use aggregateMessage instead of mapReduceTriplets function
 -

 Key: SPARK-5062
 URL: https://issues.apache.org/jira/browse/SPARK-5062
 Project: Spark
  Issue Type: Wish
  Components: GraphX
Reporter: sjk
 Attachments: graphx_aggreate_msg.jpg


 Since Spark 1.2 introduced aggregateMessage instead of mapReduceTriplets, it 
 indeed improves performance.
 It's time to replace mapReduceTriplets with aggregateMessage in Pregel.
 We can discuss it.
 I have drawn a graph of aggregateMessage to show why it can improve 
 performance.






[jira] [Created] (SPARK-5062) Pregel use aggregateMessage instead of mapReduceTriplets function

2015-01-02 Thread sjk (JIRA)
sjk created SPARK-5062:
--

 Summary: Pregel use aggregateMessage instead of mapReduceTriplets 
function
 Key: SPARK-5062
 URL: https://issues.apache.org/jira/browse/SPARK-5062
 Project: Spark
  Issue Type: Wish
  Components: GraphX
Reporter: sjk


Since Spark 1.2 introduced aggregateMessage instead of mapReduceTriplets, it 
indeed improves performance.

It's time to replace mapReduceTriplets with aggregateMessage in Pregel.

We can discuss it.

I have drawn a graph of aggregateMessage to show why it can improve 
performance.
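For reference, a small sketch of the aggregateMessages API introduced in Spark 1.2 
(the computation is illustrative): send functions write into per-vertex aggregation 
buffers instead of returning an Iterator of messages per triplet, which is where the 
gain over mapReduceTriplets comes from.

{code}
import org.apache.spark.graphx._

// Illustrative use of aggregateMessages: sum the source-vertex attribute into
// each destination vertex.
def neighbourSums(graph: Graph[Double, Int]): VertexRDD[Double] =
  graph.aggregateMessages[Double](
    ctx => ctx.sendToDst(ctx.srcAttr),   // send: runs once per edge
    _ + _                                // merge: combines messages per vertex
  )
{code}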







[jira] [Commented] (SPARK-5063) Raise more helpful errors when RDD actions or transformations are called inside of transformations

2015-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263415#comment-14263415
 ] 

Apache Spark commented on SPARK-5063:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3884

 Raise more helpful errors when RDD actions or transformations are called 
 inside of transformations
 --

 Key: SPARK-5063
 URL: https://issues.apache.org/jira/browse/SPARK-5063
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 Spark does not support nested RDDs or performing Spark actions inside of 
 transformations; this usually leads to NullPointerExceptions (see SPARK-718 
 as one example).  The confusing NPE is one of the most common sources of 
 Spark questions on StackOverflow:
 - 
 https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534
 - 
 https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399
 - 
 https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674
 (those are just a sample of the ones that I've answered personally; there are 
 many others).
 I think we can detect these errors by adding logic to {{RDD}} to check 
 whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use 
 this to add a better error message.






[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-02 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263458#comment-14263458
 ] 

Chip Senkbeil commented on SPARK-4923:
--

FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel

 Maven build should keep publishing spark-repl
 -

 Key: SPARK-4923
 URL: https://issues.apache.org/jira/browse/SPARK-4923
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Affects Versions: 1.2.0
Reporter: Peng Cheng
Priority: Critical
  Labels: shell
 Attachments: 
 SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Spark-repl installation and deployment has been discontinued (see 
 SPARK-3452). But it's in the dependency list of a few projects that extend 
 its initialization process.
 Please remove the 'skip' setting in spark-repl and make it an 'official' API 
 to encourage more platforms to integrate with it.


