[jira] [Commented] (SPARK-6223) Avoid Build warning- enable implicit value scala.language.existentials visible
[ https://issues.apache.org/jira/browse/SPARK-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352837#comment-14352837 ] Vinod KC commented on SPARK-6223: - I'm working on this Avoid Build warning- enable implicit value scala.language.existentials visible -- Key: SPARK-6223 URL: https://issues.apache.org/jira/browse/SPARK-6223 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Vinod KC Priority: Trivial spark/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala:316: inferred existential type Option[(Class[_$4], org.apache.spark.sql.sources.BaseRelation)] forSome { type _$4 }, which cannot be expressed by wildcards, should be enabled by making the implicit value scala.language.existentials visible. This can be achieved by adding the import clause 'import scala.language.existentials' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
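For context, this warning fires whenever the compiler infers an existential type that wildcards cannot express. A minimal standalone sketch (not the actual ddl.scala code) of the pattern that triggers it under -feature, plus the one-line import the compiler suggests:

```scala
// Enabling the language feature, as the warning message suggests, silences it.
import scala.language.existentials

// Binding a wildcard-typed value to a val forces the compiler to infer an
// existential such as (Class[_$1], String) forSome { type _$1 } -- the same
// shape reported for ddl.scala:316. Without the import above, compiling
// with -feature would emit the warning quoted in this issue.
def capture(c: Class[_]): Option[(Class[_], String)] = {
  val pair = (c, "relation") // existential type inferred here
  Some(pair)
}
```

The alternative is passing the `-language:existentials` compiler flag build-wide instead of importing per file.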
[jira] [Closed] (SPARK-6224) Also collect NamedExpressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-6224. -- Resolution: Not a Problem Also collect NamedExpressions in PhysicalOperation -- Key: SPARK-6224 URL: https://issues.apache.org/jira/browse/SPARK-6224 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Currently in PhysicalOperation, only Alias expressions are collected. Similarly, NamedExpression can be collected for substitution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6224) Also collect NamedExpressions in PhysicalOperation
Liang-Chi Hsieh created SPARK-6224: -- Summary: Also collect NamedExpressions in PhysicalOperation Key: SPARK-6224 URL: https://issues.apache.org/jira/browse/SPARK-6224 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Currently in PhysicalOperation, only Alias expressions are collected. Similarly, NamedExpression can be collected for substitution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6225) Resolve most build warnings, 1.3.0 edition
Sean Owen created SPARK-6225: Summary: Resolve most build warnings, 1.3.0 edition Key: SPARK-6225 URL: https://issues.apache.org/jira/browse/SPARK-6225 Project: Spark Issue Type: Improvement Components: MLlib, Spark Core, SQL, Streaming Affects Versions: 1.3.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Post-1.3.0, I think it would be a good exercise to resolve a number of build warnings that have accumulated recently. See for example efforts begun at https://github.com/apache/spark/pull/4948 https://github.com/apache/spark/pull/4900 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6188) Instance types can be mislabeled when re-starting cluster with default arguments
[ https://issues.apache.org/jira/browse/SPARK-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6188. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4916 [https://github.com/apache/spark/pull/4916] Instance types can be mislabeled when re-starting cluster with default arguments Key: SPARK-6188 URL: https://issues.apache.org/jira/browse/SPARK-6188 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1 Reporter: Theodore Vasiloudis Priority: Minor Fix For: 1.4.0 This was discovered when investigating https://issues.apache.org/jira/browse/SPARK-5838. In short, when restarting a cluster that you launched with an alternative instance type, you have to provide the instance type(s) again in the {{./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>}} command. Otherwise it gets set to the default m1.large. This then affects the setup of the machines. I'll submit a pull request that takes care of this, so the user does not need to provide the instance type(s) again. EDIT: Example case where this becomes a problem: 1. User launches a cluster with instances that have 1 disk, e.g. m3.large. 2. The user stops the cluster. 3. When the user restarts the cluster with the start command without providing the instance type, the setup is performed using the default instance type, m1.large, which assumes 2 disks are present in the machine. 4. SPARK_LOCAL_DIRS is then set to /mnt/spark,/mnt2/spark. /mnt2 corresponds to the snapshot partition in an m3.large instance, which is only 8GB in size. When the user runs jobs that shuffle data, this partition fills up quickly, resulting in failed jobs due to "No space left on device" errors. Apart from this example one could come up with other cases where the setup of the machines is wrong due to assuming that they are of type m1.large. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
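Until the fix lands, restarting with the instance type stated explicitly avoids the mislabeling. A hedged sketch of the invocation (the `--instance-type` flag per the spark-ec2 script's options of that era; placeholders in angle brackets):

```shell
# Repeat the type used at launch so setup does not assume the m1.large default.
./spark-ec2 -i <key-file> --region=<ec2-region> \
  --instance-type=m3.large \
  start <cluster-name>
```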
[jira] [Commented] (SPARK-6224) Also collect NamedExpressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352945#comment-14352945 ] Apache Spark commented on SPARK-6224: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4949 Also collect NamedExpressions in PhysicalOperation -- Key: SPARK-6224 URL: https://issues.apache.org/jira/browse/SPARK-6224 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Currently in PhysicalOperation, only Alias expressions are collected. Similarly, NamedExpression can be collected for substitution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5986) Model import/export for KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353022#comment-14353022 ] Apache Spark commented on SPARK-5986: - User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/4951 Model import/export for KMeansModel --- Key: SPARK-5986 URL: https://issues.apache.org/jira/browse/SPARK-5986 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin Support save/load for KMeansModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6223) Avoid Build warning- enable implicit value scala.language.existentials visible
Vinod KC created SPARK-6223: --- Summary: Avoid Build warning- enable implicit value scala.language.existentials visible Key: SPARK-6223 URL: https://issues.apache.org/jira/browse/SPARK-6223 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Vinod KC Priority: Trivial spark/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala:316: inferred existential type Option[(Class[_$4], org.apache.spark.sql.sources.BaseRelation)] forSome { type _$4 }, which cannot be expressed by wildcards, should be enabled by making the implicit value scala.language.existentials visible. This can be achieved by adding the import clause 'import scala.language.existentials' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6223) Avoid Build warning- enable implicit value scala.language.existentials visible
[ https://issues.apache.org/jira/browse/SPARK-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352866#comment-14352866 ] Apache Spark commented on SPARK-6223: - User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/4948 Avoid Build warning- enable implicit value scala.language.existentials visible -- Key: SPARK-6223 URL: https://issues.apache.org/jira/browse/SPARK-6223 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Vinod KC Priority: Trivial spark/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala:316: inferred existential type Option[(Class[_$4], org.apache.spark.sql.sources.BaseRelation)] forSome { type _$4 }, which cannot be expressed by wildcards, should be enabled by making the implicit value scala.language.existentials visible. This can be achieved by adding the import clause 'import scala.language.existentials' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6225) Resolve most build warnings, 1.3.0 edition
[ https://issues.apache.org/jira/browse/SPARK-6225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352984#comment-14352984 ] Apache Spark commented on SPARK-6225: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4950 Resolve most build warnings, 1.3.0 edition -- Key: SPARK-6225 URL: https://issues.apache.org/jira/browse/SPARK-6225 Project: Spark Issue Type: Improvement Components: MLlib, Spark Core, SQL, Streaming Affects Versions: 1.3.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Post-1.3.0, I think it would be a good exercise to resolve a number of build warnings that have accumulated recently. See for example efforts begun at https://github.com/apache/spark/pull/4948 https://github.com/apache/spark/pull/4900 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353129#comment-14353129 ] Sean Owen commented on SPARK-3066: -- My anecdotal experience with it was that getting an order-of-magnitude speedup meant losing a small but noticeable amount of quality in the top recommendations. That is, you would fail to consider as candidates some of the items that were actually top recs. The most actionable test / implementation I have to show this for ALS is ... https://github.com/cloudera/oryx/blob/master/als-common/src/it/java/com/cloudera/oryx/als/common/candidate/LocationSensitiveHashIT.java This could let you run tests for a certain scale, certain degree of hashing, etc., if you wanted to. I've actually tried to avoid needing LSH just for speed in order to avoid this tradeoff. Anyway for papers? I found this pretty complex treatment: http://papers.nips.cc/paper/5329-asymmetric-lsh-alsh-for-sublinear-time-maximum-inner-product-search-mips.pdf This has a little info on the quality of LSH: https://fruct.org/sites/default/files/files/conference15/Ponomarev_LSH_P2P.pdf It's one of those things where I'm sure it can be done better than the basic ways I know to do it, but haven't yet found a killer paper. Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. 
We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
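The three steps above can be sketched in plain Scala (an illustrative stand-in only: real code would broadcast the collected factor matrix to executors and call native level-3 BLAS for step 2, and the tiny factor matrices here are made-up data):

```scala
// Hypothetical rank-2 factors: one row per user / per product.
val userFactors    = Array(Array(1.0, 0.0), Array(0.5, 0.5))
val productFactors = Array(Array(0.2, 0.8), Array(0.9, 0.1), Array(0.4, 0.4))

def topK(k: Int): Array[Array[Int]] =
  userFactors.map { u =>
    // Step 2 stand-in: inner products against every product vector
    // (level-3 BLAS would score a whole block of users at once).
    val scores = productFactors.map(p => u.zip(p).map { case (a, b) => a * b }.sum)
    // Step 3 stand-in: keep the k highest-scoring product indices per user.
    scores.zipWithIndex.sortBy(-_._1).map(_._2).take(k)
  }
```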
[jira] [Updated] (SPARK-6188) Instance types can be mislabeled when re-starting cluster with default arguments
[ https://issues.apache.org/jira/browse/SPARK-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6188: - Shepherd: (was: Josh Rosen) Assignee: Theodore Vasiloudis Instance types can be mislabeled when re-starting cluster with default arguments Key: SPARK-6188 URL: https://issues.apache.org/jira/browse/SPARK-6188 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1 Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Priority: Minor Fix For: 1.4.0 This was discovered when investigating https://issues.apache.org/jira/browse/SPARK-5838. In short, when restarting a cluster that you launched with an alternative instance type, you have to provide the instance type(s) again in the {{./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>}} command. Otherwise it gets set to the default m1.large. This then affects the setup of the machines. I'll submit a pull request that takes care of this, so the user does not need to provide the instance type(s) again. EDIT: Example case where this becomes a problem: 1. User launches a cluster with instances that have 1 disk, e.g. m3.large. 2. The user stops the cluster. 3. When the user restarts the cluster with the start command without providing the instance type, the setup is performed using the default instance type, m1.large, which assumes 2 disks are present in the machine. 4. SPARK_LOCAL_DIRS is then set to /mnt/spark,/mnt2/spark. /mnt2 corresponds to the snapshot partition in an m3.large instance, which is only 8GB in size. When the user runs jobs that shuffle data, this partition fills up quickly, resulting in failed jobs due to "No space left on device" errors. Apart from this example one could come up with other cases where the setup of the machines is wrong due to assuming that they are of type m1.large. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6223) Avoid Build warning- enable implicit value scala.language.existentials visible
[ https://issues.apache.org/jira/browse/SPARK-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6223. -- Resolution: Duplicate I think this is a subset of a larger logical change, to clean up all similar build warnings at once. Avoid Build warning- enable implicit value scala.language.existentials visible -- Key: SPARK-6223 URL: https://issues.apache.org/jira/browse/SPARK-6223 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Vinod KC Priority: Trivial spark/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala:316: inferred existential type Option[(Class[_$4], org.apache.spark.sql.sources.BaseRelation)] forSome { type _$4 }, which cannot be expressed by wildcards, should be enabled by making the implicit value scala.language.existentials visible. This can be achieved by adding the import clause 'import scala.language.existentials' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5986) Model import/export for KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353117#comment-14353117 ] Xusen Yin commented on SPARK-5986: -- Got it. Do you mind assigning SPARK-5991 to me? Thanks! Model import/export for KMeansModel --- Key: SPARK-5986 URL: https://issues.apache.org/jira/browse/SPARK-5986 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin Support save/load for KMeansModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353030#comment-14353030 ] Cheng Lian commented on SPARK-6201: --- Played with Hive's implicit type conversion a bit more and found that Hive actually converts integers to strings in your case:
{code:sql}
hive> create table t1 as select '1.00' as c1;
hive> select * from t1 where c1 in (1.0);
{code}
If {{c1}} were converted to numeric, then {{'1.00'}} would appear in the result. However, the result set is empty. References: # [Implicit type coercion support in existing database systems|http://chapeau.freevariable.com/2014/08/existing-system-coercion.html] by William Benton # [{{GenericUDFIn.initialize}}|https://github.com/apache/hive/blob/release-0.13.1/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIn.java#L84-L100] INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang Suppose we have the following table:
{code}
sqlc.jsonRDD(sc.parallelize(Seq("""{"a": "1"}""", """{"a": "2"}""", """{"a": "3"}"""))).registerTempTable("d")
{code}
The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}
Then,
{code}
sql("select * from d where (d.a = 1 or d.a = 2)").collect
// => Array([1], [2])
{code}
where {{d.a}} and the constants 1, 2 are first cast to Double for the comparison, as you can see in the plan:
{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType)))
{noformat}
However, if I use
{code}
sql("select * from d where d.a in (1,2)").collect
{code}
the result is empty. The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}
*It seems the INSET implementation in SparkSQL doesn't coerce types implicitly, where Hive does. 
We should make SparkSQL conform to Hive's behavior, even though doing implicit coercion here is very confusing for comparing String and Int.* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
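As a workaround until INSET coerces types, casting the column explicitly puts both sides under one type, matching the OR-of-equalities behavior. A hedged sketch against the table registered in the report (assuming the same sqlc context; untested against 1.2/1.3):

```scala
// Explicit cast so the IN list and the column compare as Double,
// sidestepping the missing coercion in INSET.
sql("select * from d where cast(d.a as double) in (1.0, 2.0)").collect()
```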
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353114#comment-14353114 ] Joseph K. Bradley commented on SPARK-3066: -- Oops, true, not an actual metric. LSH sounds reasonable. Do you know of use cases or how well it's been found to work for recommendation problems? Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Amelot updated SPARK-6227: - Affects Version/s: 1.2.1 PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot Priority: Minor The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5986) Model import/export for KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353109#comment-14353109 ] Joseph K. Bradley commented on SPARK-5986: -- I'd recommend doing the 2 separately to make smaller PRs. The Python ones can be matched with a subtask from this JIRA: [https://issues.apache.org/jira/browse/SPARK-5991] Thanks! Model import/export for KMeansModel --- Key: SPARK-5986 URL: https://issues.apache.org/jira/browse/SPARK-5986 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin Support save/load for KMeansModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6227) PCA and SVD for PySpark
Julien Amelot created SPARK-6227: Summary: PCA and SVD for PySpark Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Julien Amelot Priority: Minor The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
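For reference, the Scala API that would need Python bindings is on RowMatrix. A sketch using the methods shown in the linked documentation (`sc` is an existing SparkContext; requires a Spark runtime):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)))
val mat = new RowMatrix(rows)

// Top-k principal components, returned as a local n x k matrix.
val pc = mat.computePrincipalComponents(2)

// Truncated SVD; computeU = true also materializes U as a distributed matrix.
val svd = mat.computeSVD(2, computeU = true)
```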
[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353047#comment-14353047 ] Sean Owen commented on SPARK-5134: -- Yep, I confirmed that ...
{code}
[INFO] \- org.apache.spark:spark-core_2.10:jar:1.2.1:compile
...
[INFO]    +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO]    |  |  +- commons-cli:commons-cli:jar:1.2:compile
...
{code}
Well, FWIW, although unintentional I do think there are upsides to this change. It would be good to codify that in the build, I suppose, by updating the default version number. How about updating to 2.2.0 to match what has actually happened? This would not entail activating the Hadoop build profiles by default or anything. [~rdub] would you care to do the honors? Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
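Concretely, the suggestion would amount to a one-property change in the parent pom.xml. A sketch (property name as it appears in Spark's build of that era; the hadoop-* profiles would continue to override it):

```xml
<!-- Default Hadoop version when no hadoop-* profile is active.
     Bumped from 1.0.4 to 2.2.0 to match what published artifacts
     already pull in transitively. -->
<hadoop.version>2.2.0</hadoop.version>
```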
[jira] [Created] (SPARK-6226) Support model save/load in Python's KMeans
Joseph K. Bradley created SPARK-6226: Summary: Support model save/load in Python's KMeans Key: SPARK-6226 URL: https://issues.apache.org/jira/browse/SPARK-6226 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4734) [Streaming]limit the file Dstream size for each batch
[ https://issues.apache.org/jira/browse/SPARK-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4734. -- Resolution: Won't Fix I feel strongly that this sort of change introduces new problems and doesn't solve problems. It can't accelerate a streaming system's throughput; this is about smoothing peakiness, which is what message queueing systems are for. This would attempt to design a simple queueing system without the fault tolerance. In some cases, a larger batch duration also suffices to smooth peaks. [Streaming]limit the file Dstream size for each batch - Key: SPARK-4734 URL: https://issues.apache.org/jira/browse/SPARK-4734 Project: Spark Issue Type: New Feature Components: Streaming Reporter: 宿荣全 Priority: Minor File streaming scans new files on HDFS and processes them in each batch. The current implementation has some problems: 1. When the number of new files (or their total size) in a batch segment is very large, the processing time becomes very long and may exceed the slideDuration, which eventually delays dispatch of the next batch. 2. When the total size of the file DStream in one batch is very large, shuffling that data multiplies its memory occupation several times over, and the application slows down or is even terminated by the operating system. So if we set an upper limit on the input data per batch to control the batch processing time, both the job dispatch delay and the processing delay would be alleviated. Modification: add a new parameter spark.streaming.segmentSizeThreshold to InputDStream (the input data base class); the size of each batch's segments is set via this parameter from [spark-defaults.conf] or in source. All implementing classes of InputDStream act on segmentSizeThreshold accordingly. 
This patch modifies FileInputDStream: when new files are found, their names and sizes are put in a queue, and elements are packaged into batches whose total size stays within segmentSizeThreshold. Please see the source for the detailed logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
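The batching the patch describes can be sketched independently of Spark (a hypothetical helper; the real change lives inside FileInputDStream, and spark.streaming.segmentSizeThreshold is the config name proposed in the description):

```scala
import scala.collection.mutable

// Drain queued (fileName, sizeInBytes) entries into one batch until adding
// the next file would push the batch past the configured threshold.
def takeBatch(queue: mutable.Queue[(String, Long)],
              segmentSizeThreshold: Long): Seq[String] = {
  val batch = mutable.ArrayBuffer.empty[String]
  var total = 0L
  while (queue.nonEmpty && total + queue.head._2 <= segmentSizeThreshold) {
    val (name, size) = queue.dequeue()
    total += size
    batch += name
  }
  // Always make progress: a single oversized file is taken on its own.
  if (batch.isEmpty && queue.nonEmpty) batch += queue.dequeue()._1
  batch.toSeq
}
```

Leftover files simply stay queued for the next batch, which is what smooths a burst of new files across intervals.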
[jira] [Created] (SPARK-6230) Provide authentication and encryption for Spark's RPC
Marcelo Vanzin created SPARK-6230: - Summary: Provide authentication and encryption for Spark's RPC Key: SPARK-6230 URL: https://issues.apache.org/jira/browse/SPARK-6230 Project: Spark Issue Type: Sub-task Reporter: Marcelo Vanzin Make sure the RPC layer used by Spark supports the auth and encryption features of the network/common module. This kinda ignores akka; adding support for SASL to akka, while possible, seems to be at odds with the direction being taken in Spark, so let's restrict this to the new RPC layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353225#comment-14353225 ] Joseph K. Bradley commented on SPARK-3066: -- Thanks for the references! I'll take a look, but based on what you say, perhaps focusing on BLAS is the best path for now. Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248 ] Jianshi Huang edited comment on SPARK-6201 at 3/9/15 5:40 PM: -- Implicit coercion outside the numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then please think about adding a switch, say bq. spark.sql.strict_mode = true(default) / false Jianshi was (Author: huangjs): Implicit coercion outside the numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then please think about adding a switch, say ``` spark.sql.strict_mode = true(default) / false ``` Jianshi INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang Suppose we have the following table:
{code}
sqlc.jsonRDD(sc.parallelize(Seq("""{"a": "1"}""", """{"a": "2"}""", """{"a": "3"}"""))).registerTempTable("d")
{code}
The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}
Then,
{code}
sql("select * from d where (d.a = 1 or d.a = 2)").collect
// => Array([1], [2])
{code}
where {{d.a}} and the constants 1, 2 are first cast to Double for the comparison, as you can see in the plan:
{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType)))
{noformat}
However, if I use
{code}
sql("select * from d where d.a in (1,2)").collect
{code}
the result is empty. The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}
*It seems the INSET implementation in SparkSQL doesn't coerce types implicitly, where Hive does. 
We should make SparkSQL conform to Hive's behavior, even though doing implicit coercion here is very confusing for comparing String and Int.* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248 ] Jianshi Huang commented on SPARK-6201: -- Implicit coercion outside the Numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then please consider adding a switch, say spark.sql.strict_mode = true (default) / false. Jianshi
INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
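The missing step the issue describes can be sketched outside Spark. This is a hypothetical illustration, not Spark's actual `InSet` code: before testing membership, cast the (string) column value and every IN-list literal to a common numeric type, the same way the binary `=` path casts both sides to Double.

```scala
// Hypothetical sketch of the coercion INSET lacks (assumption: Double is the
// common type, mirroring the CAST(..., DoubleType) seen in the `=` plan).
object InsetCoercionSketch {
  def coercedIn(value: String, inList: Seq[Int]): Boolean = {
    val casted = value.toDouble           // CAST(a, DoubleType)
    inList.exists(_.toDouble == casted)   // compare in the common type
  }
}
```

With this coercion, `"1" IN (1, 2)` holds even though the column is a string, matching the behavior of the equivalent OR-of-equalities query.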
[jira] [Comment Edited] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248 ] Jianshi Huang edited comment on SPARK-6201 at 3/9/15 5:39 PM: -- Implicit coercion outside the Numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then please consider adding a switch, say
```
spark.sql.strict_mode = true (default) / false
```
Jianshi
was (Author: huangjs): Implicit coercion outside the Numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then please consider adding a switch, say spark.sql.strict_mode = true (default) / false. Jianshi
INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353030#comment-14353030 ] Cheng Lian edited comment on SPARK-6201 at 3/9/15 5:10 PM: --- Played with Hive's implicit type conversion a bit more and found that Hive actually converts the numbers to strings in your case:
{code:sql}
hive> create table t1 as select '1.00' as c1;
hive> select * from t1 where c1 in (1.0);
{code}
If {{c1}} were converted to numeric, then {{1.00}} should appear in the result. However, the result set is empty. For the expression {{1.00 IN (1.0)}}, a {{GenericUDFIn}} instance is created and called with the argument list {{(1.00, 1.0)}}. Then {{GenericUDFIn}} tries to convert all arguments into a common data type from left to right. Since double is allowed to be converted to string, {{1.0}} is converted into the string {{1.0}}.
References: # [Implicit type coercion support in existing database systems|http://chapeau.freevariable.com/2014/08/existing-system-coercion.html] by William Benton # [{{GenericUDFIn.initialize}}|https://github.com/apache/hive/blob/release-0.13.1/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIn.java#L84-L100]
was (Author: lian cheng): Played with Hive's implicit type conversion a bit more and found that Hive actually converts the numbers to strings in your case:
{code:sql}
hive> create table t1 as select '1.00' as c1;
hive> select * from t1 where c1 in (1.0);
{code}
If {{c1}} were converted to numeric, then {{1.00}} should appear in the result. However, the result set is empty.
References: # [Implicit type coercion support in existing database systems|http://chapeau.freevariable.com/2014/08/existing-system-coercion.html] by William Benton # [{{GenericUDFIn.initialize}}|https://github.com/apache/hive/blob/release-0.13.1/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIn.java#L84-L100]
INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6229) Support encryption in network/common module
Marcelo Vanzin created SPARK-6229: - Summary: Support encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
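A minimal sketch of the knob involved, assuming the standard JDK SASL API (not Spark's wrapper code): encryption, as opposed to authentication alone, is requested through the quality-of-protection property when the DIGEST-MD5 client or server is created.

```scala
// Sketch only: the SASL quality-of-protection levels are "auth"
// (authentication only), "auth-int" (adds integrity checks), and
// "auth-conf" (adds confidentiality, i.e. encryption of the connection).
import javax.security.sasl.Sasl

object SaslEncryptionSketch {
  val encryptionProps = new java.util.HashMap[String, String]()
  encryptionProps.put(Sasl.QOP, "auth-conf")
}
```

These properties would be passed to `Sasl.createSaslClient`/`createSaslServer`; with "auth-conf" negotiated, payloads go through the mechanism's `wrap`/`unwrap` calls.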
[jira] [Comment Edited] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353030#comment-14353030 ] Cheng Lian edited comment on SPARK-6201 at 3/9/15 5:13 PM: --- Played with Hive's implicit type conversion a bit more and found that Hive actually converts the numbers to strings in your case:
{code:sql}
hive> create table t1 as select '1.00' as c1;
hive> select * from t1 where c1 in (1.0);
{code}
If {{c1}} were converted to numeric, then {{1.00}} should appear in the result. However, the result set is empty. For the expression {{1.00 IN (1.0)}}, a {{GenericUDFIn}} instance is created and called with the argument list {{(1.00, 1.0)}}. Then {{GenericUDFIn.initialize}} tries to convert all arguments into a common data type from left to right. Since double is allowed to be converted to string, {{1.0}} is converted into the string {{1.0}}.
References: # [Implicit type coercion support in existing database systems|http://chapeau.freevariable.com/2014/08/existing-system-coercion.html] by William Benton # [{{GenericUDFIn.initialize}}|https://github.com/apache/hive/blob/release-0.13.1/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIn.java#L84-L100]
was (Author: lian cheng): Played with Hive's implicit type conversion a bit more and found that Hive actually converts the numbers to strings in your case:
{code:sql}
hive> create table t1 as select '1.00' as c1;
hive> select * from t1 where c1 in (1.0);
{code}
If {{c1}} were converted to numeric, then {{1.00}} should appear in the result. However, the result set is empty. For the expression {{1.00 IN (1.0)}}, a {{GenericUDFIn}} instance is created and called with the argument list {{(1.00, 1.0)}}. Then {{GenericUDFIn}} tries to convert all arguments into a common data type from left to right. Since double is allowed to be converted to string, {{1.0}} is converted into the string {{1.0}}.
References: # [Implicit type coercion support in existing database systems|http://chapeau.freevariable.com/2014/08/existing-system-coercion.html] by William Benton # [{{GenericUDFIn.initialize}}|https://github.com/apache/hive/blob/release-0.13.1/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIn.java#L84-L100]
INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
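The Hive behavior described above can be re-told in a few lines of plain Scala. This is an assumption-labeled paraphrase, not Hive's code: when the IN arguments are widened left-to-right and the common type ends up being string, `'1.00' IN (1.0)` compares the strings "1.00" and "1.0", which differ, so the row is dropped.

```scala
// Paraphrase of the GenericUDFIn behavior described in the comment (an
// illustration, not Hive's implementation): the double literal is rendered
// as a string and compared to the string column value.
object CommonTypeInSketch {
  def stringTypedIn(column: String, literals: Seq[Double]): Boolean =
    literals.exists(lit => lit.toString == column) // both sides land on string
}
```

So `stringTypedIn("1.00", Seq(1.0))` is false, matching the empty result set observed in the Hive session above.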
[jira] [Commented] (SPARK-3278) Isotonic regression
[ https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353252#comment-14353252 ] Vladimir Vladimirov commented on SPARK-3278: Has anyone benchmarked the performance of the Spark isotonic regression implementation on big datasets (100M, 1000M)? Isotonic regression --- Key: SPARK-3278 URL: https://issues.apache.org/jira/browse/SPARK-3278 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Martin Zapletal Fix For: 1.3.0 Add isotonic regression for score calibration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
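For readers unfamiliar with what the feature computes, here is a self-contained pool-adjacent-violators (PAV) sketch in plain Scala. It is not MLlib's implementation (which runs PAV per partition on an RDD); it just shows the fit a benchmark would be measuring: the non-decreasing sequence minimizing squared error.

```scala
// Minimal pool-adjacent-violators sketch: merge adjacent blocks while the
// previous block's mean violates the non-decreasing constraint, then expand
// each block back to its members' shared mean.
object PavSketch {
  def fit(y: Seq[Double]): Seq[Double] = {
    val blocks = scala.collection.mutable.ArrayBuffer.empty[(Double, Int)] // (sum, size)
    for (v <- y) {
      var (s, n) = (v, 1)
      // pool backwards while monotonicity is violated
      while (blocks.nonEmpty && blocks.last._1 / blocks.last._2 >= s / n) {
        val (ps, pn) = blocks.remove(blocks.length - 1)
        s += ps; n += pn
      }
      blocks += ((s, n))
    }
    blocks.toSeq.flatMap { case (s, n) => Seq.fill(n)(s / n) }
  }
}
```

For example, the out-of-order input 1, 3, 2 pools the last two points into their mean 2.5, yielding the isotonic fit 1, 2.5, 2.5.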
[jira] [Commented] (SPARK-5986) Model import/export for KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353221#comment-14353221 ] Joseph K. Bradley commented on SPARK-5986: -- Subtask assigned Model import/export for KMeansModel --- Key: SPARK-5986 URL: https://issues.apache.org/jira/browse/SPARK-5986 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin Support save/load for KMeansModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6226) Support model save/load in Python's KMeans
[ https://issues.apache.org/jira/browse/SPARK-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6226: - Assignee: Xusen Yin Support model save/load in Python's KMeans -- Key: SPARK-6226 URL: https://issues.apache.org/jira/browse/SPARK-6226 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6228) Provide SASL support in network/common module
Marcelo Vanzin created SPARK-6228: - Summary: Provide SASL support in network/common module Key: SPARK-6228 URL: https://issues.apache.org/jira/browse/SPARK-6228 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin Currently, there's support for SASL in network/shuffle, but not in network/common. Moving the SASL code to network/common would enable other applications using that code to also support secure authentication and, later, encryption. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6219) Expand Python lint checks to check for compilation errors
[ https://issues.apache.org/jira/browse/SPARK-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353325#comment-14353325 ] Nicholas Chammas commented on SPARK-6219: - That's a good point; I haven't checked to see what's already covered in that way by unit tests. At the very least, I can say that this will catch stuff in spark-ec2 and the examples that unit tests currently do not cover. Also, it runs very, very quickly. Expand Python lint checks to check for compilation errors -- Key: SPARK-6219 URL: https://issues.apache.org/jira/browse/SPARK-6219 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Priority: Minor An easy lint check for Python would be to make sure the code at least compiles. That will catch only the most egregious errors, but it should help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4911) Report the inputs and outputs of Spark jobs so that external systems can track data lineage
[ https://issues.apache.org/jira/browse/SPARK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353598#comment-14353598 ] Ted Malaska commented on SPARK-4911: Thanks [~sandyr]. Yes, this would be very helpful. Today there is a good bit of this information in the logs, but it is not standardized. I'm not 100% sure what the best solution is, but it is needed. I would be happy to help review. Ted Malaska Report the inputs and outputs of Spark jobs so that external systems can track data lineage Key: SPARK-4911 URL: https://issues.apache.org/jira/browse/SPARK-4911 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza When Spark runs a job, it would be useful to log its filesystem inputs and outputs somewhere. This allows external tools to track which persisted datasets are derived from other persisted datasets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4600) org.apache.spark.graphx.VertexRDD.diff does not work
[ https://issues.apache.org/jira/browse/SPARK-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353627#comment-14353627 ] Ankur Dave commented on SPARK-4600: --- As I wrote in SPARK-6022, this is the documented behavior for diff: for each vertex present in _both_ A and B, it returns only those with differing values. Also, it's only guaranteed to work if the VertexRDDs share a common ancestor, which is not true in this test. We don't currently have a set difference operator, which would do what you expect. org.apache.spark.graphx.VertexRDD.diff does not work Key: SPARK-4600 URL: https://issues.apache.org/jira/browse/SPARK-4600 Project: Spark Issue Type: Bug Components: GraphX Environment: scala 2.10.4 spark 1.1.0 Reporter: Teppei Tosa Assignee: Brennon York Labels: graphx VertexRDD.diff doesn't work. For example:
{code}
val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt)))
setA.collect.foreach(println(_))
// (0,0)
// (1,1)
val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt)))
setB.collect.foreach(println(_))
// (1,1)
// (2,2)
val diff = setA.diff(setB)
diff.collect.foreach(println(_))
// printed none
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
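The two semantics in this thread can be contrasted with plain Maps standing in for VertexRDDs. This is an illustrative sketch, not GraphX code: the documented `diff` keeps vertices present in *both* sides whose values differ (returning the other side's value), while the missing set-difference operator would keep this side's vertices that are absent from the other.

```scala
// Plain-Map sketch of the two operations discussed (illustration only).
object VertexDiffSketch {
  // Documented `diff` semantics: vertices in both maps whose values changed.
  def diff[K, V](a: Map[K, V], b: Map[K, V]): Map[K, V] =
    b.filter { case (k, v) => a.get(k).exists(_ != v) }

  // Hypothetical set-difference the reporter expected.
  def minus[K, V](a: Map[K, V], b: Map[K, V]): Map[K, V] =
    a.filterNot { case (k, _) => b.contains(k) }
}
```

On the reporter's data, `diff(Map(0L -> 0, 1L -> 1), Map(1L -> 1, 2L -> 2))` is empty, because the only shared vertex (1L) has an unchanged value; that matches the "printed none" observation, while `minus` would return the 0L entry.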
[jira] [Commented] (SPARK-6190) create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
[ https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353687#comment-14353687 ] Imran Rashid commented on SPARK-6190: - Another observation as I've dug into the implementation a little more. {{LargeByteBuf}} (the version which wraps netty's {{ByteBuf}}) isn't necessary on the *sending* side. The large messages are in {{ChunkFetchSuccess}}, which is either sending a file region or something that is already available in an nio {{ByteBuffer}}. We do still need a {{LargeByteBuf}} to make it easy for the *receiving* end to get these messages, though. (Example decoder: https://github.com/apache/spark/blob/5e83a55daa30a19840214f77681248e112635bf6/network/common/src/main/java/org/apache/spark/network/protocol/FixedChunkLargeFrameDecoder.java) But that simplifies the API it needs to expose -- the encoding/decoding can still be done on {{ByteBuf}}, and we just expose the bulk of the data in the {{LargeByteBuf}} via conversion to an nio {{LargeByteBuffer}}, exposed as an {{InputStream}}. Here is a WIP branch that demonstrates the basics of transferring large blocks, though it's got a fair amount of cleanup necessary before you look too closely. I'm pretty sure some of the functionality here is not needed (e.g. {{LargeByteBuf#getInt}}, since the decoding can really just use one of the {{ByteBuf}}s). https://github.com/apache/spark/compare/master...squito:SPARK-6190_largeBB create LargeByteBuffer abstraction for eliminating 2GB limit on blocks -- Key: SPARK-6190 URL: https://issues.apache.org/jira/browse/SPARK-6190 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Imran Rashid Assignee: Imran Rashid Attachments: LargeByteBuffer.pdf A key component in eliminating the 2GB limit on blocks is creating a proper abstraction for storing more than 2GB. Currently Spark is limited by a reliance on nio ByteBuffer and netty ByteBuf, both of which are limited to 2GB. This task will introduce the new abstraction and the relevant implementation and utilities, without affecting the existing implementation at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
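The core idea of such an abstraction can be sketched in a few lines. This is a toy illustration, not the API proposed in the attached design doc: present several chunks, each individually under the 2GB array/buffer limit, as one logically contiguous buffer addressed by a Long position.

```scala
// Toy chunked-buffer sketch (illustration only): a Long-addressable view
// over several byte arrays, each of which stays under the 2GB Int limit.
class ChunkedBuf(chunks: Seq[Array[Byte]]) {
  val size: Long = chunks.map(_.length.toLong).sum

  // Walk the chunk list to resolve a Long position to (chunk, Int offset).
  def get(pos: Long): Byte = {
    var remaining = pos
    for (c <- chunks) {
      if (remaining < c.length) return c(remaining.toInt)
      remaining -= c.length
    }
    throw new IndexOutOfBoundsException(pos.toString)
  }
}
```

The interesting part is purely the addressing: positions and sizes are Longs at the outer interface, while every actual read bottoms out in an Int-indexed chunk.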
[jira] [Commented] (SPARK-6211) Test Python Kafka API using Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353737#comment-14353737 ] Tathagata Das commented on SPARK-6211: -- That's a good point. That requires the kafka-assembly. Fortunately, the external/kafka-assembly project already exists, which creates the JAR. All you need to figure out is how to add the generated kafka-assembly.jar to the Java classpath when pyspark is called for running tests. 1. bin/pyspark already calls bin/spark-submit, which can take the kafka-assembly jar as --jars path-to-jar. 2. The Python tests are run with bin/python/run-tests, which calls bin/pyspark. Please take a look at those to figure out how we can pass on the kafka assembly with --jars for the Kafka Python tests. Does that make sense? Test Python Kafka API using Python unit tests - Key: SPARK-6211 URL: https://issues.apache.org/jira/browse/SPARK-6211 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Tathagata Das Assignee: Saisai Shao Priority: Critical This is tricky in Python because KafkaStreamSuiteBase (which has the functionality for creating embedded Kafka clusters) is in the test package, which is not in the Python path. To fix that, we have two ways: 1. Add the test jar to the classpath in the Python test. That's kind of tricky. 2. Bring that into the src package (maybe renamed to KafkaTestUtils), and then wrap it in Python to use it from Python. If (2) does not add any extra test dependencies to the main Kafka pom, then (2) should be simpler to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Description: Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms
---
--- Time: 142592237 ms
---
(1.0,4.0)
--- Time: 1425922371000 ms
---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions: * the receiver is from a queue (StreamingContext.queueStream) * if the commented-out print is un-commented * if (window + reduceByKey) is substituted with reduceByKeyAndWindow
Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
was: Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are enough for the program to stall with the same output as above. The "Stage..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions: * the receiver is from a queue (StreamingContext.queueStream) * if the commented-out print is un-commented * if the window+reduce is substituted with reduceByKeyAndWindow
The test application is the same as the one listed above.
Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical
[jira] [Commented] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353487#comment-14353487 ] Sean Owen commented on SPARK-6232: -- I can't reproduce this, although I just tried a slightly different setup, using spark-shell and the latest master (~1.3.0). I used the code above otherwise, including words.print(). Maybe you can try that instead, just to see whether it works for you? That could narrow things down. You have at least two cores available, right? If you had only one core, the receiver could starve the workers. I have also in the past observed that Spark programs sometimes don't work right as an {{App}} due to weird closure problems; declaring a simple main() method resolves those. That's a wild guess, though. Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Tried builds with Scala 2.11 and 2.10 (for the Kafka receiver). Also tried the pre-built spark-1.2.1-bin-hadoop2.4.tgz. The bug reproduces in all cases on 3 different computers we've tried. Reporter: Platon Potapov Priority: Critical
Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window would slide over several non-empty RDDs). 2-3 such iterations are enough for the program to stall completely (no new events are processed) with the following output:
{code}
--- Time: 1425922369000 ms
---
--- Time: 142592237 ms
---
(1.0,4.0)
--- Time: 1425922371000 ms
---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases.
NOTE that the bug does not reproduce under the following conditions: * the receiver is from a queue (StreamingContext.queueStream) * if the commented-out print is un-commented * if (window + reduceByKey) is substituted with reduceByKeyAndWindow
Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4911) Report the inputs and outputs of Spark jobs so that external systems can track data lineage
[ https://issues.apache.org/jira/browse/SPARK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353554#comment-14353554 ] Sandy Ryza commented on SPARK-4911: --- I know that [~malaskat] has played around with a solution that hacks around the lack of this, but it's definitely something still needed. Report the inputs and outputs of Spark jobs so that external systems can track data lineage Key: SPARK-4911 URL: https://issues.apache.org/jira/browse/SPARK-4911 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza When Spark runs a job, it would be useful to log its filesystem inputs and outputs somewhere. This allows external tools to track which persisted datasets are derived from other persisted datasets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)
[ https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353573#comment-14353573 ] Sean Owen commented on SPARK-5368: -- I feel qualified enough to review doc or config changes. It seems harmless enough if the intent is to allow passing some config straight through, and in fact it should already be possible in general. I don't know the Akka bits that well, but we can CC people who do. I know it may not be trivial to update to Akka 2.4, if that's what's on deck here. Anyway, is there a PR to look at? Spark should support NAT (via akka improvements) - Key: SPARK-5368 URL: https://issues.apache.org/jira/browse/SPARK-5368 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: jay vyas Fix For: 1.2.2 Spark sets up actors for akka with a set of variables which are defined in the {{AkkaUtils.scala}} class. A snippet: {noformat} 98 |akka.loggers = [akka.event.slf4j.Slf4jLogger] 99 |akka.stdout-loglevel = ERROR 100 |akka.jvm-exit-on-fatal-error = off 101 |akka.remote.require-cookie = $requireCookie 102 |akka.remote.secure-cookie = $secureCookie {noformat} We should allow users to pass in custom settings, for example, so that arbitrary akka modifications can be used at runtime for security, performance, logging, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
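One way the "pass arbitrary settings straight through" idea could look is sketched below. The `spark.akka.raw.` prefix is hypothetical (it is not an existing Spark configuration namespace); the sketch just strips a reserved prefix and emits the remainder as akka config lines to append to the generated block.

```scala
// Hypothetical pass-through sketch: any key under the (made-up) prefix
// "spark.akka.raw." is forwarded verbatim into the akka config string.
object AkkaPassThroughSketch {
  def render(conf: Map[String, String]): Seq[String] =
    conf.collect {
      case (k, v) if k.startsWith("spark.akka.raw.") =>
        s"${k.stripPrefix("spark.akka.raw.")} = $v"
    }.toSeq
}
```

For example, setting `spark.akka.raw.akka.remote.netty.tcp.bind-hostname` would surface the NAT-related akka option without Spark having to know about it.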
[jira] [Commented] (SPARK-6022) GraphX `diff` test incorrectly operating on values (not VertexId's)
[ https://issues.apache.org/jira/browse/SPARK-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353617#comment-14353617 ] Ankur Dave commented on SPARK-6022: --- [~maropu] is correct: the original intent of diff was to operate on values, not VertexIds. It was really written for internal use in [mapVertices|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L133] and [outerJoinVertices|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L284], which use it to find the set of vertices whose values have changed so they can ship only those to the edge partitions. Based on your test you're looking for the set difference. Maybe you could introduce a new method called minus? GraphX `diff` test incorrectly operating on values (not VertexId's) --- Key: SPARK-6022 URL: https://issues.apache.org/jira/browse/SPARK-6022 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York The current GraphX {{diff}} test operates on values rather than the VertexId's and, if {{diff}} were working properly (per [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600]), it should fail this test. The code to test {{diff}} should look like the below as it correctly generates {{VertexRDD}}'s with different {{VertexId}}'s to {{diff}} against. 
{code}
test("diff functionality with small concrete values") {
  withSpark { sc =>
    val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt)))
    // setA := Set((0L, 0), (1L, 1))
    val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt + 2)))
    // setB := Set((1L, 3), (2L, 4))
    val diff = setA.diff(setB)
    assert(diff.collect.toSet == Set((2L, 4)))
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
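A set-difference {{minus}} as suggested in the comment would compare only VertexIds and ignore values entirely. A plain-Scala sketch of those semantics, modeling a vertex set as a Map (hypothetical helper, not the GraphX API):

```scala
object MinusSketch {
  type VertexId = Long

  // Proposed "minus" semantics: keep the vertices of `a` whose ids do not
  // appear in `b`, regardless of the values on either side.
  def minus[V](a: Map[VertexId, V], b: Map[VertexId, V]): Map[VertexId, V] =
    a.filter { case (id, _) => !b.contains(id) }

  def main(args: Array[String]): Unit = {
    val setA = Map(0L -> 0, 1L -> 1) // {(0L, 0), (1L, 1)}
    val setB = Map(1L -> 3, 2L -> 4) // {(1L, 3), (2L, 4)}
    // Only id 0L is in A but not in B; the changed value on id 1L is irrelevant.
    println(minus(setA, setB))
  }
}
```

This keeps diff free to retain its value-based semantics for internal use in mapVertices/outerJoinVertices, while minus gives users the id-based set difference the test above is really exercising.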
[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs
[ https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353634#comment-14353634 ] Joseph K. Bradley commented on SPARK-6113: -- Thanks! I don't think there are blockers. I'm going to get started on the initial PR, and we should be able to get your 1 open PR for GBTs in before merging becomes an issue. Stabilize DecisionTree and ensembles APIs - Key: SPARK-6113 URL: https://issues.apache.org/jira/browse/SPARK-6113 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical *Issue*: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design. *Proposal*: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details. [Design doc | https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]: This outlines current issues and the proposed API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6234) 10% Performance regression with Breeze upgrade
Nishkam Ravi created SPARK-6234: --- Summary: 10% Performance regression with Breeze upgrade Key: SPARK-6234 URL: https://issues.apache.org/jira/browse/SPARK-6234 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi KMeans regresses by 10% with the Breeze upgrade from 0.10 to 0.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)
[ https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353586#comment-14353586 ] Timothy St. Clair commented on SPARK-5368: -- [~sowen] IIRC there are other bugs around no longer maintaining an akka fork and updating to 2.4. https://issues.apache.org/jira/browse/SPARK-5293 Spark should support NAT (via akka improvements) - Key: SPARK-5368 URL: https://issues.apache.org/jira/browse/SPARK-5368 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: jay vyas Fix For: 1.2.2 Spark sets up actors for akka with a set of variables which are defined in the {{AkkaUtils.scala}} class. A snippet:
{noformat}
98 |akka.loggers = [akka.event.slf4j.Slf4jLogger]
99 |akka.stdout-loglevel = ERROR
100 |akka.jvm-exit-on-fatal-error = off
101 |akka.remote.require-cookie = $requireCookie
102 |akka.remote.secure-cookie = $secureCookie
{noformat}
We should allow users to pass in custom settings, for example, so that arbitrary akka modifications can be used at runtime for security, performance, logging, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5544) wholeTextFiles should recognize multiple input paths delimited by ,
[ https://issues.apache.org/jira/browse/SPARK-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353588#comment-14353588 ] Lev Khomich commented on SPARK-5544: I would like to work on this. [~mengxr], some things I need to clarify. 1. Should this behaviour also apply to `sc.newAPIHadoopFile` and `sc.binaryFiles`? 2. Does this change need to be explicitly documented elsewhere, because it can break backward compatibility (paths containing commas, for example)? wholeTextFiles should recognize multiple input paths delimited by , --- Key: SPARK-5544 URL: https://issues.apache.org/jira/browse/SPARK-5544 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Xiangrui Meng textFile takes delimited paths in a single path string. wholeTextFiles should behave the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
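Honoring comma-delimited input, as textFile already does, essentially means splitting the single path string before handing the pieces to Hadoop. A minimal sketch (hypothetical helper, not the actual patch), which also illustrates the backward-compatibility concern raised in the comment for paths that themselves contain commas:

```scala
object PathSplitSketch {
  // Split one comma-delimited path string into individual paths,
  // dropping empty segments (e.g. from a stray trailing comma).
  def splitPaths(paths: String): Seq[String] =
    paths.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    println(splitPaths("/data/a,/data/b")) // two inputs, as intended
    // Caveat: a single directory literally named "/data/x,y" would now be
    // misread as the two paths "/data/x" and "y" -- the compatibility risk
    // mentioned above, which is why the change may need documenting.
    println(splitPaths("/data/x,y"))
  }
}
```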
[jira] [Updated] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-677: - Target Version/s: 1.2.2, 1.4.0, 1.3.1 Affects Version/s: (was: 0.7.0) 1.4.0 1.3.0 1.0.2 1.1.1 1.2.1 Assignee: Davies Liu PySpark should not collect results through local filesystem --- Key: SPARK-677 URL: https://issues.apache.org/jira/browse/SPARK-677 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.2, 1.1.1, 1.3.0, 1.2.1, 1.4.0 Reporter: Josh Rosen Assignee: Davies Liu Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance. Instead, we should stream the data from Java to Python over a local socket or a FIFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353641#comment-14353641 ] Apache Spark commented on SPARK-677: User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4923 PySpark should not collect results through local filesystem --- Key: SPARK-677 URL: https://issues.apache.org/jira/browse/SPARK-677 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.2, 1.1.1, 1.3.0, 1.2.1, 1.4.0 Reporter: Josh Rosen Assignee: Davies Liu Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance. Instead, we should stream the data from Java to Python over a local socket or a FIFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
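The streaming-over-a-local-socket idea can be sketched in plain JVM terms: the driver serves the collected bytes on an ephemeral loopback port with simple length-prefix framing, and the Python side (stood in for by a plain client below) reads them back without touching the filesystem. All names are hypothetical; this is not the code from the PR above:

```scala
import java.io.{DataInputStream, DataOutputStream}
import java.net.{ServerSocket, Socket}

object LocalSocketCollect {
  // Serve one length-prefixed payload on an OS-chosen loopback port,
  // returning the port so a client can connect.
  def serve(payload: Array[Byte]): (Int, Thread) = {
    val server = new ServerSocket(0) // port 0: pick any free local port
    val t = new Thread(new Runnable {
      def run(): Unit = {
        val sock = server.accept()
        val out = new DataOutputStream(sock.getOutputStream)
        out.writeInt(payload.length) // framing: 4-byte length, then the bytes
        out.write(payload)
        out.close(); sock.close(); server.close()
      }
    })
    t.start()
    (server.getLocalPort, t)
  }

  // Stand-in for the Python side: read the framed payload back.
  def fetch(port: Int): Array[Byte] = {
    val sock = new Socket("localhost", port)
    val in = new DataInputStream(sock.getInputStream)
    val buf = new Array[Byte](in.readInt())
    in.readFully(buf)
    sock.close()
    buf
  }

  def main(args: Array[String]): Unit = {
    val (port, t) = serve("collected results".getBytes("UTF-8"))
    println(new String(fetch(port), "UTF-8"))
    t.join()
  }
}
```

The key property is that data never hits the buffer cache or physical disk; it flows driver-to-Python through the loopback interface.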
[jira] [Created] (SPARK-6233) Should spark.ml Models be distributed by default?
Joseph K. Bradley created SPARK-6233: Summary: Should spark.ml Models be distributed by default? Key: SPARK-6233 URL: https://issues.apache.org/jira/browse/SPARK-6233 Project: Spark Issue Type: Brainstorming Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley This JIRA is for discussing a potential change for the spark.ml package. *Issue*: When an Estimator runs, it often computes helpful side information which is not stored in the returned Model. (E.g., linear methods have RDDs of residuals.) It would be nice to have this information by default, rather than having to recompute it. *Suggestion*: Introduce a DistributedModel trait. Every Estimator in the spark.ml package should be able to return a distributed model with extra info computed during training. *Motivation*: This kind of info is one of the most useful aspects of R. E.g., when you train a linear model, you can immediately summarize or plot information about the residuals. For MLlib, the user currently has to take extra steps (and computation time) to recompute this info. *API*: My general idea is as follows.
{code}
trait Model

trait LocalModel extends Model

trait DistributedModel[LocalModelType <: LocalModel] extends Model {
  /** convert to local model */
  def toLocal: LocalModelType
}

class LocalLDAModel extends LocalModel

class DistributedLDAModel extends DistributedModel[LocalLDAModel] {
  def toLocal: LocalLDAModel = ???
}
{code}
*Issues with this API*:
* API stability: To keep the API stable in the future, either (a) all models should return DistributedModels, or (b) all models should return Models which can then be tested for the LocalModel or DistributedModel trait.
* memory “leaks”: Users may not expect models to store references to RDDs, so they may be surprised by how much storage is being used.
* naturally distributed models: Some models will simply be too large to be converted into LocalModels. It is unclear what to do here. 
*Is this worthwhile?* Pros: * Saving computation * Easier for users (skipping 1 more step of computing this info) Cons: * API issues * Limited savings on computation. In general, computing this info may take much less time than model training (e.g., computing residuals vs. training a GLM). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Description: Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window slides over several non-empty RDDs). 2-3 such iterations are enough for the program to stall completely (no new events are processed) with the following output:
{code}
--- Time: 1425922369000 ms ---

--- Time: 1425922370000 ms ---
(1.0,4.0)

--- Time: 1425922371000 ms ---
(1.0,4.0)

[Stage 17:=(1 + 0) / 2]
{code}
The "Stage" message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow
Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}

was: Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window slides over several non-empty RDDs). 2-3 such iterations are enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---

--- Time: 1425922370000 ms ---
(1.0,4.0)

--- Time: 1425922371000 ms ---
(1.0,4.0)

[Stage 17:=(1 + 0) / 2]
{code}
The "Stage" message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow
Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. 
Once per second, enter a number into the nc terminal (so that the window would slide over several non-empty RDDs). 2-3 such
[jira] [Resolved] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly
[ https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4355. -- Resolution: Fixed Target Version/s: 1.2.0, 1.1.1, 1.0.3 (was: 1.1.1, 1.2.0, 1.0.3) OnlineSummarizer doesn't merge mean correctly - Key: SPARK-4355 URL: https://issues.apache.org/jira/browse/SPARK-4355 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Labels: backport-needed Fix For: 1.0.3, 1.2.0, 1.1.1 It happens when the mean on one side is zero. I will send an PR with some code clean-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly
[ https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4355: - Fix Version/s: 1.0.3 OnlineSummarizer doesn't merge mean correctly - Key: SPARK-4355 URL: https://issues.apache.org/jira/browse/SPARK-4355 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Labels: backport-needed Fix For: 1.1.1, 1.2.0, 1.0.3 It happens when the mean on one side is zero. I will send an PR with some code clean-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Environment: Ubuntu, MacOS. Tried builds with scala 2.11 and 2.10 (for the kafka receiver). Also tried the pre-built spark-1.2.1-bin-hadoop2.4.tgz. The bug reproduces in all cases on 3 different computers we've tried on. was: Ubuntu, MacOS. Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Tried builds with scala 2.11 and 2.10 (for the kafka receiver). Also tried the pre-built spark-1.2.1-bin-hadoop2.4.tgz. The bug reproduces in all cases on 3 different computers we've tried on. Reporter: Platon Potapov Priority: Critical Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window slides over several non-empty RDDs). 2-3 such iterations are enough for the program to stall completely (no new events are processed) with the following output:
{code}
--- Time: 1425922369000 ms ---

--- Time: 1425922370000 ms ---
(1.0,4.0)

--- Time: 1425922371000 ms ---
(1.0,4.0)

[Stage 17:=(1 + 0) / 2]
{code}
The "Stage" message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow
Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-6222: Description: When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] was: When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5157 (PR # [4149|https://github.com/apache/spark/pull/4149]). 
I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6231) Join on two tables (generated from same one) is broken
[ https://issues.apache.org/jira/browse/SPARK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6231: --- Labels: DataFrame (was: ) Join on two tables (generated from same one) is broken -- Key: SPARK-6231 URL: https://issues.apache.org/jira/browse/SPARK-6231 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Michael Armbrust Priority: Critical Labels: DataFrame If the two columns used in the joinExpr come from the same table, they have the same id, so the joinExpr is resolved the wrong way.
{code}
val df = sqlContext.load(path, "parquet")
val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns"))
val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend"))
val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")

scala> rmJoin.explain
== Physical Plan ==
CartesianProduct
 Filter (cust_id#0 = cust_id#0)
  Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS txns#7L]
   Exchange (HashPartitioning [cust_id#0], 200)
    Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS partialSets#25]
     PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at newParquet.scala:542
 Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8]
  Exchange (HashPartitioning [cust_id#17], 200)
   Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS PartialSum#38]
    PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at newParquet.scala:542
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified
[ https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6050: --- Fix Version/s: (was: 1.4.0) Spark on YARN does not work when --executor-cores is specified - Key: SPARK-6050 URL: https://issues.apache.org/jira/browse/SPARK-6050 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Environment: 2.5 based YARN cluster. Reporter: Mridul Muralidharan Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.3.0 There are multiple issues here (which I will detail as comments), but to reproduce: running the following ALWAYS hangs in our cluster with the 1.3 RC ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --executor-cores 8 --num-executors 15 --driver-memory 4g --executor-memory 2g --queue webmap lib/spark-examples*.jar 10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3278) Isotonic regression
[ https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353549#comment-14353549 ] Martin Zapletal commented on SPARK-3278: What particular benchmarks would you like to see? I can do them. Isotonic regression --- Key: SPARK-3278 URL: https://issues.apache.org/jira/browse/SPARK-3278 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Martin Zapletal Fix For: 1.3.0 Add isotonic regression for score calibration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)
[ https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353534#comment-14353534 ] Matthew Farrellee commented on SPARK-5368: -- [~srowen] will you take a look at this? i'm trying to run spark via kubernetes (master pod + master service + slave replicationcontroller), and the service layer is creating a NAT-like environment. Spark should support NAT (via akka improvements) - Key: SPARK-5368 URL: https://issues.apache.org/jira/browse/SPARK-5368 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: jay vyas Fix For: 1.2.2 Spark sets up actors for akka with a set of variables which are defined in the {{AkkaUtils.scala}} class. A snippet:
{noformat}
98 |akka.loggers = [akka.event.slf4j.Slf4jLogger]
99 |akka.stdout-loglevel = ERROR
100 |akka.jvm-exit-on-fatal-error = off
101 |akka.remote.require-cookie = $requireCookie
102 |akka.remote.secure-cookie = $secureCookie
{noformat}
We should allow users to pass in custom settings, for example, so that arbitrary akka modifications can be used at runtime for security, performance, logging, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353358#comment-14353358 ] Tathagata Das commented on SPARK-6222: -- Could you upload the stack traces and logs that show this error? The PR # 4149 is about automatically deleting old log files. Is it an error that WAL files are deleted automatically too early? [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6231) Join on two tables (generated from same one) is broken
Davies Liu created SPARK-6231: - Summary: Join on two tables (generated from same one) is broken Key: SPARK-6231 URL: https://issues.apache.org/jira/browse/SPARK-6231 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Michael Armbrust Priority: Critical If the two columns used in the joinExpr come from the same table, they have the same id, so the joinExpr is resolved the wrong way.
{code}
val df = sqlContext.load(path, "parquet")
val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns"))
val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend"))
val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")

scala> rmJoin.explain
== Physical Plan ==
CartesianProduct
 Filter (cust_id#0 = cust_id#0)
  Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS txns#7L]
   Exchange (HashPartitioning [cust_id#0], 200)
    Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS partialSets#25]
     PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at newParquet.scala:542
 Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8]
  Exchange (HashPartitioning [cust_id#17], 200)
   Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS PartialSum#38]
    PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at newParquet.scala:542
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
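The failure mode above boils down to attribute identity: both sides of the join carry {{cust_id#0}}, so the condition compares an attribute with itself, becomes a tautology, and the equi-join degenerates into a cartesian product with a trivial filter. A toy model of that in plain Scala (hypothetical {{AttrRef}}, not Catalyst's classes):

```scala
object SelfJoinIdClash {
  // Hypothetical stand-in for a Catalyst attribute reference: name plus expression id.
  final case class AttrRef(name: String, exprId: Long)

  // A condition like `left === right` is degenerate if both sides resolve
  // to the very same attribute (same expression id): it is always true.
  def isDegenerate(left: AttrRef, right: AttrRef): Boolean = left == right

  def main(args: Array[String]): Unit = {
    // Both DataFrames derive from the same `df`, so txns("cust_id") and
    // spend("cust_id") resolve to the same reference, cust_id#0.
    println(isDegenerate(AttrRef("cust_id", 0L), AttrRef("cust_id", 0L)))
    // With distinct ids the condition is a real join key: cust_id#0 = cust_id#17.
    println(isDegenerate(AttrRef("cust_id", 0L), AttrRef("cust_id", 17L)))
  }
}
```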
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353347#comment-14353347 ] Xiangrui Meng commented on SPARK-6192: -- [~Manglano] and [~leckie-chn] Thanks for your interest in GSoC with Spark MLlib! As [~MechCoder] mentioned, this JIRA was created for him based on his past experience and recent contributions to Spark MLlib. We tried to set a theme for the project but keep the actual tasks flexible. So it doesn't mean that we are blocking others from implementing these features. You can contribute any of these features at any time. It would be great if you could start with some small features or by helping review others' PRs. We need to know each other before we can plan a GSoC project, but I'm afraid that we may not have enough time to make it happen this year. Anyway, this is a good place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide a save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6232) Spark Streaming: simple application stalls processing
Platon Potapov created SPARK-6232: - Summary: Spark Streaming: simple application stalls processing Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 numbers are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if the window+reduce is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
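For reference, the windowed reduction in the report has a well-defined expected output even without Spark. Below is a plain-Scala sketch of the window(Seconds(4), Seconds(1)) + reduceByKey(_ + _) semantics, using hypothetical sample input (the numbers 1 and 3 typed into nc one second apart); the function name and batch contents are illustrative, not part of any Spark API:

```scala
// Each output batch should be the per-key sum over the last `windowLen` input
// batches -- this is what the stalled job should have kept printing.
def windowedReduceByKey(
    batches: Seq[Seq[(Double, Double)]], // one inner Seq per 1-second batch
    windowLen: Int                       // window length in batches (4 here)
): Seq[Map[Double, Double]] =
  batches.indices.map { i =>
    val window = batches.slice(math.max(0, i - windowLen + 1), i + 1).flatten
    window.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }

// "1" then "3" typed into nc, followed by two empty batches:
val batches = Seq(Seq((1.0, 1.0)), Seq((1.0, 3.0)), Seq.empty, Seq.empty)
val out = windowedReduceByKey(batches, windowLen = 4)
out.foreach(println)  // Map(1.0 -> 1.0), then Map(1.0 -> 4.0) three times
```

Per the bug report, the real job prints (1.0,4.0) for one or two batches and then stalls instead of continuing to emit a result every slide interval.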
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Description: Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage ..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
was: Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage ..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Description: Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage ..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
was: Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage ..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall
[jira] [Commented] (SPARK-6228) Provide SASL support in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353412#comment-14353412 ] Apache Spark commented on SPARK-6228: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/4953 Provide SASL support in network/common module - Key: SPARK-6228 URL: https://issues.apache.org/jira/browse/SPARK-6228 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin Currently, there's support for SASL in network/shuffle, but not in network/common. Moving the SASL code to network/common would enable other applications using that code to also support secure authentication and, later, encryption. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6219) Expand Python lint checks to check for compilation errors
[ https://issues.apache.org/jira/browse/SPARK-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353473#comment-14353473 ] Sean Owen commented on SPARK-6219: -- OK, that's reasonable to make sure that everything gets compiled, not just what the tests cover. If it's fast, I suppose the only cost is complexity, but your changes are actually more about refactoring than the compilation. You would know this script well as you created most of it I think. I suppose it'd be best to get another Python person to look at it but I don't object, in the name of catching more stuff early. Expand Python lint checks to check for compilation errors -- Key: SPARK-6219 URL: https://issues.apache.org/jira/browse/SPARK-6219 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Priority: Minor An easy lint check for Python would be to make sure the stuff at least compiles. That will catch only the most egregious errors, but it should help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Description: Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage ..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if the window+reduce is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
was: Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 numbers are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if the window+reduce is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
[jira] [Commented] (SPARK-6022) GraphX `diff` test incorrectly operating on values (not VertexId's)
[ https://issues.apache.org/jira/browse/SPARK-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353364#comment-14353364 ] Brennon York commented on SPARK-6022: - The test is correct (in what I believe {{diff}} should do). Maybe [~ankurd] can chime in here? You're also correct that the code implementing {{diff}} doesn't currently work properly, which is why I believe this test should correctly assess whether {{diff}} is operating correctly. GraphX `diff` test incorrectly operating on values (not VertexId's) --- Key: SPARK-6022 URL: https://issues.apache.org/jira/browse/SPARK-6022 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York The current GraphX {{diff}} test operates on values rather than the VertexId's and, if {{diff}} were working properly (per [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600]), it should fail this test. The code to test {{diff}} should look like the below, as it correctly generates {{VertexRDD}}'s with different {{VertexId}}'s to {{diff}} against.
{code}
test("diff functionality with small concrete values") {
  withSpark { sc =>
    val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt)))
    // setA := Set((0L, 0), (1L, 1))
    val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt + 2)))
    // setB := Set((1L, 3), (2L, 4))
    val diff = setA.diff(setB)
    assert(diff.collect.toSet == Set((2L, 4)))
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
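The expected result in the test above can be modeled without a SparkContext. Here is a plain-Scala sketch of the semantics the test encodes -- keep the (VertexId, value) pairs of setB whose vertex ID does not appear in setA. This mirrors the test's expectation only; it is not necessarily GraphX's final {{diff}} contract, which is exactly what the comment thread is debating:

```scala
type VertexId = Long

// Same data as the test, as plain Scala sets:
val setA: Set[(VertexId, Int)] = (0L until 2L).map(id => (id, id.toInt)).toSet     // Set((0,0), (1,1))
val setB: Set[(VertexId, Int)] = (1L until 3L).map(id => (id, id.toInt + 2)).toSet // Set((1,3), (2,4))

// Diff keyed on vertex ID: (1L, 3) is dropped because ID 1 already exists in
// setA, even though its value differs.
val idsInA = setA.map(_._1)
val diff = setB.filter { case (id, _) => !idsInA.contains(id) }
println(diff)  // Set((2,4))
```

Note the asymmetry: a value-based set difference would instead keep (1L, 3) as well, which is the distinction the JIRA title is pointing at.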
[jira] [Commented] (SPARK-3278) Isotonic regression
[ https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353419#comment-14353419 ] Xiangrui Meng commented on SPARK-3278: -- I don't know of any. It really depends on how many buckets it outputs. I can imagine problems with 100M buckets. Isotonic regression --- Key: SPARK-3278 URL: https://issues.apache.org/jira/browse/SPARK-3278 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Martin Zapletal Fix For: 1.3.0 Add isotonic regression for score calibration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3477) Clean up code in Yarn Client / ClientBase
[ https://issues.apache.org/jira/browse/SPARK-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353784#comment-14353784 ] Peter Rudenko commented on SPARK-3477: -- +1 to making these classes public again. There's [an article|http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/], and I also have a use case for submitting jobs programmatically. Clean up code in Yarn Client / ClientBase - Key: SPARK-3477 URL: https://issues.apache.org/jira/browse/SPARK-3477 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.2.0 With the addition of new features and support for multiple versions of YARN, the code has become cumbersome and could use some cleanup. We should also add comments and update the code to follow the new style guidelines. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6234) 10% Performance regression with Breeze upgrade
[ https://issues.apache.org/jira/browse/SPARK-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353793#comment-14353793 ] Nishkam Ravi commented on SPARK-6234: - [~mengxr] Variant of org.apache.spark.examples.LocalKMeans which uses Breeze. Input dataset: 20GB, 6-node cluster. 10% Performance regression with Breeze upgrade -- Key: SPARK-6234 URL: https://issues.apache.org/jira/browse/SPARK-6234 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi KMeans regresses by 10% with the Breeze upgrade from 0.10 to 0.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6142) 10-12% Performance regression with finalize
[ https://issues.apache.org/jira/browse/SPARK-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353826#comment-14353826 ] Sean Owen commented on SPARK-6142: -- Is this resolved by reverting those commits then? 10-12% Performance regression with finalize - Key: SPARK-6142 URL: https://issues.apache.org/jira/browse/SPARK-6142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Nishkam Ravi 10-12% performance regression in PageRank (and potentially other workloads) caused due to the use of finalize in ExternalAppendOnlyMap. Introduced by a commit on Feb 19th. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6005) Flaky test: o.a.s.streaming.kafka.DirectKafkaStreamSuite.offset recovery
[ https://issues.apache.org/jira/browse/SPARK-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353838#comment-14353838 ] Tathagata Das commented on SPARK-6005: -- [~c...@koeninger.org] Can you check this out? Flaky test: o.a.s.streaming.kafka.DirectKafkaStreamSuite.offset recovery Key: SPARK-6005 URL: https://issues.apache.org/jira/browse/SPARK-6005 Project: Spark Issue Type: Bug Components: Streaming Reporter: Iulian Dragos Labels: flaky-test, kafka, streaming [Link to failing test on Jenkins|https://ci.typesafe.com/view/Spark/job/spark-nightly-build/lastCompletedBuild/testReport/org.apache.spark.streaming.kafka/DirectKafkaStreamSuite/offset_recovery/] {code} The code passed to eventually never returned normally. Attempted 208 times over 10.00622791 seconds. Last failure message: strings.forall({ ((elem: Any) = DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. {code} {code:title=Stack trace} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 208 times over 10.00622791 seconds. Last failure message: strings.forall({ ((elem: Any) = DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.org$apache$spark$streaming$kafka$DirectKafkaStreamSuite$$anonfun$$sendDataAndWaitForReceive$1(DirectKafkaStreamSuite.scala:225)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply$mcV$sp(DirectKafkaStreamSuite.scala:287)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply(DirectKafkaStreamSuite.scala:211)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply(DirectKafkaStreamSuite.scala:211)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:39)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:39)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353840#comment-14353840 ] Tathagata Das commented on SPARK-6222: -- Which patch fixes the issue? [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker Attachments: SPARK-6122.patch When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6234) 10% Performance regression with Breeze upgrade
[ https://issues.apache.org/jira/browse/SPARK-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353854#comment-14353854 ] Sean Owen commented on SPARK-6234: -- No, the thing that's not important here is the example implementation. It is not an example of using K-means in MLlib, but an example of a completely de novo, separate implementation of K-means that is provided as an example of using *Spark*. I don't know why Breeze or something that uses it would be slower though. The only thing here doing any serious computation is squaredDistance. That did change in 0.11: https://github.com/scalanlp/breeze/commit/5c26a9bceb1fbd621421fa459e1b1202e91f5e9b#diff-e9531f2d5b65b7140b75c0b1c4dab541 If you have the energy, a tightly-focused test case on this method that shows a performance hit would be useful to report against Breeze. I think all in all the positives of 0.11 outweigh negatives, but, this downside was not expected, if it is confirmed. If so it may not only affect this example. 10% Performance regression with Breeze upgrade -- Key: SPARK-6234 URL: https://issues.apache.org/jira/browse/SPARK-6234 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi KMeans regresses by 10% with the Breeze upgrade from 0.10 to 0.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
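Following up on the suggestion of a tightly-focused test case: a Spark-free harness for the suspect operation is easy to sketch in plain Scala. The hand-rolled loop below is only a baseline, not Breeze's implementation; to compare Breeze 0.10 vs 0.11 one would time breeze.linalg.squaredDistance on the same arrays (omitted here to keep the sketch dependency-free):

```scala
// Squared Euclidean distance, written out by hand as a baseline.
def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  var sum = 0.0
  var i = 0
  while (i < a.length) {
    val d = a(i) - b(i)
    sum += d * d
    i += 1
  }
  sum
}

// Crude timing helper; a real report against Breeze should use JMH, or at
// least warm-up iterations so the JIT has settled.
def time[A](body: => A): (A, Long) = {
  val t0 = System.nanoTime()
  val result = body
  (result, System.nanoTime() - t0)
}

val n = 1000
val a = Array.tabulate(n)(_.toDouble)
val b = Array.tabulate(n)(i => (i + 1).toDouble)
val (dist, nanos) = time(squaredDistance(a, b))
println(s"squaredDistance = $dist ($nanos ns)")  // dist == 1000.0: n terms of (-1)^2
```

Timing a single call is noisy; the point of the sketch is only that the workload can be isolated to one method, per the comment above.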
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353867#comment-14353867 ] Hari Shreedharan commented on SPARK-6222: - [~srowen] This patch is actually not intended to fix the issue, since this patch will cause the WAL to not be cleaned up - which is not something we want. This was only intended to help isolate the problem -- from this patch it is clear that we are somehow attempting to clear the WAL data prematurely, causing the regression - when and why I am not yet sure. [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker Attachments: SPARK-6122.patch When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. 
Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2629) Improve performance of DStream.updateStateByKey
[ https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2629: - Summary: Improve performance of DStream.updateStateByKey (was: Improve performance of DStream.updateStateByKey using IndexRDD) Improve performance of DStream.updateStateByKey --- Key: SPARK-2629 URL: https://issues.apache.org/jira/browse/SPARK-2629 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5155) Python API for MQTT streaming
[ https://issues.apache.org/jira/browse/SPARK-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353913#comment-14353913 ] Tathagata Das commented on SPARK-5155: -- This issue is still blocking on us figuring out all the details of python API for Kafka. Python API for MQTT streaming - Key: SPARK-5155 URL: https://issues.apache.org/jira/browse/SPARK-5155 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu Assignee: Prabeesh K Python API for MQTT Utils -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6238) Support shuffle where individual blocks might be 2G
Reynold Xin created SPARK-6238: -- Summary: Support shuffle where individual blocks might be 2G Key: SPARK-6238 URL: https://issues.apache.org/jira/browse/SPARK-6238 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6237) Support network transfer for blocks larger than 2G
Reynold Xin created SPARK-6237: -- Summary: Support network transfer for blocks larger than 2G Key: SPARK-6237 URL: https://issues.apache.org/jira/browse/SPARK-6237 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5155) Python API for MQTT streaming
[ https://issues.apache.org/jira/browse/SPARK-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5155: - Target Version/s: 1.4.0 (was: 1.3.0) Python API for MQTT streaming - Key: SPARK-5155 URL: https://issues.apache.org/jira/browse/SPARK-5155 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu Assignee: Prabeesh K Python API for MQTT Utils -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6236) Support caching blocks larger than 2G
Reynold Xin created SPARK-6236: -- Summary: Support caching blocks larger than 2G Key: SPARK-6236 URL: https://issues.apache.org/jira/browse/SPARK-6236 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Due to the use of java.nio.ByteBuffer, BlockManager does not support blocks larger than 2G. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6190) create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
[ https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353940#comment-14353940 ] Reynold Xin commented on SPARK-6190: Hi [~imranr], As I said earlier, I would advise against attacking the network transfer problem at this point. We don't hear that often from users complaining about the 2G limit, and the complaints about the various issues probably drop by an order of magnitude in the following order: - caching 2g - fetching 2g non shuffle block - fetching 2g shuffle block - uploading 2g I think it'd make sense to solve the caching 2g limit first. It is important to think about the network part, but I would not try to address it here. It is much more complicated to deal with, e.g. transferring very large data in one shot brings all sorts of complicated resource management problems (e.g. large transfers blocking small ones, memory management, allocation...). For caching, I can think of two ways to do this. The first is to have a large byte buffer abstraction that encapsulates multiple, smaller buffers, as proposed here. Another is to assume the block manager can only handle blocks up to 2G, and then have the upper layers handle the chunking and reconnecting. It is not yet clear to me which one is better. While the first approach provides a better, clearer abstraction, the second approach would allow us to cache partial blocks. Do you have any thoughts on this? Now for the large buffer abstraction here -- I'm confused. The proposed design is read-only. How do we even create a buffer? create LargeByteBuffer abstraction for eliminating 2GB limit on blocks -- Key: SPARK-6190 URL: https://issues.apache.org/jira/browse/SPARK-6190 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Imran Rashid Assignee: Imran Rashid Attachments: LargeByteBuffer.pdf A key component in eliminating the 2GB limit on blocks is creating a proper abstraction for storing more than 2GB. 
Currently Spark is limited by a reliance on nio ByteBuffer and netty ByteBuf, both of which are limited at 2GB. This task will introduce the new abstraction and the relevant implementation and utilities, without affecting the existing implementation at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
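[Editor's note] As a rough illustration of the first option discussed above (a read-only view over multiple smaller buffers), here is a minimal Python sketch; the class and method names are invented, and a real implementation would wrap nio ByteBuffers on the JVM. The `allocate` path is one possible answer to the "how do we even create a buffer" question: build the view by chunking a payload into bounded pieces.

```python
class LargeByteBuffer:
    """Read-only view over multiple smaller chunks, sketching how an
    abstraction could sidestep a per-buffer size cap (2GB for ByteBuffer)."""

    def __init__(self, chunks):
        self._chunks = [bytes(c) for c in chunks]
        self._size = sum(len(c) for c in self._chunks)

    @classmethod
    def allocate(cls, data, chunk_size):
        # The 'create' path: split one logical payload into bounded chunks.
        return cls([data[i:i + chunk_size] for i in range(0, len(data), chunk_size)])

    def size(self):
        return self._size

    def get(self, pos, length):
        # A read may span chunk boundaries; stitch the pieces back together.
        out = bytearray()
        for chunk in self._chunks:
            if length <= 0:
                break
            if pos < len(chunk):
                piece = chunk[pos:pos + length]
                out += piece
                length -= len(piece)
                pos = 0
            else:
                pos -= len(chunk)
        return bytes(out)

# Chunks become: b"abcd", b"efgh", b"ij"; reads are addressed logically.
buf = LargeByteBuffer.allocate(b"abcdefghij", chunk_size=4)
```

The second option (keep the BlockManager capped and chunk in the upper layers) would move the `allocate`-style splitting out of the buffer and into the callers.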
[jira] [Commented] (SPARK-6128) Update Spark Streaming Guide for Spark 1.3
[ https://issues.apache.org/jira/browse/SPARK-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353818#comment-14353818 ] Apache Spark commented on SPARK-6128: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/4956 Update Spark Streaming Guide for Spark 1.3 -- Key: SPARK-6128 URL: https://issues.apache.org/jira/browse/SPARK-6128 Project: Spark Issue Type: Improvement Components: Documentation, Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Things to update - New Kafka Direct API - Python Kafka API - Add joins to streaming guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353842#comment-14353842 ] Hari Shreedharan commented on SPARK-6222: - The one on the jira. [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker Attachments: SPARK-6122.patch When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5045) Update FlumePollingReceiver to use updated Receiver API
[ https://issues.apache.org/jira/browse/SPARK-5045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5045: - Target Version/s: (was: 1.3.0) Update FlumePollingReceiver to use updated Receiver API --- Key: SPARK-5045 URL: https://issues.apache.org/jira/browse/SPARK-5045 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5048) Add Flume to the Python Streaming API
[ https://issues.apache.org/jira/browse/SPARK-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5048: - Assignee: Hari Shreedharan Add Flume to the Python Streaming API - Key: SPARK-5048 URL: https://issues.apache.org/jira/browse/SPARK-5048 Project: Spark Issue Type: Improvement Components: PySpark, Streaming Reporter: Tathagata Das Assignee: Hari Shreedharan This is a similar effort as SPARK-5047 is for Kafka, and should take the same approach as it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5048) Add Flume to the Python Streaming API
[ https://issues.apache.org/jira/browse/SPARK-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5048: - Target Version/s: 1.4.0 (was: 1.3.0) Add Flume to the Python Streaming API - Key: SPARK-5048 URL: https://issues.apache.org/jira/browse/SPARK-5048 Project: Spark Issue Type: Improvement Components: PySpark, Streaming Reporter: Tathagata Das This is a similar effort as SPARK-5047 is for Kafka, and should take the same approach as it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5048) Add Flume to the Python Streaming API
[ https://issues.apache.org/jira/browse/SPARK-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353918#comment-14353918 ] Tathagata Das commented on SPARK-5048: -- [~hshreedharan] Can you take a crack at this? Add Flume to the Python Streaming API - Key: SPARK-5048 URL: https://issues.apache.org/jira/browse/SPARK-5048 Project: Spark Issue Type: Improvement Components: PySpark, Streaming Reporter: Tathagata Das Assignee: Hari Shreedharan This is a similar effort as SPARK-5047 is for Kafka, and should take the same approach as it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-5682: Summary: Add encrypted shuffle in spark (was: Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle) Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the handling of shuffle data safer. This feature is necessary in Spark. We reuse the Hadoop encrypted shuffle feature in Spark, and because UGI credential info is necessary for encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6234) 10% Performance regression with Breeze upgrade
[ https://issues.apache.org/jira/browse/SPARK-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353843#comment-14353843 ] Nishkam Ravi commented on SPARK-6234: - Are we saying that Breeze's performance is unimportant, or are we saying that the performance of K-Means-with-Breeze is unimportant? If it's the former, we can promptly close this JIRA. If it's the latter, K-Means should be perceived as a random piece of code that exposes a performance bug in Breeze. 10% Performance regression with Breeze upgrade -- Key: SPARK-6234 URL: https://issues.apache.org/jira/browse/SPARK-6234 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi KMeans regresses by 10% with the Breeze upgrade from 0.10 to 0.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353858#comment-14353858 ] Sean Owen commented on SPARK-6222: -- [~hshreedharan] you can make a [WIP] pull request instead of a patch. It's easier to review that way. [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker Attachments: SPARK-6122.patch When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5252) Streaming StatefulNetworkWordCount example hangs
[ https://issues.apache.org/jira/browse/SPARK-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353857#comment-14353857 ] Tathagata Das commented on SPARK-5252: -- [~LutzBuech] Can you try out the latest master and see if the problem still persists? This should have been fixed. Streaming StatefulNetworkWordCount example hangs Key: SPARK-5252 URL: https://issues.apache.org/jira/browse/SPARK-5252 Project: Spark Issue Type: Bug Components: Examples, PySpark, Streaming Affects Versions: 1.2.0 Environment: Ubuntu Linux Reporter: Lutz Buech Attachments: debug.txt Running the stateful network word count example in Python (on one local node): https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py At the beginning, when no data is streamed, empty status outputs are generated, only decorated by the current Time, e.g.: --- Time: 2015-01-14 17:58:20 --- --- Time: 2015-01-14 17:58:21 --- As soon as I stream some data via netcat, no new status updates will show. Instead, one line saying [Stage number: (2 + 0) / 3] where number is some integer number, e.g. 132. There is no further output on stdout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
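[Editor's note] For context, the heart of that example is the per-key state update passed to updateStateByKey. The sketch below is a plain-Python rendering of its shape, detached from Spark; `update_func` and the driver loop are illustrative (the running count is None the first time a key is seen):

```python
def update_func(new_values, running_count):
    # updateStateByKey calls this once per key per batch: fold the batch's
    # new counts into the running total; None means the key is new.
    return sum(new_values) + (running_count or 0)

# Simulate two batches of counts arriving for one key, e.g. the word "spark":
state = None
for batch in ([1, 1], [1]):
    state = update_func(batch, state)
print(state)  # -> 3
```

In the real example this function is applied to a DStream of (word, 1) pairs, and the hang described above occurs regardless of the update logic.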
[jira] [Commented] (SPARK-6234) 10% Performance regression with Breeze upgrade
[ https://issues.apache.org/jira/browse/SPARK-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353877#comment-14353877 ] Nishkam Ravi commented on SPARK-6234: - Right. This particular implementation can be thought of as a unit test case for Breeze (or squaredDistance if you will) for the purpose of this discussion. As mentioned in the PR, we seem to have three options: 1. Absorb the perf regression 2. Find the problem in Breeze and fix it while retaining 0.11 in Spark 3. Revert back to 0.10 (potentially open a JIRA for Breeze and upgrade to 0.11 when fixed) Assuming that we can/will report this problem against Breeze, for upstream Spark which of these three options do we prefer? For downstream, I'd recommend 3. 10% Performance regression with Breeze upgrade -- Key: SPARK-6234 URL: https://issues.apache.org/jira/browse/SPARK-6234 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi KMeans regresses by 10% with the Breeze upgrade from 0.10 to 0.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5042) Updated Receiver API to make it easier to write reliable receivers that ack source
[ https://issues.apache.org/jira/browse/SPARK-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5042: - Target Version/s: (was: 1.4.0) Updated Receiver API to make it easier to write reliable receivers that ack source -- Key: SPARK-5042 URL: https://issues.apache.org/jira/browse/SPARK-5042 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Receivers in Spark Streaming receive data from different sources and push them into Spark’s block manager. However, the received records must be chunked into blocks before being pushed into the BlockManager. Related to this, the Receiver API provides two kinds of store() - 1. store(single record) - The receiver implementation submits one record at a time and the system takes care of dividing the records into right-sized blocks and limiting the ingestion rates. In the future, it should also be able to do automatic rate / flow control. However, there is no feedback to the receiver on when blocks are formed, and thus no way to ensure reliability guarantees. Overall, receivers using this are easy to implement. 2. store(multiple records) - The receiver submits multiple records and that forms the blocks that are stored in the block manager. The receiver implementation has full control over block generation, which allows the receiver to acknowledge the source when blocks have been reliably received by the BlockManager and/or WriteAheadLog. However, the implementation of the receivers will not get automatic block sizing and rate control; the developer will have to take care of that. All this adds to the complexity of the receiver implementation. So, to summarize, option (2) has the advantage of full control over block generation, but the users have to deal with the complexity of generating blocks of the right block size and rate control. 
So we want to update this API such that it becomes easier for developers to achieve reliable receiving of records without sacrificing automatic block sizing and rate control. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
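[Editor's note] The middle ground this update aims for can be pictured as a block generator that keeps the one-record-at-a-time simplicity of (1) while signalling when each block is formed, so the receiver can acknowledge its source as in (2). The sketch below is hypothetical Python, not the actual Receiver API; the class and callback names are invented for illustration:

```python
class BlockGenerator:
    """Sketch: chunk record-at-a-time store() calls into fixed-size blocks,
    invoking a callback when each block is formed so the receiver can ack."""

    def __init__(self, block_size, on_block_ready):
        self.block_size = block_size
        # Callback, e.g. push the block to the BlockManager, then ack the source.
        self.on_block_ready = on_block_ready
        self._current = []

    def store(self, record):
        # The receiver keeps the simple single-record interface...
        self._current.append(record)
        if len(self._current) >= self.block_size:
            self.flush()

    def flush(self):
        # ...while the system decides block boundaries and reports them.
        if self._current:
            self.on_block_ready(list(self._current))
            self._current.clear()

blocks = []
gen = BlockGenerator(block_size=3, on_block_ready=blocks.append)
for r in range(7):
    gen.store(r)
gen.flush()  # emit the partial final block
# blocks == [[0, 1, 2], [3, 4, 5], [6]]
```

A real implementation would also flush on a timer and apply rate limiting, which is exactly the automatic behavior option (2) forgoes today.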
[jira] [Commented] (SPARK-2629) Improve performance of DStream.updateStateByKey
[ https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353906#comment-14353906 ] Tathagata Das commented on SPARK-2629: -- Since IndexRDD is not supposed to be added to the core Spark API, we are going to investigate other ways of improving the performance. Improve performance of DStream.updateStateByKey --- Key: SPARK-2629 URL: https://issues.apache.org/jira/browse/SPARK-2629 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5205) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI
[ https://issues.apache.org/jira/browse/SPARK-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5205: - Target Version/s: 1.4.0, 1.3.1 (was: 1.3.0, 1.2.1) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI -- Key: SPARK-5205 URL: https://issues.apache.org/jira/browse/SPARK-5205 Project: Spark Issue Type: Bug Components: Streaming Reporter: uncleGen The kill link is used to kill a stage in a job. It works for all kinds of Spark jobs except Spark Streaming. To be specific, we can only kill the stage that runs the Receiver, but not kill the Receivers themselves. The stage can be killed and cleaned from the UI, but the receivers are still alive and receiving data. I think this does not fit with common sense. IMHO, killing the receiver stage should mean killing the receivers and stopping data reception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5046) Update KinesisReceiver to use updated Receiver API
[ https://issues.apache.org/jira/browse/SPARK-5046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5046: - Target Version/s: 1.4.0 (was: 1.3.0) Update KinesisReceiver to use updated Receiver API -- Key: SPARK-5046 URL: https://issues.apache.org/jira/browse/SPARK-5046 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Currently the KinesisReceiver is not reliable as it does not correctly acknowledge the source. This task is to update the receiver to use the updated Receiver API in SPARK-5042 and implement a reliable receiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org