[jira] [Updated] (SPARK-10699) Support checkpointInterval can be disabled

2015-09-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10699:
--
Shepherd: Joseph K. Bradley
Assignee: Yanbo Liang
Target Version/s: 1.6.0

> Support checkpointInterval can be disabled
> --
>
> Key: SPARK-10699
> URL: https://issues.apache.org/jira/browse/SPARK-10699
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> Currently users can set checkpointInterval to specify how often the cache 
> should be checkpointed. But we also need a way for users to disable 
> checkpointing entirely.
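A minimal sketch of the requested usage, assuming a sentinel such as -1 would mean "checkpointing disabled" (the concrete convention is left to this ticket; today's param validation may well reject non-positive values, which is exactly what would need to change):

{code}
// Hedged sketch only: assumes checkpointInterval = -1 (or any value < 1) means "disabled".
import org.apache.spark.ml.recommendation.ALS

val als = new ALS().setCheckpointInterval(10)             // checkpoint every 10 iterations
val alsNoCheckpoint = new ALS().setCheckpointInterval(-1) // proposed: never checkpoint
{code}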






[jira] [Commented] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Bijay Singh Bisht (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903092#comment-14903092
 ] 

Bijay Singh Bisht commented on SPARK-10732:
---

I think if a stream does not support replay, the expected behavior should be 
that it returns no data for a period in the past. That should not limit the API 
and stop one from using replay when the upstream can store data and provide 
data from a period in the past.

As far as Kafka is concerned, it does seem to have an API to get the offsets 
before a given timestamp. I am no expert in Kafka, but I assume it would be a 
bug if it did not always return the same offsets for the same timestamp as long 
as those offsets are available. Please correct me if my understanding of the 
API or its usage in this particular context is off the mark, because that would 
also make [SPARK-10734] invalid.
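For reference, a hedged sketch of how a replay-capable source can already be started from explicit offsets with the direct Kafka API in Spark 1.5 (broker, topic, and offset values are illustrative; mapping a timestamp to offsets, e.g. via Kafka's offsets-before-timestamp request, is assumed to happen elsewhere):

{code}
// Hedged sketch, assuming the Spark 1.5 spark-streaming-kafka API and an ambient SparkContext `sc`.
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// Offsets resolved from a timestamp elsewhere -- purely illustrative values.
val fromOffsets = Map(TopicAndPartition("events", 0) -> 12345L)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
{code}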

> Starting spark streaming from a specific point in time.
> ---
>
> Key: SPARK-10732
> URL: https://issues.apache.org/jira/browse/SPARK-10732
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.0
>Reporter: Bijay Singh Bisht
>
> Currently, spark streaming either starts from current time or from the latest 
> checkpoint. It would be extremely useful to start from any arbitrary point. 
> This would be useful in replay scenarios or in running regression tests.






[jira] [Updated] (SPARK-10750) ML Param validate should print better error information

2015-09-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10750:
--
Shepherd: Joseph K. Bradley
Assignee: Yanbo Liang
Target Version/s: 1.6.0

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10750
> URL: https://issues.apache.org/jira/browse/SPARK-10750
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> Currently, when you set an illegal value for params of array type (such as 
> IntArrayParam, DoubleArrayParam, StringArrayParam), it will throw an 
> IllegalArgumentException, but with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It will throw IllegalArgumentException as:
> vectorSlicer_b3b4d1a10f43 parameter names given invalid value 
> [Ljava.lang.String;@798256c5.
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot understand which params were set incorrectly.






[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-22 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903180#comment-14903180
 ] 

Seth Hendrickson commented on SPARK-7129:
-

I had some time to give this topic some thought and started constructing [a 
document with notes on a generic boosting 
architecture|https://docs.google.com/document/d/1Zeoj99gwiJBF0JWL8170KicVB0U5xtUOk6VUeFj0Nz8/edit]
 and some of the concerns it raises. I don't think this is acceptable as a 
design doc because it's a bit wordy and it doesn't make an effort to follow the 
structure of other design docs, but hopefully [~meihuawu] can find something 
useful in it.

I found the [R mboost 
vignette|https://cran.r-project.org/web/packages/mboost/vignettes/mboost_tutorial.pdf]
 to be a good starting point. I'm still learning the ML package, but I'd love 
to be involved in the discussion and potentially take on some of the code tasks 
once we get there.

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.
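As a purely hypothetical illustration of the extensibility concern (none of these names come from the linked notes or from spark.ml), a generic booster could be parameterized by a pluggable loss, with multiclass/multilabel variants layered on top:

{code}
// Hypothetical sketch only -- illustrative names, not a proposed spark.ml API.
trait Loss {
  def loss(label: Double, prediction: Double): Double
  def gradient(label: Double, prediction: Double): Double
}

// Logistic (LogitBoost-style) loss for labels in {-1, +1}.
object LogisticLoss extends Loss {
  override def loss(label: Double, prediction: Double): Double =
    math.log1p(math.exp(-label * prediction))
  override def gradient(label: Double, prediction: Double): Double =
    -label / (1.0 + math.exp(label * prediction))
}

// A generic boosting estimator would carry the base learner, the loss, and the
// usual knobs; tree-specific ensembles would remain separate, as noted above.
case class BoostingParams(numIterations: Int, stepSize: Double, loss: Loss)
{code}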






[jira] [Resolved] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType

2015-09-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10485.
--
Resolution: Fixed

I tested on 1.5 and it seems fixed to me. Please reopen if you have a 
reproduction that still fails.

> IF expression is not correctly resolved when one of the options have NullType
> -
>
> Key: SPARK-10485
> URL: https://issues.apache.org/jira/browse/SPARK-10485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Antonio Jesus Navarro
>
> If we have this query:
> {code}
> SELECT IF(column > 1, 1, NULL) FROM T1
> {code}
> On Spark 1.4.1 we have this:
> {code}
> override lazy val resolved = childrenResolved && trueValue.dataType == 
> falseValue.dataType
> {code}
> So if one of the types is NullType, the if expression is not resolved.
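A minimal reproduction sketch (spark-shell style, assuming the ambient sqlContext; the column is backtick-quoted only to avoid any keyword clash) that can be used to confirm the behavior Michael describes above:

{code}
// Hedged reproduction sketch: on affected versions the IF with a NULL branch
// failed to resolve; on 1.5 the query is expected to run.
val df = sqlContext.createDataFrame(Seq(Tuple1(0), Tuple1(2))).toDF("column")
df.registerTempTable("T1")
sqlContext.sql("SELECT IF(`column` > 1, 1, NULL) FROM T1").show()
{code}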






[jira] [Comment Edited] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903231#comment-14903231
 ] 

Cody Koeninger edited comment on SPARK-10732 at 9/22/15 7:02 PM:
-

Yeah, even if that gets implemented it will likely be at 1 minute granularity, 
which might be ok for some failure recovery situations but is unlikely to work 
for your SPARK-10734 ticket


was (Author: c...@koeninger.org):
Yeah, even if that gets implemented it will likely be at 1 minute granularity, 
which might be ok for some failure recovery situations but is unlikely to work 
for your SPARK-1074 ticket

> Starting spark streaming from a specific point in time.
> ---
>
> Key: SPARK-10732
> URL: https://issues.apache.org/jira/browse/SPARK-10732
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.0
>Reporter: Bijay Singh Bisht
>
> Currently, spark streaming either starts from current time or from the latest 
> checkpoint. It would be extremely useful to start from any arbitrary point. 
> This would be useful in replay scenarios or in running regression tests.






[jira] [Commented] (SPARK-10759) Missing Python code example in ML Programming guide

2015-09-22 Thread Lauren Moos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903263#comment-14903263
 ] 

Lauren Moos commented on SPARK-10759:
-

I can work on this 

> Missing Python code example in ML Programming guide
> ---
>
> Key: SPARK-10759
> URL: https://issues.apache.org/jira/browse/SPARK-10759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Raela Wang
>Priority: Minor
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split






[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10740:
-
Priority: Blocker  (was: Major)

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side, so the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and we have a row that is present on both 
> sides of an Intersect. Once we push down this condition, the probability of 
> accepting this row drops to 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down a 
> non-deterministic condition, it is possible that this row is rejected on 
> one side, and then we output a row that should not be part of the result.
> We should only push down deterministic projections to Union.
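A hedged DataFrame illustration of the Union case (spark-shell style, Spark 1.5-era API assumed; the seed and ranges are illustrative):

{code}
// A non-deterministic predicate behaves differently above and below a union,
// because each plan re-draws its own random sequence.
import org.apache.spark.sql.functions.rand

val left  = sqlContext.range(1, 6)
val right = sqlContext.range(1, 6)

// Filter applied after the union: one random draw per output row.
val filteredAboveUnion = left.unionAll(right).filter(rand(42) > 0.5)

// "Pushed down" version: each side draws independently, so the two plans are
// only equivalent when the predicate is deterministic.
val filteredBelowUnion = left.filter(rand(42) > 0.5).unionAll(right.filter(rand(42) > 0.5))
{code}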






[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10740:
-
Target Version/s: 1.6.0, 1.5.1

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side, so the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and we have a row that is present on both 
> sides of an Intersect. Once we push down this condition, the probability of 
> accepting this row drops to 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down a 
> non-deterministic condition, it is possible that this row is rejected on 
> one side, and then we output a row that should not be part of the result.
> We should only push down deterministic projections to Union.






[jira] [Updated] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10740:
-
Assignee: Wenchen Fan

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0, 1.5.1
>
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side, so the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and we have a row that is present on both 
> sides of an Intersect. Once we push down this condition, the probability of 
> accepting this row drops to 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down a 
> non-deterministic condition, it is possible that this row is rejected on 
> one side, and then we output a row that should not be part of the result.
> We should only push down deterministic projections to Union.






[jira] [Resolved] (SPARK-10740) handle nondeterministic expressions correctly for set operations

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10740.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8858
[https://github.com/apache/spark/pull/8858]

> handle nondeterministic expressions correctly for set operations
> 
>
> Key: SPARK-10740
> URL: https://issues.apache.org/jira/browse/SPARK-10740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0, 1.5.1
>
>
> We should only push down deterministic filter conditions to set operators.
> For Union, let's say we apply a non-deterministic filter on 1...5 union 1...5, 
> and we may get 1,3 for the left side and 2,4 for the right side, so the 
> result should be 1,3,2,4. If we push down this filter, we get 1,3 for both 
> sides (we create a new random object with the same seed on each side) and the 
> result would be 1,3,1,3.
> For Intersect, let's say there is a non-deterministic condition with a 0.5 
> probability of accepting a row, and we have a row that is present on both 
> sides of an Intersect. Once we push down this condition, the probability of 
> accepting this row drops to 0.25.
> For Except, let's say there is a row that is present on both sides of an 
> Except. This row should not be in the final output. However, if we push down a 
> non-deterministic condition, it is possible that this row is rejected on 
> one side, and then we output a row that should not be part of the result.
> We should only push down deterministic projections to Union.






[jira] [Commented] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-09-22 Thread Chris Heller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903260#comment-14903260
 ] 

Chris Heller commented on SPARK-9442:
-

Curious whether the issue seen here was with a Parquet file created with a small 
block size? I just ran into a similar case with a nested schema, but had no 
problems with most of the files, except for a larger one -- all were written 
with a block size of 16MB.

> java.lang.ArithmeticException: / by zero when reading Parquet
> -
>
> Key: SPARK-9442
> URL: https://issues.apache.org/jira/browse/SPARK-9442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: DB Tsai
>
> I am counting how many records in my nested parquet file with this schema,
> {code}
> scala> u1aTesting.printSchema
> root
>  |-- profileId: long (nullable = true)
>  |-- country: string (nullable = true)
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- videoId: long (nullable = true)
>  |||-- date: long (nullable = true)
>  |||-- label: double (nullable = true)
>  |||-- weight: double (nullable = true)
>  |||-- features: vector (nullable = true)
> {code}
> and the number of the records in the nested data array is around 10k, and 
> each of the parquet file is around 600MB. The total size is around 120GB. 
> I am doing a simple count
> {code}
> scala> u1aTesting.count
> parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
> file 
> hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>   at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArithmeticException: / by zero
>   at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
>   ... 21 more
> {code}
> BTW, not all the tasks fail; some of them are successful. 
> Another note: by explicitly looping through the data to count, it works.
> {code}
> sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 
> 1L).reduce((x, y) => x + y) 
> {code}
> I think maybe some metadata in the Parquet files is corrupted. 






[jira] [Commented] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Bijay Singh Bisht (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903183#comment-14903183
 ] 

Bijay Singh Bisht commented on SPARK-10732:
---

I get it. Apparently there is a discussion about adding explicit and granular 
timestamp-based support: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index#KIP-33-Addatimebasedlogindex-Usecasediscussion.

> Starting spark streaming from a specific point in time.
> ---
>
> Key: SPARK-10732
> URL: https://issues.apache.org/jira/browse/SPARK-10732
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.0
>Reporter: Bijay Singh Bisht
>
> Currently, spark streaming either starts from current time or from the latest 
> checkpoint. It would be extremely useful to start from any arbitrary point. 
> This would be useful in replay scenarios or in running regression tests.






[jira] [Resolved] (SPARK-10704) Rename HashShufflereader to BlockStoreShuffleReader

2015-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10704.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Rename HashShufflereader to BlockStoreShuffleReader
> ---
>
> Key: SPARK-10704
> URL: https://issues.apache.org/jira/browse/SPARK-10704
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> The current shuffle code has an interface named ShuffleReader with only one 
> implementation, HashShuffleReader. This naming is confusing, since the same 
> read path code is used for both sort- and hash-based shuffle. -We should 
> consolidate these classes.- We should rename HashShuffleReader.
> --In addition, there are aspects of ShuffleManager.getReader()'s API which 
> don't make a lot of sense: it exposes the ability to request a contiguous 
> range of shuffle partitions, but this feature isn't supported by any 
> ShuffleReader implementations and isn't used anywhere in the existing code. 
> We should clean this up, too.--






[jira] [Commented] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903231#comment-14903231
 ] 

Cody Koeninger commented on SPARK-10732:


Yeah, even if that gets implemented it will likely be at 1 minute granularity, 
which might be ok for some failure recovery situations but is unlikely to work 
for your SPARK-1074 ticket

> Starting spark streaming from a specific point in time.
> ---
>
> Key: SPARK-10732
> URL: https://issues.apache.org/jira/browse/SPARK-10732
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.0
>Reporter: Bijay Singh Bisht
>
> Currently, spark streaming either starts from current time or from the latest 
> checkpoint. It would be extremely useful to start from any arbitrary point. 
> This would be useful in replay scenarios or in running regression tests.






[jira] [Created] (SPARK-10749) Support multiple roles with Spark Mesos dispatcher

2015-09-22 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-10749:


 Summary: Support multiple roles with Spark Mesos dispatcher
 Key: SPARK-10749
 URL: https://issues.apache.org/jira/browse/SPARK-10749
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen


Although you can currently set the framework role of the Mesos dispatcher, it 
doesn't correctly use the offers given to it.

It should follow how the coarse-grained and fine-grained schedulers work and use 
offers from multiple roles.






[jira] [Created] (SPARK-10748) Log error instead of crashing Spark Mesos dispatcher when a job is misconfigured

2015-09-22 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-10748:


 Summary: Log error instead of crashing Spark Mesos dispatcher when 
a job is misconfigured
 Key: SPARK-10748
 URL: https://issues.apache.org/jira/browse/SPARK-10748
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Timothy Chen


Currently, when the dispatcher is submitting a new driver, it simply throws a 
SparkException when a necessary configuration is not set. We should log the error 
and keep the dispatcher running instead of crashing.






[jira] [Resolved] (SPARK-10419) Add JDBC dialect for Microsoft SQL Server

2015-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10419.
-
   Resolution: Fixed
 Assignee: Ewan Leith
Fix Version/s: 1.6.0

> Add JDBC dialect for Microsoft SQL Server
> -
>
> Key: SPARK-10419
> URL: https://issues.apache.org/jira/browse/SPARK-10419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ewan Leith
>Assignee: Ewan Leith
>Priority: Minor
> Fix For: 1.6.0
>
>
> When running JDBC connections against Microsoft SQL Server database tables, if a 
> table contains a datetimeoffset column type, the following error is received:
> {code}
> sqlContext.read.jdbc("jdbc:sqlserver://127.0.0.1:1433;DatabaseName=testdb", 
> "sampletable", prop)
> java.sql.SQLException: Unsupported type -155
> at 
> org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)
> at 
> org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:137)
> at 
> org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:136)
> at 
> org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:128)
> at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200)
> at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130)
> {code}
> Based on the JdbcDialect code for DB2 and the Microsoft SQL Server 
> documentation, we should probably treat datetimeoffset types as Strings: 
> https://technet.microsoft.com/en-us/library/bb630289%28v=sql.105%29.aspx
> We've created a small addition to JdbcDialects.scala to do this conversion, 
> and I'll create a pull request for it.
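A hedged sketch of the kind of dialect addition described (the actual pull request may differ in detail; -155 is the vendor type code from the error above):

{code}
// Hedged sketch only: map SQL Server's datetimeoffset to a Catalyst StringType.
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

object MsSqlServerDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (typeName.contains("datetimeoffset")) Some(StringType) else None
}

JdbcDialects.registerDialect(MsSqlServerDialectSketch)
{code}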






[jira] [Updated] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping

2015-09-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10458:
--
Assignee: Madhusudanan Kandasamy

> Would like to know if a given Spark Context is stopped or currently stopping
> 
>
> Key: SPARK-10458
> URL: https://issues.apache.org/jira/browse/SPARK-10458
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matt Cheah
>Assignee: Madhusudanan Kandasamy
>Priority: Minor
> Fix For: 1.6.0
>
>
> I ran into a case where a thread stopped a Spark Context, specifically when I 
> hit the "kill" link from the Spark standalone UI. There was no real way for 
> another thread to know that the context had stopped so that it could have 
> handled that accordingly.
> Checking that the SparkEnv is null is one way, but that doesn't handle the 
> case where the context is in the midst of stopping, and stopping the context 
> may actually not be instantaneous - in my case, for some reason, the 
> DAGScheduler was taking a non-trivial amount of time to stop.
> Implementation-wise, I'm more or less requesting that the boolean value returned 
> from SparkContext.stopped.get() be visible in some way. As long as we 
> return the value and not the AtomicBoolean itself (we wouldn't want anyone 
> to be setting this, after all!) it would help client applications check the 
> context's liveness.
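A hedged sketch of the requested check (the accessor name here is an assumption about what a fix might expose, not a confirmed API):

{code}
// Hedged sketch, spark-shell style (assumes an ambient SparkContext `sc`).
if (!sc.isStopped) {          // hypothetical accessor wrapping SparkContext.stopped.get()
  sc.parallelize(1 to 100).count()
} else {
  // Recreate the context or fail fast instead of submitting work to a dead context.
}
{code}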






[jira] [Created] (SPARK-10750) ML Param validate should print better error information

2015-09-22 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10750:
---

 Summary: ML Param validate should print better error information
 Key: SPARK-10750
 URL: https://issues.apache.org/jira/browse/SPARK-10750
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Yanbo Liang
Priority: Minor


Currently, when you set an illegal value for params of array type (such as 
IntArrayParam, DoubleArrayParam, StringArrayParam), it will throw an 
IllegalArgumentException, but with incomprehensible error information.
For example:

val vectorSlicer = new 
VectorSlicer().setInputCol("features").setOutputCol("result")
vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))

It will throw IllegalArgumentException as:
vectorSlicer_b3b4d1a10f43 parameter names given invalid value 
[Ljava.lang.String;@798256c5.
java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
given invalid value [Ljava.lang.String;@798256c5.

Users cannot understand which params were set incorrectly.






[jira] [Commented] (SPARK-10750) ML Param validate should print better error information

2015-09-22 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902029#comment-14902029
 ] 

Yanbo Liang commented on SPARK-10750:
-

This is because Param.validate(value: T) uses value.toString in the error 
message, which does not distinguish array values from other values.
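A hedged illustration of the formatting problem (plain Scala, no Spark required):

{code}
val names = Array("f1", "f4", "f1")
println(names.toString)                       // e.g. [Ljava.lang.String;@798256c5 -- unreadable
println(names.mkString("Array(", ", ", ")"))  // Array(f1, f4, f1) -- what the message should show
{code}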

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10750
> URL: https://issues.apache.org/jira/browse/SPARK-10750
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently, when you set an illegal value for params of array type (such as 
> IntArrayParam, DoubleArrayParam, StringArrayParam), it will throw an 
> IllegalArgumentException, but with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It will throw IllegalArgumentException as:
> vectorSlicer_b3b4d1a10f43 parameter names given invalid value 
> [Ljava.lang.String;@798256c5.
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot understand which params were set incorrectly.






[jira] [Commented] (SPARK-8386) DataFrame and JDBC regression

2015-09-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902042#comment-14902042
 ] 

Reynold Xin commented on SPARK-8386:


[~viirya] do you have time to take a look?

> DataFrame and JDBC regression
> -
>
> Key: SPARK-8386
> URL: https://issues.apache.org/jira/browse/SPARK-8386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: RHEL 7.1
>Reporter: Peter Haumer
>Priority: Critical
>
> I have an ETL app that appends to a JDBC table new results found at each run. 
>  In 1.3.1 I did this:
> testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false);
> When I do this now in 1.4 it complains that the "object" 'TABLE_NAME' already 
> exists. I get this even if I switch the overwrite to true.  I also tried this 
> now:
> testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, 
> connectionProperties);
> getting the same error. It works the first time, creating the new 
> table and adding data successfully. But running it a second time, it (the 
> JDBC driver) tells me that the table already exists. Even 
> SaveMode.Overwrite gives me the same error.






[jira] [Updated] (SPARK-10649) Streaming jobs unexpectedly inherits job group, job descriptions from context starting thread

2015-09-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10649:
--
Fix Version/s: 1.5.1

> Streaming jobs unexpectedly inherits job group, job descriptions from context 
> starting thread
> -
>
> Key: SPARK-10649
> URL: https://issues.apache.org/jira/browse/SPARK-10649
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.6.0, 1.5.1
>
>
> The job group and job description information is passed through thread-local 
> properties and gets inherited by child threads. In the case of Spark 
> Streaming, the streaming jobs inherit these properties from the thread that 
> called streamingContext.start(). This may not make sense. 
> 1. Job group: This is mainly used for cancelling a group of jobs together. It 
> does not make sense to cancel streaming jobs like this, as the effect will be 
> unpredictable. And it's not a valid use case anyway; to cancel a streaming 
> context, call streamingContext.stop().
> 2. Job description: This is used to pass on nice text descriptions for jobs 
> to show up in the UI. The job description of the thread that calls 
> streamingContext.start() is not useful for all the streaming jobs, as it does 
> not make sense for all of the streaming jobs to have the same description, 
> and the description may or may not be related to streaming.
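A hedged illustration of the inheritance described above (spark-shell style; assumes an ambient SparkContext `sc` and a StreamingContext `ssc`):

{code}
sc.setJobGroup("bootstrap", "one-off setup jobs")
ssc.start()
// Before the fix, jobs submitted by the streaming scheduler inherit the
// "bootstrap" group and description through thread-local properties, so
// sc.cancelJobGroup("bootstrap") could also cancel streaming batches.
{code}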






[jira] [Updated] (SPARK-10695) spark.mesos.constraints documentation uses "=" to separate value instead ":" as parser and mesos expects.

2015-09-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10695:
--
Assignee: Akash Mishra

> spark.mesos.constraints documentation uses "=" to separate value instead ":" 
> as parser and mesos expects.
> -
>
> Key: SPARK-10695
> URL: https://issues.apache.org/jira/browse/SPARK-10695
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Mesos
>Affects Versions: 1.5.0
>Reporter: Akash Mishra
>Assignee: Akash Mishra
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> Incorrect documentation, which leads to an exception when using a constraints 
> value as specified in the documentation.
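A hedged example of the separator the parser and Mesos expect -- ":" between an attribute and its value, ";" between constraints (the attribute names and values here are illustrative):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.constraints", "os:centos7;zone:us-east-1a")
{code}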






[jira] [Resolved] (SPARK-10695) spark.mesos.constraints documentation uses "=" to separate value instead ":" as parser and mesos expects.

2015-09-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10695.
---
  Resolution: Fixed
   Fix Version/s: 1.6.0
Target Version/s: 1.6.0, 1.5.1  (was: 1.5.1)

> spark.mesos.constraints documentation uses "=" to separate value instead ":" 
> as parser and mesos expects.
> -
>
> Key: SPARK-10695
> URL: https://issues.apache.org/jira/browse/SPARK-10695
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Akash Mishra
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> Incorrect documentation, which leads to an exception when using a constraints 
> value as specified in the documentation.






[jira] [Resolved] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping

2015-09-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10458.
---
  Resolution: Fixed
   Fix Version/s: 1.6.0
Target Version/s: 1.6.0

> Would like to know if a given Spark Context is stopped or currently stopping
> 
>
> Key: SPARK-10458
> URL: https://issues.apache.org/jira/browse/SPARK-10458
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matt Cheah
>Priority: Minor
> Fix For: 1.6.0
>
>
> I ran into a case where a thread stopped a Spark Context, specifically when I 
> hit the "kill" link from the Spark standalone UI. There was no real way for 
> another thread to know that the context had stopped so that it could have 
> handled that accordingly.
> Checking that the SparkEnv is null is one way, but that doesn't handle the 
> case where the context is in the midst of stopping, and stopping the context 
> may actually not be instantaneous - in my case, for some reason, the 
> DAGScheduler was taking a non-trivial amount of time to stop.
> Implementation-wise, I'm more or less requesting that the boolean value returned 
> from SparkContext.stopped.get() be visible in some way. As long as we 
> return the value and not the AtomicBoolean itself (we wouldn't want anyone 
> to be setting this, after all!) it would help client applications check the 
> context's liveness.






[jira] [Updated] (SPARK-10695) spark.mesos.constraints documentation uses "=" to separate value instead ":" as parser and mesos expects.

2015-09-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10695:
--
Component/s: Mesos

> spark.mesos.constraints documentation uses "=" to separate value instead ":" 
> as parser and mesos expects.
> -
>
> Key: SPARK-10695
> URL: https://issues.apache.org/jira/browse/SPARK-10695
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Mesos
>Affects Versions: 1.5.0
>Reporter: Akash Mishra
>Assignee: Akash Mishra
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> Incorrect documentation, which leads to an exception when using a constraints 
> value as specified in the documentation.






[jira] [Resolved] (SPARK-9821) pyspark reduceByKey should allow a custom partitioner

2015-09-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9821.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8569
[https://github.com/apache/spark/pull/8569]

> pyspark reduceByKey should allow a custom partitioner
> -
>
> Key: SPARK-9821
> URL: https://issues.apache.org/jira/browse/SPARK-9821
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Diana Carroll
>Priority: Minor
> Fix For: 1.6.0
>
>
> In Scala, I can supply a custom partitioner to reduceByKey (and other 
> aggregation/repartitioning methods like aggregateByKey and combinedByKey), 
> but as far as I can tell from the Pyspark API, there's no way to do the same 
> in Python.
> Here's an example of my code in Scala:
> {code}weblogs.map(s => (getFileType(s), 1)).reduceByKey(new 
> FileTypePartitioner(),_+_){code}
> But I can't figure out how to do the same in Python. The closest I can get 
> is to call partitionBy before reduceByKey, like so:
> {code}weblogs.map(lambda s: (getFileType(s), 
> 1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: 
> v1+v2).collect(){code}
> But that defeats the purpose, because I'm shuffling twice instead of once, so 
> my performance is worse instead of better.






[jira] [Created] (SPARK-10751) ML Param validate should print better error information

2015-09-22 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10751:
---

 Summary: ML Param validate should print better error information
 Key: SPARK-10751
 URL: https://issues.apache.org/jira/browse/SPARK-10751
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Yanbo Liang
Priority: Minor


Currently, when you set an illegal value for params of array type (such as 
IntArrayParam, DoubleArrayParam, StringArrayParam), it will throw an 
IllegalArgumentException, but with incomprehensible error information.
For example:

val vectorSlicer = new 
VectorSlicer().setInputCol("features").setOutputCol("result")
vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))

It will throw IllegalArgumentException as:
vectorSlicer_b3b4d1a10f43 parameter names given invalid value 
[Ljava.lang.String;@798256c5.
java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
given invalid value [Ljava.lang.String;@798256c5.

Users cannot understand which params were set incorrectly.






[jira] [Resolved] (SPARK-10716) spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden file

2015-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10716.
-
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 1.5.1
   1.6.0

> spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden 
> file
> 
>
> Key: SPARK-10716
> URL: https://issues.apache.org/jira/browse/SPARK-10716
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Deploy
>Affects Versions: 1.5.0
> Environment: Yosemite 10.10.5
>Reporter: Jack Jack
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> Directly downloaded the prebuilt binaries from 
> http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz 
> and got an error when running tar xvzf on it. Tried downloading twice and extracting twice.
> error log:
> ..
> x spark-1.5.0-bin-hadoop2.6/lib/
> x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar
> x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar
> x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-examples-1.5.0-hadoop2.6.0.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-1.5.0-yarn-shuffle.jar
> x spark-1.5.0-bin-hadoop2.6/README.md
> tar: copyfile unpack 
> (spark-1.5.0-bin-hadoop2.6/python/test_support/sql/orc_partitioned/SUCCESS.crc)
>  failed: No such file or directory
> ~ :>






[jira] [Resolved] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10577.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>  Labels: starter
> Fix For: 1.6.0
>
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be the possibility to add a hint for broadcast join in:
> - PySpark






[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-09-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902049#comment-14902049
 ] 

Reynold Xin commented on SPARK-10577:
-

[~maver1ck] the patch is now merged - you can just create a broadcast function 
yourself similar to the ones here in order to use it in Spark 1.5: 
https://github.com/apache/spark/pull/8801/files


> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>  Labels: starter
> Fix For: 1.6.0
>
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be the possibility to add a hint for broadcast join in:
> - PySpark






[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-09-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902073#comment-14902073
 ] 

Maciej Bryński commented on SPARK-10577:


[~rxin]
As I wrote before, I already tested this patch and it works great.
Thank you.

> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>  Labels: starter
> Fix For: 1.6.0
>
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be the possibility to add a hint for broadcast join in:
> - PySpark






[jira] [Resolved] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars

2015-09-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-8567.
--
  Resolution: Fixed
Target Version/s: 1.5.0, 1.4.1, 1.6.0  (was: 1.4.1, 1.5.0, 1.6.0)

> Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
> --
>
> Key: SPARK-8567
> URL: https://issues.apache.org/jira/browse/SPARK-8567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.4.1
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>  Labels: flaky-test
> Fix For: 1.6.0, 1.5.1, 1.5.0, 1.4.1
>
>
> Seems tests in HiveSparkSubmitSuite fail with timeout pretty frequently.






[jira] [Updated] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars

2015-09-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8567:
-
Fix Version/s: 1.5.1, 1.6.0

> Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
> --
>
> Key: SPARK-8567
> URL: https://issues.apache.org/jira/browse/SPARK-8567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.4.1
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>  Labels: flaky-test
> Fix For: 1.4.1, 1.5.0, 1.6.0, 1.5.1
>
>
> Seems tests in HiveSparkSubmitSuite fail with timeout pretty frequently.






[jira] [Closed] (SPARK-10751) ML Param validate should print better error information

2015-09-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang closed SPARK-10751.
---
Resolution: Duplicate

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10751
> URL: https://issues.apache.org/jira/browse/SPARK-10751
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently, when you set an illegal value for params of array type (such as 
> IntArrayParam, DoubleArrayParam, StringArrayParam), it will throw an 
> IllegalArgumentException, but with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It will throw IllegalArgumentException as:
> vectorSlicer_b3b4d1a10f43 parameter names given invalid value 
> [Ljava.lang.String;@798256c5.
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot understand which params were set incorrectly.






[jira] [Commented] (SPARK-10750) ML Param validate should print better error information

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902044#comment-14902044
 ] 

Apache Spark commented on SPARK-10750:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8863

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10750
> URL: https://issues.apache.org/jira/browse/SPARK-10750
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently, when you set an illegal value for params of array type (such as 
> IntArrayParam, DoubleArrayParam, StringArrayParam), it will throw an 
> IllegalArgumentException, but with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It will throw IllegalArgumentException as:
> vectorSlicer_b3b4d1a10f43 parameter names given invalid value 
> [Ljava.lang.String;@798256c5.
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot understand which params were set incorrectly.






[jira] [Assigned] (SPARK-10750) ML Param validate should print better error information

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10750:


Assignee: (was: Apache Spark)

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10750
> URL: https://issues.apache.org/jira/browse/SPARK-10750
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently, when you set an illegal value for params of array type (such as 
> IntArrayParam, DoubleArrayParam, StringArrayParam), it will throw an 
> IllegalArgumentException, but with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It will throw IllegalArgumentException as:
> vectorSlicer_b3b4d1a10f43 parameter names given invalid value 
> [Ljava.lang.String;@798256c5.
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot understand which params were set incorrectly.






[jira] [Assigned] (SPARK-10750) ML Param validate should print better error information

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10750:


Assignee: Apache Spark

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10750
> URL: https://issues.apache.org/jira/browse/SPARK-10750
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, when you set an illegal value for a param of array type (such as 
> IntArrayParam, DoubleArrayParam, StringArrayParam), it throws an 
> IllegalArgumentException with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It will throw IllegalArgumentException as:
> vectorSlicer_b3b4d1a10f43 parameter names given invalid value 
> [Ljava.lang.String;@798256c5.
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot understand which params were set incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10446) Support to specify join type when calling join with usingColumns

2015-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10446.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 1.6.0

> Support to specify join type when calling join with usingColumns
> 
>
> Key: SPARK-10446
> URL: https://issues.apache.org/jira/browse/SPARK-10446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.6.0
>
>
> Currently the method join(right: DataFrame, usingColumns: Seq[String]) only 
> supports inner join. It is more convenient to have it support other join 
> types.
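
A hedged usage sketch of the added overload (the DataFrames and column name below are hypothetical):

{code}
// Sketch: join on shared columns while choosing the join type explicitly.
// Assumes DataFrames `orders` and `customers` that both contain a "customer_id" column.
val joined = orders.join(customers, Seq("customer_id"), "left_outer")
{code}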



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10324) MLlib 1.6 Roadmap

2015-09-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10324:
--
Description: 
Following SPARK-8445, we created this master list for MLlib features we plan to 
have in Spark 1.6. Please view this list as a wish list rather than a concrete 
plan, because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, mixing the development 
process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add the `@Since("1.6.0")` annotation to new public APIs (see the short example after this list).
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.
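
A tiny illustration of the annotation mentioned above (class and member names are made up):

{code}
import org.apache.spark.annotation.Since

@Since("1.6.0")
class ExampleFeatureHelper {
  /** A new public member added in this release. */
  @Since("1.6.0")
  def defaultSetting: Int = 1
}
{code}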

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* log-linear model for survival analysis (SPARK-8518)
* normal equation approach for linear regression (SPARK-9834)
* iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
* robust linear regression with Huber loss (SPARK-3181)
* vector-free L-BFGS (SPARK-10078)
* tree partition by features (SPARK-3717)
* bisecting k-means (SPARK-6517)
* weighted instance support (SPARK-9610)
** logistic regression (SPARK-7685)
** linear regression (SPARK-9642)
** random forest (SPARK-9478)
* locality sensitive hashing (LSH) (SPARK-5992)
* deep learning (SPARK-5575)
** autoencoder (SPARK-10408)
** restricted Boltzmann machine (RBM) (SPARK-4251)
** convolutional neural network (stretch)
* factorization machine (SPARK-7008)
* local linear algebra (SPARK-6442)
* distributed LU decomposition (SPARK-8514)

h2. Statistics

* univariate statistics as UDAFs (SPARK-10384)
* bivariate statistics as UDAFs (SPARK-10385)
* R-like statistics for GLMs (SPARK-9835)
* online hypothesis testing (SPARK-3147)

h2. Pipeline API

* pipeline persistence (SPARK-6725)
* ML attribute API improvements (SPARK-8515)
* feature transformers (SPARK-9930)
** feature interaction (SPARK-9698)
** SQL transformer (SPARK-8345)
** ??
* predict single instance (SPARK-10413)
* test Kaggle datasets (SPARK-9941)

h2. Model persistence

* PMML export
** naive Bayes (SPARK-8546)
** decision tree (SPARK-8542)
* model save/load
** FPGrowth (SPARK-6724)
** PrefixSpan (SPARK-10386)
* code generation
** decision tree and tree ensembles (SPARK-10387)

h2. Data sources

* LIBSVM data source (SPARK-10117)
* public dataset loader (SPARK-10388)

h2. Python API for ML

The main goal of Python API is to have feature parity with Scala/Java API. You 
can find a complete list 
[here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks fall 
into two major categories:

* Python API for new algorithms
* Python API for missing methods (Some listed in [SPARK-10022] and [SPARK-9663])

h2. SparkR API for ML

* support more families and link functions in SparkR::glm (SPARK-9838, 
SPARK-9839, SPARK-9840)
* better R formula support (SPARK-9681)
* model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)

h2. Documentation

* re-organize user guide (SPARK-8517)
* @Since 

[jira] [Commented] (SPARK-10734) DirectKafkaInputDStream uses the OffsetRequest.LatestTime to find the latest offset, however using the batch time would be more desirable.

2015-09-22 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903106#comment-14903106
 ] 

Cody Koeninger commented on SPARK-10734:


As I explained in SPARK-10732, Kafka's getOffsetsBefore API is limited to the 
timestamps on log file segments, so its granularity is quite poor and doesn't 
really behave as one might expect.


> DirectKafkaInputDStream uses the OffsetRequest.LatestTime to find the latest 
> offset, however using the batch time would be more desirable.
> ---
>
> Key: SPARK-10734
> URL: https://issues.apache.org/jira/browse/SPARK-10734
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Bijay Singh Bisht
>
> DirectKafkaInputDStream uses the OffsetRequest.LatestTime to find the latest 
> offset; however, since OffsetRequest.LatestTime is a relative thing, the result 
> depends on when the batch is scheduled. One would imagine that, given an input 
> data set, the data in the batches should be predictable, irrespective of the 
> system conditions. Using the batch time implies that the stream processing 
> will produce the same batches irrespective of when the processing was 
> started and of the load conditions on the system.
> This, along with [SPARK-10732], provides for nice regression scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10756) DataFrame write to teradata using jdbc not working, tries to create table each time irrespective of table existence

2015-09-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10756.
---
Resolution: Duplicate

> DataFrame write to teradata using jdbc not working, tries to create table 
> each time irrespective of table existence
> ---
>
> Key: SPARK-10756
> URL: https://issues.apache.org/jira/browse/SPARK-10756
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Amit
>
> DataFrame write to Teradata using JDBC does not work; it tries to create the 
> table each time, irrespective of whether the table exists. 
> Whenever it goes to persist a DataFrame, it checks for table existence using 
> the query "SELECT 1 FROM $table LIMIT 1", but the LIMIT keyword is not 
> supported in Teradata. The exception Teradata throws for the keyword is 
> interpreted by Spark as the table not existing, and Spark then runs the 
> CREATE TABLE command even though the table is present. 
> The CREATE TABLE command then fails with a "table already exists" exception, 
> so saving the DataFrame fails. 
> Below is the method of JDBCUtils class
> /**
>* Returns true if the table already exists in the JDBC database.
>*/
>   def tableExists(conn: Connection, table: String): Boolean = {
> // Somewhat hacky, but there isn't a good way to identify whether a table 
> exists for all
> // SQL database systems, considering "table" could also include the 
> database name.
> Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 
> 1").executeQuery().next()).isSuccess
>   }
> In the case of Teradata, it returns false for every save/write operation, 
> irrespective of the fact that the table is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10750) ML Param validate should print better error information

2015-09-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10750.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8863
[https://github.com/apache/spark/pull/8863]

> ML Param validate should print better error information
> ---
>
> Key: SPARK-10750
> URL: https://issues.apache.org/jira/browse/SPARK-10750
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Currently, when you set an illegal value for a param of array type (such as 
> IntArrayParam, DoubleArrayParam, StringArrayParam), it throws an 
> IllegalArgumentException with incomprehensible error information.
> For example:
> val vectorSlicer = new 
> VectorSlicer().setInputCol("features").setOutputCol("result")
> vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
> It will throw IllegalArgumentException as:
> vectorSlicer_b3b4d1a10f43 parameter names given invalid value 
> [Ljava.lang.String;@798256c5.
> java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names 
> given invalid value [Ljava.lang.String;@798256c5.
> Users cannot understand which params were set incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5152) Let metrics.properties file take an hdfs:// path

2015-09-22 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902883#comment-14902883
 ] 

Yongjia Wang edited comment on SPARK-5152 at 9/22/15 6:00 PM:
--

I voted for this. 
This would enable configuring the metrics or log4j properties of all the workers 
from one place when submitting the job. Without it, you have to set things up on 
each of the workers. If hdfs:// can be supported, I assume s3n:// and s3a:// 
would all be supported, since they go through the same interface.
Alternatively, it would probably be even better if there were a way, via "conf" 
Spark properties on the spark-submit command line, to upload custom files to the 
Spark executor's working directory before the executor process starts. The 
"spark.files" option uploads the files lazily when the first task starts, which 
is too late for configuration.


was (Author: yongjiaw):
I voted for this. 
It enables configuring metrics or log4j properties of all the workers just from 
the driver. Without it, you will have to setup on each of the workers.
Alternatively, it's probably even better if there is a way, to specify through 
"conf" spark properties in the spark-submit command line, to upload custom 
files to spark executor's working directory before the executor process starts. 
the "spark.files" option upload the files lazily when the first task starts, 
which is too late for configuration.

> Let metrics.properties file take an hdfs:// path
> 
>
> Key: SPARK-5152
> URL: https://issues.apache.org/jira/browse/SPARK-5152
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>
> From my reading of [the 
> code|https://github.com/apache/spark/blob/06dc4b5206a578065ebbb6bb8d54246ca007397f/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L53],
>  the {{spark.metrics.conf}} property must be a path that is resolvable on the 
> local filesystem of each executor.
> Running a Spark job with {{--conf 
> spark.metrics.conf=hdfs://host1.domain.com/path/metrics.properties}} logs 
> many errors (~1 per executor, presumably?) like:
> {code}
> 15/01/08 13:20:57 ERROR metrics.MetricsConfig: Error loading configure file
> java.io.FileNotFoundException: hdfs:/host1.domain.com/path/metrics.properties 
> (No such file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.(FileInputStream.java:146)
> at java.io.FileInputStream.(FileInputStream.java:101)
> at 
> org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:53)
> at 
> org.apache.spark.metrics.MetricsSystem.(MetricsSystem.scala:92)
> at 
> org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:218)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:329)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:181)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:131)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:60)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> {code}
> which seems consistent with the idea that it's looking on the local 
> filesystem and not parsing the "scheme" portion of the URL.
> Letting all executors get their {{metrics.properties}} files from one 
> location on HDFS would be an improvement, right?
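
A minimal sketch of what resolving the path through Hadoop's FileSystem API (rather than a raw FileInputStream) could look like; this is an illustration, not the existing MetricsConfig code:

{code}
// Sketch only: resolve spark.metrics.conf through the Hadoop FileSystem API so that
// hdfs://, s3n://, and plain local paths are all handled by the scheme in the URI.
import java.util.Properties
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def loadMetricsProperties(uri: String, hadoopConf: Configuration): Properties = {
  val path = new Path(uri)
  val fs = path.getFileSystem(hadoopConf) // picks the filesystem implementation from the scheme
  val in = fs.open(path)
  val props = new Properties()
  try props.load(in) finally in.close()
  props
}
{code}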



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10593) sql lateral view same name gives wrong value

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10593.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8755
[https://github.com/apache/spark/pull/8755]

> sql lateral view same name gives wrong value
> 
>
> Key: SPARK-10593
> URL: https://issues.apache.org/jira/browse/SPARK-10593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> This query will return wrong result:
> {code}
> select 
> insideLayer1.json as json_insideLayer1, 
> insideLayer2.json as json_insideLayer2 
> from (select '1' id) creatives 
> lateral view json_tuple('{"layer1": {"layer2": "text inside layer 2"}}', 
> 'layer1') insideLayer1 as json 
> lateral view json_tuple(insideLayer1.json, 'layer2') insideLayer2 as json 
> {code}
> It got 
> {code}
> ( {"layer2": "text inside layer 2"},  {"layer2": "text inside layer 2"})
> {code}
> instead of
> {code}
> ( {"layer2": "text inside layer 2"},  text inside layer 2)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902937#comment-14902937
 ] 

Marcelo Vanzin commented on SPARK-10739:


Sean might be talking about SPARK-6735.

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on Yarn uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long-running 
> applications, since they will easily exceed the max number over a long time 
> period, and setting a very large max attempts will hide the real problem.
> So here we introduce an attempt window to control the application attempt 
> count; attempts outside the window are ignored. This was introduced in Hadoop 
> 2.6+ to support long-running applications and is quite useful for Spark 
> Streaming, Spark shell, and similar applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10760) SparkR glm: the documentation in examples - family argument is missing

2015-09-22 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-10760:
-

 Summary: SparkR glm: the documentation in examples - family 
argument is missing
 Key: SPARK-10760
 URL: https://issues.apache.org/jira/browse/SPARK-10760
 Project: Spark
  Issue Type: Documentation
  Components: SparkR
Reporter: Narine Kokhlikyan
Priority: Minor


Hi everyone,

Since the family argument is required for the glm function, the execution of:

model <- glm(Sepal_Length ~ Sepal_Width, df) 

is failing.

I've fixed the documentation by adding the family argument and also added a 
summary(model) call, which will show the coefficients for the model. 

Thanks,
Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10760) SparkR glm: the documentation in examples - family argument is missing

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10760:


Assignee: (was: Apache Spark)

> SparkR glm: the documentation in examples - family argument is missing
> --
>
> Key: SPARK-10760
> URL: https://issues.apache.org/jira/browse/SPARK-10760
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Hi everyone,
> Since the family argument is required for the glm function, the execution of:
> model <- glm(Sepal_Length ~ Sepal_Width, df) 
> is failing.
> I've fixed the documentation by adding the family argument and also added a 
> summary(model) call, which will show the coefficients for the model. 
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9585) HiveHBaseTableInputFormat can't be cached

2015-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9585.

   Resolution: Fixed
Fix Version/s: 1.6.0

> HiveHBaseTableInputFormat can't be cached
> ---
>
> Key: SPARK-9585
> URL: https://issues.apache.org/jira/browse/SPARK-9585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: meiyoula
> Fix For: 1.6.0
>
>
> The exception below occurs in the Spark on HBase function.
> {quote}
> java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: 
> Task 
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@11c6577
>  rejected from java.util.concurrent.ThreadPoolExecutor@3414350b[Terminated, 
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17451]
> {quote}
> When an executor has many cores, tasks belonging to the same RDD will use the 
> same InputFormat, but HiveHBaseTableInputFormat is not thread safe.
> So I think we should add a config to enable or disable caching of the InputFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10760) SparkR glm: the documentation in examples - family argument is missing

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903113#comment-14903113
 ] 

Apache Spark commented on SPARK-10760:
--

User 'NarineK' has created a pull request for this issue:
https://github.com/apache/spark/pull/8870

> SparkR glm: the documentation in examples - family argument is missing
> --
>
> Key: SPARK-10760
> URL: https://issues.apache.org/jira/browse/SPARK-10760
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Hi everyone,
> Since the family argument is required for the glm function, the execution of:
> model <- glm(Sepal_Length ~ Sepal_Width, df) 
> is failing.
> I've fixed the documentation by adding the family argument and also added a 
> summary(model) call, which will show the coefficients for the model. 
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903136#comment-14903136
 ] 

Joseph K. Bradley commented on SPARK-7129:
--

It's not really on the roadmap for 1.6, so I shouldn't make promises.  The main 
issue for me is a set of design questions:
* Should boosting depend on the prediction abstractions (Classifier, Regressor, 
etc.)?  If so, are those abstractions sufficient, or should they be turned into 
traits?

If you're interested, it would be valuable to get your input on designing the 
abstractions.  Would you be able to write a short design doc?  I figure we 
should:
* List the boosting algorithms of interest
* List what requirements those algorithms place on the base learner
* Design minimal abstractions which describe those requirements
* See how those abstractions compare with MLlib's current abstractions, and if 
we need to rethink them

If you have time for that, it'd be great if you could post it here as a Google 
doc or PDF to collect feedback.  Thanks!
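
To make the discussion concrete, here is a purely hypothetical sketch (not a proposed API) of the minimal contract an AdaBoost-style booster needs from a base learner; all names are made up for illustration:

{code}
// Hypothetical sketch only: an AdaBoost-style booster needs a base learner that can
// train on weighted examples and return a {-1.0, +1.0} predictor.
trait WeightedBaseLearner {
  // (features, label in {-1.0, +1.0}, weight) => trained predictor
  def fit(data: Seq[(Array[Double], Double, Double)]): Array[Double] => Double
}

def adaBoost(data: Seq[(Array[Double], Double)],
             learner: WeightedBaseLearner,
             iterations: Int): Seq[(Double, Array[Double] => Double)] = {
  var weights = Array.fill(data.size)(1.0 / data.size)
  (0 until iterations).map { _ =>
    val weighted = data.zip(weights).map { case ((x, y), w) => (x, y, w) }
    val h = learner.fit(weighted)
    val err = data.zip(weights).collect { case ((x, y), w) if h(x) != y => w }.sum
    val alpha = 0.5 * math.log((1.0 - err) / math.max(err, 1e-12))
    weights = data.zip(weights)
      .map { case ((x, y), w) => w * math.exp(-alpha * y * h(x)) }.toArray
    val norm = weights.sum
    weights = weights.map(_ / norm)
    (alpha, h) // the ensemble votes with sign(sum(alpha_i * h_i(x)))
  }
}
{code}

Whether something like this should sit on top of Classifier/Regressor, or on leaner traits, is exactly the design question above.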

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10638) spark streaming stop gracefully keeps the spark context

2015-09-22 Thread Mamdouh Alramadan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903018#comment-14903018
 ] 

Mamdouh Alramadan commented on SPARK-10638:
---

Any updates on this issue?

> spark streaming stop gracefully keeps the spark context
> ---
>
> Key: SPARK-10638
> URL: https://issues.apache.org/jira/browse/SPARK-10638
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: Mamdouh Alramadan
>
> With Spark 1.4 on a Mesos cluster, I am trying to stop the context with a 
> graceful shutdown. I have seen this mailing list thread that [~tdas] addressed:
> http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E
> which introduces a new config that was not documented. However, even with it 
> included, the streaming job still stops correctly but the process doesn't die 
> afterwards, i.e. the SparkContext is still running. My Mesos UI still sees 
> the framework, which is still allocating all the cores it needs.
> the code used for the shutdown hook is:
> {code:title=Start.scala|borderStyle=solid}
> sys.ShutdownHookThread {
> logInfo("Received SIGTERM, calling streaming stop")
> streamingContext.stop(stopSparkContext = true, stopGracefully = true)
> logInfo("Application Stopped")
>   }
> {code}
> The logs for this process are:
> {code:title=SparkLogs|borderStyle=solid}
> 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop
> 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully
> 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be 
> consumed for job generation
> 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be 
> consumed for job generation
> 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) 
> from shutdown hook
> 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after 
> time 144242148
> 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 
> ms.0 from job set of time 144242148 ms
> 15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer
> 15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and 
> checkpoints to be written
> 15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms
> 15/09/16 16:38:00 INFO JobGenerator: Checkpointing graph for time 
> 144242148 ms
> 15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 
> 144242148 ms
> 15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 
> 144242148 ms
> 15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at 
> StreamDigest.scala:21
> 15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at 
> StreamDigest.scala:21) with 1 output partitions (allowLocal=true)
> 15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD 
> at StreamDigest.scala:21)
> 15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List()
> 15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 
> 144242148 ms to file 
> 'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148'
> 15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List()
> .
> .
> .
> .
> 15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and 
> checkpoints to be written
> 15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated 
> ? true, waited for 1 ms.
> 15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator
> 15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler
> {code}
> And in my spark-defaults.conf I included
> {code:title=spark-defaults.conf|borderStyle=solid}
> spark.streaming.stopGracefullyOnShutdowntrue
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10691) Make LogisticRegressionModel.evaluate() method public

2015-09-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903070#comment-14903070
 ] 

Joseph K. Bradley commented on SPARK-10691:
---

We could document that `evaluate` calls `transform`, so users can change model 
parameters before calling evaluate.  Or, we can have it like transform and take 
a ParamMap to configure parameters.

I'm not sure how to handle extra parameters such as binning for evaluation 
metrics.  However, if a user knows enough to want to adjust something like 
binning, then they should be able to do evaluation manually easily.

I'm ambivalent about `evaluate` vs `score`.
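
A rough sketch of the second option, purely for illustration (this fragment would live inside LogisticRegressionModel; it is not an agreed-upon signature):

{code}
// Sketch only: an evaluate overload that applies extra params first,
// mirroring how transform(dataset, paramMap) behaves.
def evaluate(dataset: DataFrame, paramMap: ParamMap): LogisticRegressionSummary = {
  copy(paramMap).evaluate(dataset)
}
{code}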

> Make LogisticRegressionModel.evaluate() method public
> -
>
> Key: SPARK-10691
> URL: https://issues.apache.org/jira/browse/SPARK-10691
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Hao Ren
>
> The following method in {{LogisticRegressionModel}} is marked as {{private}}, 
> which prevents users from creating a summary on any given data set. Check 
> [here|https://github.com/feynmanliang/spark/blob/d219fa4c216e8f35b71a26921561104d15cd6055/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L272].
> {code}
> // TODO: decide on a good name before exposing to public API
> private[classification] def evaluate(dataset: DataFrame)
> : LogisticRegressionSummary = {
> new BinaryLogisticRegressionSummary(
> this.transform(dataset), 
> $(probabilityCol), 
> $(labelCol))
> }
> {code}
> This method is definitely necessary to test model performance.
> By the way, the name {{evaluate}} is already pretty good for me.
> [~mengxr] Could you check this? Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902967#comment-14902967
 ] 

Sean Owen commented on SPARK-10739:
---

Yessir that's the one, thanks. [~jerryshao] it looks pretty similar, is that 
the same as your intent?

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on Yarn uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long-running 
> applications, since they will easily exceed the max number over a long time 
> period, and setting a very large max attempts will hide the real problem.
> So here we introduce an attempt window to control the application attempt 
> count; attempts outside the window are ignored. This was introduced in Hadoop 
> 2.6+ to support long-running applications and is quite useful for Spark 
> Streaming, Spark shell, and similar applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10759) Missing Python code example in ML Programming guide

2015-09-22 Thread Raela Wang (JIRA)
Raela Wang created SPARK-10759:
--

 Summary: Missing Python code example in ML Programming guide
 Key: SPARK-10759
 URL: https://issues.apache.org/jira/browse/SPARK-10759
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.5.0
Reporter: Raela Wang
Priority: Minor


http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation

http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-22 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902982#comment-14902982
 ] 

Sandy Ryza commented on SPARK-10739:


That's the one I was referring to as well.  That's about executor failure, 
whereas this is about AM failure, so they're different issues.

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on Yarn uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long-running 
> applications, since they will easily exceed the max number over a long time 
> period, and setting a very large max attempts will hide the real problem.
> So here we introduce an attempt window to control the application attempt 
> count; attempts outside the window are ignored. This was introduced in Hadoop 
> 2.6+ to support long-running applications and is quite useful for Spark 
> Streaming, Spark shell, and similar applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10739) Add attempt window for long running Spark application on Yarn

2015-09-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902996#comment-14902996
 ] 

Saisai Shao commented on SPARK-10739:
-

Yes, as Sandy mentioned above, SPARK-6735 is focused on executor failure, 
whereas this PR is focused on AM failure, so this is different. Also, what I did 
is pass this parameter to the Yarn RM and let Yarn control the attempt window.
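
For reference, the Hadoop 2.6+ hook this relies on is ApplicationSubmissionContext.setAttemptFailuresValidityInterval; a hedged sketch of passing it through from the client (the Spark config key below is a placeholder, not a settled name):

{code}
// Sketch only: forward an attempt-failures validity interval to the YARN RM when
// building the ApplicationSubmissionContext (API available in Hadoop 2.6+).
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext
import org.apache.spark.SparkConf

def applyAttemptWindow(appContext: ApplicationSubmissionContext, conf: SparkConf): Unit = {
  // Placeholder config key, used here for illustration only.
  val intervalMs = conf.getLong("spark.yarn.am.attemptFailuresValidityInterval", -1L)
  if (intervalMs > 0) {
    appContext.setAttemptFailuresValidityInterval(intervalMs)
  }
}
{code}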

> Add attempt window for long running Spark application on Yarn
> -
>
> Key: SPARK-10739
> URL: https://issues.apache.org/jira/browse/SPARK-10739
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark on Yarn uses max attempts to control the failure count: if an 
> application's failure count reaches the max attempts, the application will 
> not be recovered by the RM. This is not very effective for long-running 
> applications, since they will easily exceed the max number over a long time 
> period, and setting a very large max attempts will hide the real problem.
> So here we introduce an attempt window to control the application attempt 
> count; attempts outside the window are ignored. This was introduced in Hadoop 
> 2.6+ to support long-running applications and is quite useful for Spark 
> Streaming, Spark shell, and similar applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9962) Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training

2015-09-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9962.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8541
[https://github.com/apache/spark/pull/8541]

> Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training
> --
>
> Key: SPARK-9962
> URL: https://issues.apache.org/jira/browse/SPARK-9962
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: holdenk
>Priority: Minor
> Fix For: 1.6.0
>
>
> NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of 
> training.
> This applies to both the ML and MLlib implementations, but it is Ok to skip 
> the MLlib implementation since it will eventually be replaced by the ML one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10729) word2vec model save for python

2015-09-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10729:
--
Target Version/s: 1.6.0

> word2vec model save for python
> --
>
> Key: SPARK-10729
> URL: https://issues.apache.org/jira/browse/SPARK-10729
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Joseph A Gartner III
>
> The ability to save a word2vec model has not been ported to Python; it would 
> be extremely useful to have, given the long training period.
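
For reference, a hedged sketch of the existing Scala/MLlib calls that a Python port would mirror (the paths and input corpus are hypothetical, and `sc` is assumed to be a live SparkContext):

{code}
// Existing Scala API that the Python port would mirror; paths are made up.
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val corpus = sc.textFile("corpus.txt").map(_.split(" ").toSeq)
val model: Word2VecModel = new Word2Vec().fit(corpus)

model.save(sc, "hdfs:///models/word2vec")
val reloaded = Word2VecModel.load(sc, "hdfs:///models/word2vec")
{code}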



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10732) Starting spark streaming from a specific point in time.

2015-09-22 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903110#comment-14903110
 ] 

Cody Koeninger commented on SPARK-10732:


As I already said, Kafka's implementation of getOffsetsBefore is based on log 
file timestamps, not a granular index, so it really isn't accurate enough for 
this.
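
For what it's worth, callers who track offsets themselves can already start a direct stream from an explicit position; a hedged sketch against the Spark 1.5 Kafka direct API (the topic, broker, and offset values are made up, and `ssc` is assumed to be a StreamingContext):

{code}
// Sketch: start the direct Kafka stream from explicit offsets instead of
// OffsetRequest.LatestTime. Topic, broker, and offsets are placeholders.
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(TopicAndPartition("events", 0) -> 12345L)

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
{code}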

> Starting spark streaming from a specific point in time.
> ---
>
> Key: SPARK-10732
> URL: https://issues.apache.org/jira/browse/SPARK-10732
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.0
>Reporter: Bijay Singh Bisht
>
> Currently, Spark Streaming either starts from the current time or from the 
> latest checkpoint. It would be extremely useful to start from any arbitrary 
> point. This would be useful in replay scenarios or in running regression tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10760) SparkR glm: the documentation in examples - family argument is missing

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10760:


Assignee: Apache Spark

> SparkR glm: the documentation in examples - family argument is missing
> --
>
> Key: SPARK-10760
> URL: https://issues.apache.org/jira/browse/SPARK-10760
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Hi everyone,
> Since the family argument is required for the glm function, the execution of:
> model <- glm(Sepal_Length ~ Sepal_Width, df) 
> is failing.
> I've fixed the documentation by adding the family argument and also added a 
> summary(model) call, which will show the coefficients for the model. 
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10756) DataFrame write to teradata using jdbc not working, tries to create table each time irrespective of table existence

2015-09-22 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903100#comment-14903100
 ] 

Suresh Thalamati commented on SPARK-10756:
--

This issue is similar to https://issues.apache.org/jira/browse/SPARK-9078. That 
fix might address saving to Teradata as well. The default table-exists check 
query is changed to: "SELECT * FROM $table WHERE 1=0".
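
A hedged sketch of what the existence check looks like with that query (illustrative only, not the merged patch):

{code}
// Sketch only: a dialect-neutral existence check using "WHERE 1=0" rather than LIMIT,
// which Teradata does not support.
import java.sql.Connection
import scala.util.Try

def tableExists(conn: Connection, table: String): Boolean = {
  Try {
    val stmt = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0")
    try stmt.executeQuery() finally stmt.close()
  }.isSuccess
}
{code}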


> DataFrame write to teradata using jdbc not working, tries to create table 
> each time irrespective of table existence
> ---
>
> Key: SPARK-10756
> URL: https://issues.apache.org/jira/browse/SPARK-10756
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Amit
>
> DataFrame write to Teradata using JDBC does not work; it tries to create the 
> table each time, irrespective of whether the table exists. 
> Whenever it goes to persist a DataFrame, it checks for table existence using 
> the query "SELECT 1 FROM $table LIMIT 1", but the LIMIT keyword is not 
> supported in Teradata. The exception Teradata throws for the keyword is 
> interpreted by Spark as the table not existing, and Spark then runs the 
> CREATE TABLE command even though the table is present. 
> The CREATE TABLE command then fails with a "table already exists" exception, 
> so saving the DataFrame fails. 
> Below is the method of JDBCUtils class
> /**
>* Returns true if the table already exists in the JDBC database.
>*/
>   def tableExists(conn: Connection, table: String): Boolean = {
> // Somewhat hacky, but there isn't a good way to identify whether a table 
> exists for all
> // SQL database systems, considering "table" could also include the 
> database name.
> Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 
> 1").executeQuery().next()).isSuccess
>   }
> In the case of Teradata, it returns false for every save/write operation, 
> irrespective of the fact that the table is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10381) Infinite loop when OutputCommitCoordination is enabled and OutputCommitter.commitTask throws exception

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10381:
---
Fix Version/s: 1.3.2

> Infinite loop when OutputCommitCoordination is enabled and 
> OutputCommitter.commitTask throws exception
> --
>
> Key: SPARK-10381
> URL: https://issues.apache.org/jira/browse/SPARK-10381
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1
>
>
> When speculative execution is enabled, consider a scenario where the 
> authorized committer of a particular output partition fails during the 
> OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator 
> is supposed to release that committer's exclusive lock on committing once 
> that task fails. However, due to a unit mismatch the lock will not be 
> released, causing Spark to go into an infinite retry loop.
> This bug was masked by the fact that the OutputCommitCoordinator does not 
> have enough end-to-end tests (the current tests use many mocks). Other 
> factors contributing to this bug are the fact that we have many 
> similarly-named identifiers that have different semantics but the same data 
> types (e.g. attemptNumber and taskAttemptId, with inconsistent variable 
> naming which makes them difficult to distinguish).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8447) Test external shuffle service with all shuffle managers

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8447:
--
Target Version/s: 1.6.0  (was: 1.5.1)

> Test external shuffle service with all shuffle managers
> ---
>
> Key: SPARK-8447
> URL: https://issues.apache.org/jira/browse/SPARK-8447
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Priority: Critical
>
> There is a mismatch between the shuffle managers in Spark core and in the 
> external shuffle service. The latest unsafe shuffle manager is an example of 
> this (SPARK-8430). This issue arose because we apparently do not have 
> sufficient tests for making sure that these two components deal with the same 
> set of shuffle managers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10749) Support multiple roles with Spark Mesos dispatcher

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10749:


Assignee: (was: Apache Spark)

> Support multiple roles with Spark Mesos dispatcher
> --
>
> Key: SPARK-10749
> URL: https://issues.apache.org/jira/browse/SPARK-10749
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Although you can currently set the framework role of the Mesos dispatcher, it 
> doesn't correctly use the offers given to it.
> It should follow how the coarse-grained and fine-grained schedulers work and 
> use offers from multiple roles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-09-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903350#comment-14903350
 ] 

Joseph K. Bradley commented on SPARK-8418:
--

New idea: We could allow transformers to leverage RFormula. That might be the 
nicest way to specify a bunch of columns and leverage existing code for 
assembling them.
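
To illustrate the idea, a hedged sketch of RFormula assembling several input columns into a single vector column (the DataFrame `df` and its column names are hypothetical):

{code}
// Sketch: RFormula (already in ml.feature) assembles multiple columns into one
// features vector; the column names here are made up for illustration.
import org.apache.spark.ml.feature.RFormula

val formula = new RFormula()
  .setFormula("label ~ age + income + country")
  .setFeaturesCol("features")
  .setLabelCol("label")

val prepared = formula.fit(df).transform(df)
{code}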

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6701) Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6701:
--
Target Version/s:   (was: 1.5.1)

> Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application
> -
>
> Key: SPARK-6701
> URL: https://issues.apache.org/jira/browse/SPARK-6701
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, YARN
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Priority: Critical
>
> Observed in Master and 1.3, both in SBT and in Maven (with YARN).
> {code}
> Process 
> List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
>  --master, yarn-cluster, --num-executors, 1, --properties-file, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/spark968020731409047027.properties,
>  --py-files, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test2.py, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test.py, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/result961582960984674264.tmp) 
> exited with code 1
> sbt.ForkMain$ForkError: Process 
> List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
>  --master, yarn-cluster, --num-executors, 1, --properties-file, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/spark968020731409047027.properties,
>  --py-files, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test2.py, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test.py, 
> /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/result961582960984674264.tmp) 
> exited with code 1
>   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7420) Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block data too soon"

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7420:
--
Target Version/s:   (was: 1.5.1)

> Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block 
> data too soon"
> -
>
> Key: SPARK-7420
> URL: https://issues.apache.org/jira/browse/SPARK-7420
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Andrew Or
>Assignee: Tathagata Das
>Priority: Critical
>  Labels: flaky-test
>
> {code}
> The code passed to eventually never returned normally. Attempted 18 times 
> over 10.13803606001 seconds. Last failure message: 
> receiverTracker.hasUnallocatedBlocks was false.
> {code}
> It seems to be failing only in maven.
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/458/
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/459/
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2173/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6484) Ganglia metrics xml reporter doesn't escape correctly

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6484:
--
Target Version/s:   (was: 1.5.1)

I'm going to untarget this from 1.5.1 because, as far as I know, this is not a 
problem with newer versions of our Ganglia dependency.

> Ganglia metrics xml reporter doesn't escape correctly
> -
>
> Key: SPARK-6484
> URL: https://issues.apache.org/jira/browse/SPARK-6484
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Michael Armbrust
>Assignee: Josh Rosen
>Priority: Critical
>
> The following should be escaped:
> {code}
> "   
> '   
> <   
> >   
> &   
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10685) Misaligned data with RDD.zip and DataFrame.withColumn after repartition

2015-09-22 Thread Dan Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903512#comment-14903512
 ] 

Dan Brown commented on SPARK-10685:
---

Thanks for fixing the Python UDF part of the issue!

What about the zip after repartition behavior? That seems like a bug since it's 
surprising and easy to do without realizing it.

> Misaligned data with RDD.zip and DataFrame.withColumn after repartition
> ---
>
> Key: SPARK-10685
> URL: https://issues.apache.org/jira/browse/SPARK-10685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> Here's a weird behavior where {{RDD.zip}} or {{DataFrame.withColumn}} after a 
> {{repartition}} produces "misaligned" data, meaning different column values 
> in the same row aren't matched, as if a zip shuffled the collections before 
> zipping them. It's difficult to reproduce because it's nondeterministic, 
> doesn't occur in local mode, and requires ≥2 workers (≥3 in one case). I was 
> able to repro it using pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), 
> and 1.5.0 (bin-without-hadoop).
> Here's the most similar issue I was able to find. It appears to not have been 
> repro'd and then closed optimistically, and it smells like it could have been 
> the same underlying cause that was never fixed:
> - https://issues.apache.org/jira/browse/SPARK-9131
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Brief repro
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> [r for r in df.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> [Row(a=39, b=639), Row(a=139, b=739), Row(a=239, b=839)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> []
> [Row(a=641, b=41), Row(a=741, b=141), Row(a=841, b=241)]
> [Row(a=641, b=1343), Row(a=741, b=1443), Row(a=841, b=1543)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> {code}
> Fail: RDD.zip after DataFrame.repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}
> h3. Detailed repro
> Start `pyspark` and run these imports:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StructType, StructField
> {code}
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Ok: withColumn(udf) after DataFrame.repartition(100) after 1 starting 
> partition
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: withColumn(udf) after DataFrame.repartition(100) after 100 starting 
> partitions
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in 

[jira] [Commented] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903749#comment-14903749
 ] 

Apache Spark commented on SPARK-10663:
--

User 'hagenhaus' has created a pull request for this issue:
https://github.com/apache/spark/pull/8875

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Priority: Trivial
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10705) Stop converting internal rows to external rows in DataFrame.toJSON

2015-09-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10705:
---
Assignee: Liang-Chi Hsieh

> Stop converting internal rows to external rows in DataFrame.toJSON
> --
>
> Key: SPARK-10705
> URL: https://issues.apache.org/jira/browse/SPARK-10705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Cheng Lian
>Assignee: Liang-Chi Hsieh
>
> {{DataFrame.toJSON}} uses {{DataFrame.mapPartitions}}, which converts 
> internal rows to external rows. We can use 
> {{queryExecution.toRdd.mapPartitions}} instead for better performance.
> Another issue is that, for UDT values, {{serialize}} produces internal types. 
> So currently we must deal with both internal and external types within 
> {{toJSON}} (see 
> [here|https://github.com/apache/spark/pull/8806/files#diff-0f04c36e499d4dcf6931fbd62b3aa012R77]),
>  which is pretty weird.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10733) TungstenAggregation cannot acquire page after switching to sort-based

2015-09-22 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903740#comment-14903740
 ] 

Yi Zhou commented on SPARK-10733:
-

Yes, I still got the error after applying the commit.

> TungstenAggregation cannot acquire page after switching to sort-based
> -
>
> Key: SPARK-10733
> URL: https://issues.apache.org/jira/browse/SPARK-10733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> This is uncovered after fixing SPARK-10474. Stack trace:
> {code}
> 15/09/21 12:51:46 WARN scheduler.TaskSetManager: Lost task 115.0 in stage 
> 152.0 (TID 1736, bb-node2): java.io.IOException: Unable to acquire 16777216 
> bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:378)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:359)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:488)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:144)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:465)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10663:


Assignee: (was: Apache Spark)

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Priority: Trivial
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access

2015-09-22 Thread Amey Ghadigaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903750#comment-14903750
 ] 

Amey Ghadigaonkar commented on SPARK-7442:
--

Getting the same error with Spark 1.4.1 running on Hadoop 2.6.0. I am reverting 
back to Hadoop 2.4.0 till this is fixed.

> Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
> -
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
> Environment: OS X
>Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads 
> page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 
> works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 
> that doesn't work.
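
(Editorial note, hedged: in Hadoop 2.6 the s3/s3n filesystem classes moved into the separate hadoop-aws module and the scheme is no longer registered by default, which matches the "No FileSystem for scheme: s3n" error above. If hadoop-aws and the AWS SDK it depends on are on the driver and executor classpath, one possible workaround is to point the scheme at the implementation class explicitly; this is a sketch, not an official fix.)

{code}
// Assumes hadoop-aws (and its AWS SDK dependency) is already on the classpath;
// without those jars there is no class to load for the s3n scheme at all.
sc.hadoopConfiguration.set("fs.s3n.impl",
  "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "...")        // placeholder credentials
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "...")
sc.textFile("s3n://bucket/file_*").count()
{code}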



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10663:


Assignee: Apache Spark

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Assignee: Apache Spark
>Priority: Trivial
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10731) The head() implementation of dataframe is very slow

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903818#comment-14903818
 ] 

Apache Spark commented on SPARK-10731:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8876

> The head() implementation of dataframe is very slow
> ---
>
> Key: SPARK-10731
> URL: https://issues.apache.org/jira/browse/SPARK-10731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Jerry Lam
>  Labels: pyspark
>
> {code}
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
> {code}
> The above lines take over 15 minutes. It seems the DataFrame requires 3 
> stages to return the first row. It reads all the data (about 1 billion 
> rows) and runs Limit twice. The take(1) implementation in the RDD performs 
> much better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8882) A New Receiver Scheduling Mechanism to solve unbalanced receivers

2015-09-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-8882:
-
Summary: A New Receiver Scheduling Mechanism to solve unbalanced receivers  
(was: A New Receiver Scheduling Mechanism)

> A New Receiver Scheduling Mechanism to solve unbalanced receivers
> -
>
> Key: SPARK-8882
> URL: https://issues.apache.org/jira/browse/SPARK-8882
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.5.0
>
>
> There are some problems in the current mechanism:
>  - If a task fails more than “spark.task.maxFailures” (default: 4) times, the 
> job will fail. For long-running Streaming applications, it’s possible that 
> a Receiver task fails more than 4 times because of executor loss.
> - When an executor is lost, the Receiver tasks on it will be rescheduled. 
> However, because there may be many Spark jobs running at the same time, it’s 
> possible that the TaskScheduler cannot schedule them in a way that keeps 
> Receivers evenly distributed.
> To solve such limitations, we need to change the receiver scheduling 
> mechanism. Here is the design doc: 
> https://docs.google.com/document/d/1ZsoRvHjpISPrDmSjsGzuSu8UjwgbtmoCTzmhgTurHJw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8882) A New Receiver Scheduling Mechanism

2015-09-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-8882:
-
Description: 
There are some problems in the current mechanism:
 - If a task fails more than “spark.task.maxFailures” (default: 4) times, the 
job will fail. For long-running Streaming applications, it’s possible that a 
Receiver task fails more than 4 times because of executor loss.
- When an executor is lost, the Receiver tasks on it will be rescheduled. 
However, because there may be many Spark jobs running at the same time, it’s 
possible that the TaskScheduler cannot schedule them in a way that keeps 
Receivers evenly distributed.

To solve such limitations, we need to change the receiver scheduling mechanism. 
Here is the design doc: 
https://docs.google.com/document/d/1ZsoRvHjpISPrDmSjsGzuSu8UjwgbtmoCTzmhgTurHJw/edit?usp=sharing

  was:The design doc: 
https://docs.google.com/document/d/1ZsoRvHjpISPrDmSjsGzuSu8UjwgbtmoCTzmhgTurHJw/edit?usp=sharing


> A New Receiver Scheduling Mechanism
> ---
>
> Key: SPARK-8882
> URL: https://issues.apache.org/jira/browse/SPARK-8882
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.5.0
>
>
> There are some problems in the current mechanism:
>  - If a task fails more than “spark.task.maxFailures” (default: 4) times, the 
> job will fail. For long-running Streaming applications, it’s possible that 
> a Receiver task fails more than 4 times because of executor loss.
> - When an executor is lost, the Receiver tasks on it will be rescheduled. 
> However, because there may be many Spark jobs running at the same time, it’s 
> possible that the TaskScheduler cannot schedule them in a way that keeps 
> Receivers evenly distributed.
> To solve such limitations, we need to change the receiver scheduling 
> mechanism. Here is the design doc: 
> https://docs.google.com/document/d/1ZsoRvHjpISPrDmSjsGzuSu8UjwgbtmoCTzmhgTurHJw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10672) We should not fail to create a table If we cannot persist metadata of a data source table to metastore in a Hive compatible way

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10672.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8824
[https://github.com/apache/spark/pull/8824]

> We should not fail to create a table If we cannot persist metadata of a data 
> source table to metastore in a Hive compatible way
> ---
>
> Key: SPARK-10672
> URL: https://issues.apache.org/jira/browse/SPARK-10672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> It is possible that Hive has some internal restrictions on what kinds of 
> table metadata it accepts (e.g. Hive 0.13 does not support decimal stored in 
> parquet). If that is the case, we should not fail table creation just because 
> the metadata cannot be stored in a Hive compatible way. We should just save 
> it in the Spark SQL specific format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10737) When using UnsafeRows, SortMergeJoin may return wrong results

2015-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10737.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8854
[https://github.com/apache/spark/pull/8854]

> When using UnsafeRows, SortMergeJoin may return wrong results
> -
>
> Key: SPARK-10737
> URL: https://issues.apache.org/jira/browse/SPARK-10737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> {code}
> val df1 = (1 to 10).map(i => (s"str_$i", i)).toDF("i", "j")
> val df2 =
>   df1
>   .join(df1.select(df1("i")), "i")
>   .select(df1("i"), df1("j"))
> val df3 = df2.withColumnRenamed("i", "i1").withColumnRenamed("j", "j1")
> val df4 =
>   df2
>   .join(df3, df2("i") === df3("i1"))
>   .withColumn("diff", $"j" - $"j1")
> df4.show(100, false)
> +------+---+------+---+----+
> |i     |j  |i1    |j1 |diff|
> +------+---+------+---+----+
> |str_2 |2  |str_2 |2  |0   |
> |str_7 |7  |str_2 |2  |5   |
> |str_10|10 |str_10|10 |0   |
> |str_3 |3  |str_3 |3  |0   |
> |str_8 |8  |str_3 |3  |5   |
> |str_4 |4  |str_4 |4  |0   |
> |str_9 |9  |str_4 |4  |5   |
> |str_5 |5  |str_5 |5  |0   |
> |str_1 |1  |str_1 |1  |0   |
> |str_6 |6  |str_1 |1  |5   |
> +------+---+------+---+----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10761) Refactor DiskBlockObjectWriter to not require BlockId

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903450#comment-14903450
 ] 

Apache Spark commented on SPARK-10761:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8871

> Refactor DiskBlockObjectWriter to not require BlockId
> -
>
> Key: SPARK-10761
> URL: https://issues.apache.org/jira/browse/SPARK-10761
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> The DiskBlockObjectWriter constructor takes a BlockId parameter but never 
> uses it internally. We should try to clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10761) Refactor DiskBlockObjectWriter to not require BlockId

2015-09-22 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-10761:
--

 Summary: Refactor DiskBlockObjectWriter to not require BlockId
 Key: SPARK-10761
 URL: https://issues.apache.org/jira/browse/SPARK-10761
 Project: Spark
  Issue Type: Sub-task
  Components: Block Manager
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Minor


The DiskBlockObjectWriter constructor takes a BlockId parameter but never uses 
it internally. We should try to clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10333) Add user guide for linear-methods.md columns

2015-09-22 Thread Lauren Moos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903448#comment-14903448
 ] 

Lauren Moos commented on SPARK-10333:
-

I'd be happy to work on this 

> Add user guide for linear-methods.md columns
> 
>
> Key: SPARK-10333
> URL: https://issues.apache.org/jira/browse/SPARK-10333
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Add example code to document input output columns based on 
> https://github.com/apache/spark/pull/8491 feedback



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10685) Misaligned data with RDD.zip and DataFrame.withColumn after repartition

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10685:
---
Assignee: Reynold Xin

> Misaligned data with RDD.zip and DataFrame.withColumn after repartition
> ---
>
> Key: SPARK-10685
> URL: https://issues.apache.org/jira/browse/SPARK-10685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> Here's a weird behavior where {{RDD.zip}} or {{DataFrame.withColumn}} after a 
> {{repartition}} produces "misaligned" data, meaning different column values 
> in the same row aren't matched, as if a zip shuffled the collections before 
> zipping them. It's difficult to reproduce because it's nondeterministic, 
> doesn't occur in local mode, and requires ≥2 workers (≥3 in one case). I was 
> able to repro it using pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), 
> and 1.5.0 (bin-without-hadoop).
> Here's the most similar issue I was able to find. It appears to not have been 
> repro'd and then closed optimistically, and it smells like it could have been 
> the same underlying cause that was never fixed:
> - https://issues.apache.org/jira/browse/SPARK-9131
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Brief repro
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> [r for r in df.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> [Row(a=39, b=639), Row(a=139, b=739), Row(a=239, b=839)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> []
> [Row(a=641, b=41), Row(a=741, b=141), Row(a=841, b=241)]
> [Row(a=641, b=1343), Row(a=741, b=1443), Row(a=841, b=1543)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> {code}
> Fail: RDD.zip after DataFrame.repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}
> h3. Detailed repro
> Start `pyspark` and run these imports:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StructType, StructField
> {code}
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Ok: withColumn(udf) after DataFrame.repartition(100) after 1 starting 
> partition
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: withColumn(udf) after DataFrame.repartition(100) after 100 starting 
> partitions
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=100))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: 

[jira] [Resolved] (SPARK-10685) Misaligned data with RDD.zip and DataFrame.withColumn after repartition

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10685.

   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8835
[https://github.com/apache/spark/pull/8835]

> Misaligned data with RDD.zip and DataFrame.withColumn after repartition
> ---
>
> Key: SPARK-10685
> URL: https://issues.apache.org/jira/browse/SPARK-10685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> Here's a weird behavior where {{RDD.zip}} or {{DataFrame.withColumn}} after a 
> {{repartition}} produces "misaligned" data, meaning different column values 
> in the same row aren't matched, as if a zip shuffled the collections before 
> zipping them. It's difficult to reproduce because it's nondeterministic, 
> doesn't occur in local mode, and requires ≥2 workers (≥3 in one case). I was 
> able to repro it using pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), 
> and 1.5.0 (bin-without-hadoop).
> Here's the most similar issue I was able to find. It appears to not have been 
> repro'd and then closed optimistically, and it smells like it could have been 
> the same underlying cause that was never fixed:
> - https://issues.apache.org/jira/browse/SPARK-9131
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Brief repro
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> [r for r in df.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> [Row(a=39, b=639), Row(a=139, b=739), Row(a=239, b=839)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> []
> [Row(a=641, b=41), Row(a=741, b=141), Row(a=841, b=241)]
> [Row(a=641, b=1343), Row(a=741, b=1443), Row(a=841, b=1543)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> {code}
> Fail: RDD.zip after DataFrame.repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}
> h3. Detailed repro
> Start `pyspark` and run these imports:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StructType, StructField
> {code}
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Ok: withColumn(udf) after DataFrame.repartition(100) after 1 starting 
> partition
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: withColumn(udf) after DataFrame.repartition(100) after 100 starting 
> partitions
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=100))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, 

[jira] [Commented] (SPARK-10749) Support multiple roles with Spark Mesos dispatcher

2015-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903474#comment-14903474
 ] 

Apache Spark commented on SPARK-10749:
--

User 'tnachen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8872

> Support multiple roles with Spark Mesos dispatcher
> --
>
> Key: SPARK-10749
> URL: https://issues.apache.org/jira/browse/SPARK-10749
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Although you can currently set the framework role of the Mesos dispatcher, it 
> doesn't correctly use the offers given to it.
> It should follow how the coarse-grained and fine-grained schedulers work and 
> use offers from multiple roles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10749) Support multiple roles with Spark Mesos dispatcher

2015-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10749:


Assignee: Apache Spark

> Support multiple roles with Spark Mesos dispatcher
> --
>
> Key: SPARK-10749
> URL: https://issues.apache.org/jira/browse/SPARK-10749
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Assignee: Apache Spark
>
> Although you can currently set the framework role of the Mesos dispatcher, it 
> doesn't correctly use the offers given to it.
> It should follow how the coarse-grained and fine-grained schedulers work and 
> use offers from multiple roles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10762) GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table

2015-09-22 Thread Glenn Strycker (JIRA)
Glenn Strycker created SPARK-10762:
--

 Summary: GenericRowWithSchema exception in casting ArrayBuffer to 
HashSet in DataFrame to RDD from Hive table
 Key: SPARK-10762
 URL: https://issues.apache.org/jira/browse/SPARK-10762
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Glenn Strycker


I have a Hive table in parquet format that was generated using

{code}
create table myTable (var1 int, var2 string, var3 int, var4 string, var5 
array<struct<...>>) stored as parquet;
{code}

I am able to verify that it was filled -- here is a sample value

{code}
[1, "abcdef", 2, "ghijkl", ArrayBuffer([1, "hello"])]
{code}

I wish to put this into a Spark RDD of the form

{code}
((1,"abcdef"), ((2,"ghijkl"), Set((1,"hello"
{code}

Now, using spark-shell (I get the same problem in spark-submit), I made a test 
RDD with these values

{code}
scala> val tempRDD = sc.parallelize(Seq(((1,"abcdef"),((2,"ghijkl"), 
ArrayBuffer[(Int,String)]((1,"hello"))))))
tempRDD: org.apache.spark.rdd.RDD[((Int, String), ((Int, String), 
scala.collection.mutable.ArrayBuffer[(Int, String)]))] = 
ParallelCollectionRDD[44] at parallelize at <console>:85
{code}

using an iterator, I can cast the ArrayBuffer as a HashSet in the following new 
RDD:

{code}
scala> val tempRDD2 = tempRDD.map(a => (a._1, (a._2._1, { var tempHashSet = new 
HashSet[(Int,String)]; a._2._2.foreach(a => tempHashSet = tempHashSet ++ 
HashSet(a)); tempHashSet } )))
tempRDD2: org.apache.spark.rdd.RDD[((Int, String), ((Int, String), 
scala.collection.immutable.HashSet[(Int, String)]))] = MapPartitionsRDD[46] at 
map at <console>:87

scala> tempRDD2.collect.foreach(println)
((1,abcdef),((2,ghijkl),Set((1,hello))))
{code}

But when I attempt to do the EXACT SAME THING with a DataFrame with a 
HiveContext / SQLContext, I get the following error:

{code}
scala> val hc = new HiveContext(sc)
scala> import hc._
scala> import hc.implicits._

scala> val tempHiveQL = hc.sql("""select var1, var2, var3, var4, var5 from 
myTable""")

scala> val tempRDDfromHive = tempHiveQL.map(a => ((a(0).toString.toInt, 
a(1).toString), ((a(2).toString.toInt, a(3).toString), 
a(4).asInstanceOf[ArrayBuffer[(Int,String)]] )))

scala> val tempRDD3 = tempRDDfromHive.map(a => (a._1, (a._2._1, { var 
tempHashSet = new HashSet[(Int,String)]; a._2._2.foreach(a => tempHashSet = 
tempHashSet ++ HashSet(a)); tempHashSet } )))
tempRDD3: org.apache.spark.rdd.RDD[((Int, String), ((Int, String), 
scala.collection.immutable.HashSet[(Int, String)]))] = MapPartitionsRDD[47] at 
map at <console>:91

scala> tempRDD3.collect.foreach(println)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 
(TID 5211, localhost): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to scala.Tuple2
   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$anonfun$apply$1.apply(:91)
   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:91)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
   at 
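
// Editorial sketch (hedged), not from the reporter: Spark SQL returns a Hive
// array<struct<...>> column as a Seq of Row objects rather than Scala tuples,
// which is why the asInstanceOf[ArrayBuffer[(Int,String)]] cast above fails with
// GenericRowWithSchema. Reading the struct fields through the Row API avoids the
// cast entirely (names follow the example above; getAs/getInt/getString are
// standard Row accessors).
import org.apache.spark.sql.Row
import scala.collection.immutable.HashSet

val tempRDDfromHive2 = hc.sql("select var1, var2, var3, var4, var5 from myTable").map { a =>
  val pairs = a.getAs[Seq[Row]](4).map(r => (r.getInt(0), r.getString(1)))   // struct -> tuple
  ((a.getInt(0), a.getString(1)), ((a.getInt(2), a.getString(3)), HashSet(pairs: _*)))
}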

[jira] [Created] (SPARK-10763) Update Java MLLIB/ML tests to use simplified dataframe construction

2015-09-22 Thread holdenk (JIRA)
holdenk created SPARK-10763:
---

 Summary: Update Java MLLIB/ML tests to use simplified dataframe 
construction
 Key: SPARK-10763
 URL: https://issues.apache.org/jira/browse/SPARK-10763
 Project: Spark
  Issue Type: Test
  Components: ML, MLlib
Reporter: holdenk
Priority: Minor


As introduced in https://issues.apache.org/jira/browse/SPARK-10630 we now have 
an easier way to create dataframes from local Java lists. Let's update the tests 
to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2737) ClassCastExceptions when collect()ing JavaRDDs' underlying Scala RDDs

2015-09-22 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903542#comment-14903542
 ] 

Glenn Strycker commented on SPARK-2737:
---

I am getting a similar error in Spark 1.3.0... see a new ticket I created:  
https://issues.apache.org/jira/browse/SPARK-10762

> ClassCastExceptions when collect()ing JavaRDDs' underlying Scala RDDs
> -
>
> Key: SPARK-2737
> URL: https://issues.apache.org/jira/browse/SPARK-2737
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 0.8.0, 0.9.0, 1.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.1.0
>
>
> The Java API's use of fake ClassTags doesn't seem to cause any problems for 
> Java users, but it can lead to issues when passing JavaRDDs' underlying RDDs 
> to Scala code (e.g. in the MLlib Java API wrapper code).  If we call 
> {{collect()}} on a Scala RDD with an incorrect ClassTag, this causes 
> ClassCastExceptions when we try to allocate an array of the wrong type (for 
> example, see SPARK-2197).
> There are a few possible fixes here.  An API-breaking fix would be to 
> completely remove the fake ClassTags and require Java API users to pass 
> {{java.lang.Class}} instances to all {{parallelize()}} calls and add 
> {{returnClass}} fields to all {{Function}} implementations.  This would be 
> extremely verbose.
> Instead, I propose that we add internal APIs to "repair" a Scala RDD with an 
> incorrect ClassTag by wrapping it and overriding its ClassTag.  This should 
> be okay for cases where the Scala code that calls {{collect()}} knows what 
> type of array should be allocated, which is the case in the MLlib wrappers.
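
(Editorial sketch of the wrapping idea, hedged: this is not Spark's internal API, and the class and type names below are made up for illustration. A one-to-one wrapper RDD can simply delegate computation to its parent while carrying the correct ClassTag, so that {{collect()}} allocates an array of the right element type.)

{code}
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hedged sketch: delegate everything to the parent RDD; the only thing that
// changes is the ClassTag the wrapper carries.
class RetaggedRDD[T](prev: RDD[T])(implicit ct: ClassTag[T]) extends RDD[T](prev) {
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split, context)
  override protected def getPartitions: Array[Partition] = prev.partitions
}

// Usage: new RetaggedRDD[MyType](rddBuiltWithAFakeTag) picks up the real
// ClassTag[MyType] implicitly at the call site (MyType is a placeholder).
{code}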



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10409) Multilayer perceptron regression

2015-09-22 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903480#comment-14903480
 ] 

Xiangrui Meng commented on SPARK-10409:
---

[~lmoos] This is a major feature. Could you work on SPARK-10759 and SPARK-10333 
first? 

> Multilayer perceptron regression
> 
>
> Key: SPARK-10409
> URL: https://issues.apache.org/jira/browse/SPARK-10409
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Implement regression based on multilayer perceptron (MLP). It should support 
> different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. 
> The implementation might take advantage of autoencoder. Time-series 
> forecasting for financial data might be one of the use cases, see 
> http://dl.acm.org/citation.cfm?id=561452. So there is the need for more 
> specific requirements from this (or other) area.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10607) Scheduler should include defensive measures against infinite loops due to task commit denial

2015-09-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903476#comment-14903476
 ] 

Josh Rosen commented on SPARK-10607:


Retargeting; this enhancement doesn't need to be targeted at a specific point 
release.

> Scheduler should include defensive measures against infinite loops due to 
> task commit denial
> 
>
> Key: SPARK-10607
> URL: https://issues.apache.org/jira/browse/SPARK-10607
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Josh Rosen
>Priority: Minor
>
> If OutputCommitter.commitTask() repeatedly fails due to the 
> OutputCommitCoordinator denying the right to commit, then the scheduler may 
> get stuck in an infinite task retry loop. The reason for this behavior is 
> that DAGScheduler treats failures due to CommitDenied separately from other 
> failures: they don't count towards the usual maximum number of task failures 
> that can trigger a job failure. The correct fix is to add an upper bound on 
> the number of times that a commit can be denied, as a last-ditch safety net 
> to avoid infinite loop behavior.
> See SPARK-10381 for additional context. This is not a high priority issue to 
> fix right now, since the fix in SPARK-10381 should prevent this scenario from 
> happening in the first place. However, another layer of conservative 
> defensive limits / timeouts certainly would not hurt.
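
(Editorial sketch of the proposed safety net, hedged: this is not DAGScheduler code, and {{maxCommitDenials}} is a made-up limit rather than an existing Spark setting. The idea is simply to stop treating a commit denial as an infinitely retryable failure once a per-task cap is reached.)

{code}
import scala.collection.mutable

val maxCommitDenials = 10                                       // hypothetical cap
val denials = mutable.Map.empty[Long, Int].withDefaultValue(0)  // taskId -> denial count

/** Returns true when the task has been denied too often and should fail the job. */
def onCommitDenied(taskId: Long): Boolean = {
  denials(taskId) += 1
  denials(taskId) > maxCommitDenials
}
{code}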



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10607) Scheduler should include defensive measures against infinite loops due to task commit denial

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10607:
---
Target Version/s:   (was: 1.3.2, 1.4.2, 1.5.1)

> Scheduler should include defensive measures against infinite loops due to 
> task commit denial
> 
>
> Key: SPARK-10607
> URL: https://issues.apache.org/jira/browse/SPARK-10607
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Josh Rosen
>Priority: Minor
>
> If OutputCommitter.commitTask() repeatedly fails due to the 
> OutputCommitCoordinator denying the right to commit, then the scheduler may 
> get stuck in an infinite task retry loop. The reason for this behavior is 
> that DAGScheduler treats failures due to CommitDenied separately from other 
> failures: they don't count towards the usual maximum number of task failures 
> that can trigger a job failure. The correct fix is to add an upper bound on 
> the number of times that a commit can be denied, as a last-ditch safety net 
> to avoid infinite loop behavior.
> See SPARK-10381 for additional context. This is not a high priority issue to 
> fix right now, since the fix in SPARK-10381 should prevent this scenario from 
> happening in the first place. However, another layer of conservative 
> defensive limits / timeouts certainly would not hurt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10333) Add user guide for linear-methods.md columns

2015-09-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10333:
--
Assignee: Lauren Moos

> Add user guide for linear-methods.md columns
> 
>
> Key: SPARK-10333
> URL: https://issues.apache.org/jira/browse/SPARK-10333
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Lauren Moos
>Priority: Minor
>
> Add example code to document input output columns based on 
> https://github.com/apache/spark/pull/8491 feedback



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10759) Missing Python code example in ML Programming guide

2015-09-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10759:
--
Assignee: Lauren Moos

> Missing Python code example in ML Programming guide
> ---
>
> Key: SPARK-10759
> URL: https://issues.apache.org/jira/browse/SPARK-10759
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Raela Wang
>Assignee: Lauren Moos
>Priority: Minor
>
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation
> http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10409) Multilayer perceptron regression

2015-09-22 Thread Lauren Moos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903553#comment-14903553
 ] 

Lauren Moos commented on SPARK-10409:
-

no problem!

> Multilayer perceptron regression
> 
>
> Key: SPARK-10409
> URL: https://issues.apache.org/jira/browse/SPARK-10409
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Implement regression based on multilayer perceptron (MLP). It should support 
> different kinds of outputs: binary, real in [0;1) and real in [-inf; +inf]. 
> The implementation might take advantage of autoencoder. Time-series 
> forecasting for financial data might be one of the use cases, see 
> http://dl.acm.org/citation.cfm?id=561452. So there is the need for more 
> specific requirements from this (or other) area.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10714) Refactor PythonRDD to decouple iterator computation from PythonRDD

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10714.

   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8835
[https://github.com/apache/spark/pull/8835]

> Refactor PythonRDD to decouple iterator computation from PythonRDD
> --
>
> Key: SPARK-10714
> URL: https://issues.apache.org/jira/browse/SPARK-10714
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0, 1.5.1
>
>
> The idea is that most of the logic of calling Python actually has nothing to 
> do with RDD (it is really just communicating with a socket -- there is 
> nothing distributed about it); it currently depends on RDD only because it 
> was written this way.
> If we extract that functionality out, we can apply it to areas of the code 
> that don't depend on RDDs, and also make it easier to test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-8632.
---
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8835
[https://github.com/apache/spark/pull/8835]

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 
> 500-column table, and I wanted to use a Python UDF for one column, and it 
> ended up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


