[jira] [Comment Edited] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-02-02 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128832#comment-15128832
 ] 

Andrew Davidson edited comment on SPARK-13065 at 2/2/16 7:20 PM:
-

Hi Sachin

I attached my Java implementation for this enhancement as a reference. I also 
changed the description above and added the code I use in my streaming Spark 
app's main().

I chose a bad name for the attachment; it's not in patch format.

Kind regards

Andy


was (Author: aedwip):
sorry bad name. its not in patch format

> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
> Attachments: twitterFilterQueryPatch.tar.gz
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The Twitter stream API is very powerful and provides a lot of support for 
> twitter.com-side filtering of status objects. Whenever possible we want to 
> let Twitter do as much work as possible for us.
> Currently the Spark Twitter API only allows you to configure a small subset 
> of the possible filters:
> String[] filters = {"tag1", "tag2"};
> JavaDStream tweets = TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implementation does:
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
> . . .
>   val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
>   } else {
> newTwitterStream.sample()
>   }
> ...
> Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we 
> should be able to pass a FilterQuery object.
> Looks like an easy fix. See the source code links below.
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> 2/2/16
> Attached is my Java implementation for this problem. Feel free to reuse it 
> however you like. In my streaming Spark app main() I have the following code:
>FilterQuery query = config.getFilterQuery().fetch();
> if (query != null) {
> // TODO https://issues.apache.org/jira/browse/SPARK-13065
> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, 
> query);
> } /*else 
> spark native api
> String[] filters = {"tag1", "tag2"};
> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> 
> see 
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> 
> causes
>  val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } */
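To make the motivation concrete, here is a minimal, hedged sketch of a richer twitter4j.FilterQuery than the String-array filters the current API accepts. It is not the attached implementation; the ids and coordinates are made up, and the TwitterFilterQueryUtils.createStream call mentioned in the description is the proposed addition, not an existing Spark API.

{code}
// Sketch only: a twitter4j.FilterQuery carries far more server-side filtering than
// the Seq[String] the current TwitterUtils.createStream accepts. Values are illustrative.
import twitter4j.FilterQuery

val query: FilterQuery = new FilterQuery()
  .track("tag1", "tag2")                     // keyword tracking (what String[] filters covers today)
  .follow(1234567890L, 9876543210L)          // hypothetical user ids: server-side follow filtering
  .locations(Array(Array(-122.75, 36.8), Array(-121.75, 37.8)))  // bounding-box filtering

// With the proposed overload the driver would hand the whole query to the receiver, e.g.
// val tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query)
{code}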



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13094) No encoder implicits for Seq[Primitive]

2016-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13094:
-
Assignee: Michael Armbrust

> No encoder implicits for Seq[Primitive]
> ---
>
> Key: SPARK-13094
> URL: https://issues.apache.org/jira/browse/SPARK-13094
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Deenar Toraskar
>Assignee: Michael Armbrust
> Fix For: 1.6.1, 2.0.0
>
>
> Dataset aggregators with complex types fail with "Unable to find encoder for 
> type stored in a Dataset", even though Datasets with these complex types are 
> supported.
> {code}
> val arraySum = new Aggregator[Seq[Float], Seq[Float],
>   Seq[Float]] with Serializable {
>   def zero: Seq[Float] = Nil
>   // The initial value.
>   def reduce(currentSum: Seq[Float], currentRow: Seq[Float]) =
> sumArray(currentSum, currentRow)
>   def merge(sum: Seq[Float], row: Seq[Float]) = sumArray(sum, row)
>   def finish(b: Seq[Float]) = b // Return the final result.
>   def sumArray(a: Seq[Float], b: Seq[Float]): Seq[Float] = {
> (a, b) match {
>   case (Nil, Nil) => Nil
>   case (Nil, row) => row
>   case (sum, Nil) => sum
>   case (sum, row) => (a, b).zipped.map { case (a, b) => a + b }
> }
>   }
> }.toColumn
> {code}
> {code}
> :47: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing sqlContext.implicits._  Support for serializing other 
> types will be added in future releases.
>}.toColumn
> {code}
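For context, a minimal usage sketch of the aggregator above. It assumes an implicit Encoder[Seq[Float]] is in scope, which is exactly what this issue asks the SQL implicits to provide, and it assumes an existing sqlContext.

{code}
// Usage sketch under the assumption that an Encoder[Seq[Float]] is available.
import sqlContext.implicits._

val ds = sqlContext.createDataset(Seq(Seq(1.0f, 2.0f), Seq(3.0f, 4.0f)))
// arraySum is the TypedColumn defined above; selecting it computes the element-wise
// sum across all rows of the Dataset.
val summed = ds.select(arraySum).collect()   // expected: Array(Seq(4.0f, 6.0f))
{code}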



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12783) Dataset map serialization error

2016-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12783.
--
   Resolution: Fixed
Fix Version/s: 1.6.1

Closing, please reopen if you can reproduce in 1.6.1-RC1.

> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 1.6.1
>
> Attachments: MyMap.scala
>
>
> When the Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
> MyMap(Map(a->b))
>   }
>   def toStr: String = {
> a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: java.io.NotSerializableException: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
> package lang)
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, 
> java.lang.type)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
> type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
> type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
>   - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
> name: function, type: interface scala.Function1)
>   - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
> mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
>   - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
> targetObject, type: class 
> org.apache.spark.sql.catalyst.expressions.Expression)
>   - object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)))
>   - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@4c7e3aab)
>   - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.$colon$colon, 
> List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)), 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;
>   - field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> name: arguments, type: interface scala.collection.Seq)
>   - object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> staticinvoke(class 
> org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface 
> scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> 

[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots

2016-02-02 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128745#comment-15128745
 ] 

Wenchen Fan commented on SPARK-12988:
-

I'd also like to forbid using invalid column names in `drop`.

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).
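Until drop treats the name literally, a hedged workaround sketch is to rebuild the projection from the schema, backticking each remaining column so the dot is not parsed as struct-field access:

{code}
import org.apache.spark.sql.functions.col

val df = Seq((1, 1)).toDF("a_b", "a.c")
// Keep every column except "a.c"; escape names with backticks so dots are taken literally.
val kept = df.columns.filter(_ != "a.c").map(name => col(s"`$name`"))
val dropped = df.select(kept: _*)   // schema: [a_b: int]
{code}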



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12631) Make Parameter Descriptions Consistent for PySpark MLlib Clustering

2016-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-12631.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10610
[https://github.com/apache/spark/pull/10610]

> Make Parameter Descriptions Consistent for PySpark MLlib Clustering
> ---
>
> Key: SPARK-12631
> URL: https://issues.apache.org/jira/browse/SPARK-12631
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
> Fix For: 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> clustering.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13120) Shade protobuf-java

2016-02-02 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128759#comment-15128759
 ] 

Ted Yu commented on SPARK-13120:


https://groups.google.com/forum/#!topic/protobuf/wAqvtPLBsE8

PB2 and PB3 are wire-compatible, but protobuf-java itself is not, so the 
dependency will be a problem.
Shading protobuf-java would provide a better experience for downstream projects.

> Shade protobuf-java
> ---
>
> Key: SPARK-13120
> URL: https://issues.apache.org/jira/browse/SPARK-13120
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Ted Yu
>
> See this thread for background information:
> http://search-hadoop.com/m/q3RTtdkUFK11xQhP1/Spark+not+able+to+fetch+events+from+Amazon+Kinesis
> This issue shades com.google.protobuf:protobuf-java as 
> org.spark-project.protobuf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots

2016-02-02 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128829#comment-15128829
 ] 

Yan commented on SPARK-12988:
-

My thinking is that projections should parse the column names, while the 
schema-based ops should keep the names as-is. One thing I'm not sure about is 
"Column". Given its current capabilities, it seems it is for projections, so its 
name should be backticked if it contains a '.'. But please correct me if I'm 
wrong here.

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-02-02 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-13065:

Description: 
The Twitter stream API is very powerful and provides a lot of support for 
twitter.com-side filtering of status objects. Whenever possible we want to let 
Twitter do as much work as possible for us.

Currently the Spark Twitter API only allows you to configure a small subset of 
the possible filters:

String[] filters = {"tag1", "tag2"};
JavaDStream tweets = TwitterUtils.createStream(ssc, twitterAuth, 
filters);

The current implementation does:

private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
  ) extends Receiver[Status](storageLevel) with Logging {

. . .


  val query = new FilterQuery
  if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
  } else {
newTwitterStream.sample()
  }

...

Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we 
should be able to pass a FilterQuery object.

Looks like an easy fix. See the source code links below.

kind regards

Andy

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89


2/2/16
Attached is my Java implementation for this problem. Feel free to reuse it 
however you like. In my streaming Spark app main() I have the following code:

   FilterQuery query = config.getFilterQuery().fetch();
if (query != null) {
// TODO https://issues.apache.org/jira/browse/SPARK-13065
tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, 
query);
} /*else 
spark native api
String[] filters = {"tag1", "tag2"};
tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);

see 
https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89

causes
 val query = new FilterQuery
  if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
} */

  was:
The Twitter stream API is very powerful and provides a lot of support for 
twitter.com-side filtering of status objects. Whenever possible we want to let 
Twitter do as much work as possible for us.

Currently the Spark Twitter API only allows you to configure a small subset of 
the possible filters:

String[] filters = {"tag1", "tag2"};
JavaDStream tweets = TwitterUtils.createStream(ssc, twitterAuth, 
filters);

The current implementation does:

private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
  ) extends Receiver[Status](storageLevel) with Logging {

. . .


  val query = new FilterQuery
  if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
  } else {
newTwitterStream.sample()
  }

...

Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we 
should be able to pass a FilterQuery object.

Looks like an easy fix. See the source code links below.

kind regards

Andy

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89


> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The Twitter stream API is very powerful and provides a lot of support for 
> twitter.com-side filtering of status objects. Whenever possible we want to 
> let Twitter do as much work as possible for us.
> Currently the Spark Twitter API only allows you to configure a small subset 
> of the possible filters:
> String[] filters = {"tag1", "tag2"};
> JavaDStream tweets = TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implementation does:
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> 

[jira] [Resolved] (SPARK-12711) ML StopWordsRemover does not protect itself from column name duplication

2016-02-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-12711.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10741
[https://github.com/apache/spark/pull/10741]

> ML StopWordsRemover does not protect itself from column name duplication
> 
>
> Key: SPARK-12711
> URL: https://issues.apache.org/jira/browse/SPARK-12711
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Trivial
>  Labels: ml, mllib, newbie, suggestion
> Fix For: 2.0.0, 1.6.1
>
>
> At work we were 'taking a closer look' at ML transformers and I 
> spotted that anomaly.
> On first look, the resolution looks simple:
> Add to StopWordsRemover.transformSchema the following line (as is done in e.g. 
> PCA.transformSchema, StandardScaler.transformSchema, 
> OneHotEncoder.transformSchema):
> {code}
> require(!schema.fieldNames.contains($(outputCol)), s"Output column 
> ${$(outputCol)} already exists.")
> {code}
> Am I correct? Is that a bug? If yes, I am willing to prepare an 
> appropriate pull request.
> Maybe a better idea is to make use of super.transformSchema in 
> StopWordsRemover (and possibly in all other places)?
> Links to the files on GitHub mentioned above:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala#L147
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L109-L111
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L101-L102
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L138-L139
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L75-L76
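A standalone, hedged sketch of the duplicate-output-column check the reporter proposes, expressed against a plain StructType rather than inside StopWordsRemover; the helper and column names are illustrative.

{code}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

def appendOutputCol(schema: StructType, outputCol: String): StructType = {
  // Fail fast if the output column already exists, as PCA/StandardScaler/OneHotEncoder do.
  require(!schema.fieldNames.contains(outputCol),
    s"Output column $outputCol already exists.")
  StructType(schema.fields :+ StructField(outputCol, ArrayType(StringType), nullable = true))
}

val in = StructType(Seq(StructField("filtered", ArrayType(StringType), nullable = true)))
appendOutputCol(in, "cleaned")    // ok: appends the new column
appendOutputCol(in, "filtered")   // throws: Output column filtered already exists.
{code}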



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory

2016-02-02 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128935#comment-15128935
 ] 

Mridul Muralidharan commented on SPARK-11293:
-

Not iterating to the end has a bunch of issues IIRC, including what you 
mention above. For example, memory-mapped buffers are not released, etc.
Unfortunately, I don't think there is a general clean solution for it. It would 
be good to see what alternatives exist to resolve this.
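One of the mechanisms the issue description below mentions is a CompletionIterator. A minimal standalone sketch of that idea follows; it is an illustration, not Spark's internal org.apache.spark.util.CompletionIterator.

{code}
// Wrap an iterator so a cleanup callback (e.g. sorter.stop()) runs exactly once
// when the underlying iterator is exhausted.
def completionIterator[A](sub: Iterator[A])(completion: => Unit): Iterator[A] = new Iterator[A] {
  private var completed = false
  def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) { completed = true; completion }
    more
  }
  def next(): A = sub.next()
}

// Hypothetical usage: release the sorter's shuffle memory as soon as its records are consumed.
// val records = completionIterator(sorter.iterator) { sorter.stop() }
{code}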

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13094) No encoder implicits for Seq[Primitive]

2016-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13094.
--
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 11014
[https://github.com/apache/spark/pull/11014]

> No encoder implicits for Seq[Primitive]
> ---
>
> Key: SPARK-13094
> URL: https://issues.apache.org/jira/browse/SPARK-13094
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Deenar Toraskar
> Fix For: 2.0.0, 1.6.1
>
>
> Dataset aggregators with complex types fail with "Unable to find encoder for 
> type stored in a Dataset", even though Datasets with these complex types are 
> supported.
> {code}
> val arraySum = new Aggregator[Seq[Float], Seq[Float],
>   Seq[Float]] with Serializable {
>   def zero: Seq[Float] = Nil
>   // The initial value.
>   def reduce(currentSum: Seq[Float], currentRow: Seq[Float]) =
> sumArray(currentSum, currentRow)
>   def merge(sum: Seq[Float], row: Seq[Float]) = sumArray(sum, row)
>   def finish(b: Seq[Float]) = b // Return the final result.
>   def sumArray(a: Seq[Float], b: Seq[Float]): Seq[Float] = {
> (a, b) match {
>   case (Nil, Nil) => Nil
>   case (Nil, row) => row
>   case (sum, Nil) => sum
>   case (sum, row) => (a, b).zipped.map { case (a, b) => a + b }
> }
>   }
> }.toColumn
> {code}
> {code}
> :47: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing sqlContext.implicits._  Support for serializing other 
> types will be added in future releases.
>}.toColumn
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12780) Inconsistency returning value of ML python models' properties

2016-02-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12780:
--
Fix Version/s: 1.6.1

> Inconsistency returning value of ML python models' properties
> -
>
> Key: SPARK-12780
> URL: https://issues.apache.org/jira/browse/SPARK-12780
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> In spark/python/pyspark/ml/feature.py, StringIndexerModel has a property 
> method named labels, which is different from the properties in other models.
> In StringIndexerModel:
> {code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
> @property
> @since("1.5.0")
> def labels(self):
> """
> Ordered list of labels, corresponding to indices to be assigned.
> """
> return self._java_obj.labels
> {code}
> In CounterVectorizerModel (as an example):
> {code:title=CounterVectorizerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
> @property
> @since("1.6.0")
> def vocabulary(self):
> """
> An array of terms in the vocabulary.
> """
> return self._call_java("vocabulary")
> {code}
> In StringIndexerModel, the returned value of labels is not an array of labels 
> as expected; instead it is a py4j JavaMember.
> What's more, Pickle on the Python side cannot deserialize a Scala Array 
> normally. According to my experiments, it translates Array[String] into a 
> tuple and Array[Int] into array.array. This may cause errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13002) Mesos scheduler backend does not follow the property spark.dynamicAllocation.initialExecutors

2016-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13002:
-
Target Version/s: 2.0.0  (was: 1.6.1, 2.0.0)

> Mesos scheduler backend does not follow the property 
> spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-13002
> URL: https://issues.apache.org/jira/browse/SPARK-13002
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Luc Bourlier
>  Labels: dynamic_allocation, mesos
>
> When starting a Spark job on a Mesos cluster, all available cores are 
> reserved (up to {{spark.cores.max}}), creating one executor per Mesos node, 
> and as many executors as needed.
> This is the case even when dynamic allocation is enabled.
> When dynamic allocation is enabled, the number of executor launched at 
> startup should be limited to the value of 
> {{spark.dynamicAllocation.initialExecutors}}.
> The Mesos scheduler backend already follows the value computed by the 
> {{ExecutorAllocationManager}} for the number of executors that should be up 
> and running, except at startup, when it just creates all the executors it can.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12913) Reimplement stat functions as declarative function

2016-02-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12913.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10960
[https://github.com/apache/spark/pull/10960]

> Reimplement stat functions as declarative function
> --
>
> Key: SPARK-12913
> URL: https://issues.apache.org/jira/browse/SPARK-12913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> As benchmarked and discussed here: 
> https://github.com/apache/spark/pull/10786/files#r50038294.
> Benefiting from codegen, a declarative aggregate function can be much 
> faster than an imperative one, so we should re-implement all the built-in 
> aggregate functions as declarative ones.
> For skewness and kurtosis, we need to benchmark to make sure that the 
> declarative version is actually faster than the imperative one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13114) java.lang.NegativeArraySizeException in CSV

2016-02-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13114.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.0.0

> java.lang.NegativeArraySizeException in CSV
> ---
>
> Key: SPARK-13114
> URL: https://issues.apache.org/jira/browse/SPARK-13114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 2.0.0
>
>
> It could be that token.length > schemaFields.length
> {code}
> java.lang.NegativeArraySizeException
> at 
> com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:171)
> at 
> com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:162)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
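A hedged sketch of the suspected failure mode and a defensive fix: if a row yields more tokens than the schema has fields, an array sized from the (negative) difference blows up, so a parser can pad or truncate instead. The names are illustrative, not the spark-csv code.

{code}
def alignTokens(tokens: Array[String], schemaSize: Int): Array[String] = {
  if (tokens.length == schemaSize) tokens
  else if (tokens.length < schemaSize)
    tokens ++ Array.fill(schemaSize - tokens.length)(null: String)   // pad short rows with nulls
  else
    tokens.take(schemaSize)   // truncate long rows instead of allocating a negative-sized array
}
{code}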



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-02-02 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-13065:

Attachment: twitterFilterQueryPatch.tar.gz

Sorry, bad name; it's not in patch format.

> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
> Attachments: twitterFilterQueryPatch.tar.gz
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The Twitter stream API is very powerful and provides a lot of support for 
> twitter.com-side filtering of status objects. Whenever possible we want to 
> let Twitter do as much work as possible for us.
> Currently the Spark Twitter API only allows you to configure a small subset 
> of the possible filters:
> String[] filters = {"tag1", "tag2"};
> JavaDStream tweets = TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implementation does:
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
> . . .
>   val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
>   } else {
> newTwitterStream.sample()
>   }
> ...
> Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we 
> should be able to pass a FilterQuery object.
> Looks like an easy fix. See the source code links below.
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> 2/2/16
> Attached is my Java implementation for this problem. Feel free to reuse it 
> however you like. In my streaming Spark app main() I have the following code:
>FilterQuery query = config.getFilterQuery().fetch();
> if (query != null) {
> // TODO https://issues.apache.org/jira/browse/SPARK-13065
> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, 
> query);
> } /*else 
> spark native api
> String[] filters = {"tag1", "tag2"};
> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> 
> see 
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> 
> causes
>  val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12711) ML StopWordsRemover does not protect itself from column name duplication

2016-02-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12711:
--
Assignee: Grzegorz Chilkiewicz

> ML StopWordsRemover does not protect itself from column name duplication
> 
>
> Key: SPARK-12711
> URL: https://issues.apache.org/jira/browse/SPARK-12711
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Assignee: Grzegorz Chilkiewicz
>Priority: Trivial
>  Labels: ml, mllib, newbie, suggestion
> Fix For: 1.6.1, 2.0.0
>
>
> At work we were 'taking a closer look' at ML transformers and I 
> spotted that anomaly.
> On first look, the resolution looks simple:
> Add to StopWordsRemover.transformSchema the following line (as is done in e.g. 
> PCA.transformSchema, StandardScaler.transformSchema, 
> OneHotEncoder.transformSchema):
> {code}
> require(!schema.fieldNames.contains($(outputCol)), s"Output column 
> ${$(outputCol)} already exists.")
> {code}
> Am I correct? Is that a bug? If yes, I am willing to prepare an 
> appropriate pull request.
> Maybe a better idea is to make use of super.transformSchema in 
> StopWordsRemover (and possibly in all other places)?
> Links to the files on GitHub mentioned above:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala#L147
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L109-L111
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L101-L102
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L138-L139
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L75-L76



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13121) java mapWithState mishandles scala Option

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128871#comment-15128871
 ] 

Apache Spark commented on SPARK-13121:
--

User 'gabrielenizzoli' has created a pull request for this issue:
https://github.com/apache/spark/pull/11028

> java mapWithState mishandles scala Option
> -
>
> Key: SPARK-13121
> URL: https://issues.apache.org/jira/browse/SPARK-13121
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Streaming
>Affects Versions: 1.6.0
>Reporter: Gabriele Nizzoli
>Priority: Critical
> Fix For: 1.6.1
>
>
> In Spark Streaming, the Java mapWithState that uses Function3 has a bug in the 
> conversion from a Scala Option to a Java Optional. In the conversion, the 
> code in `StateSpec.scala`, line 222, is
> `Optional.fromNullable(v.get)`. This fails if `v`, an `Option`, is `None`; it 
> is better to use `JavaUtils.optionToOptional(v)` instead.
> A workaround is to use the Function4 call to mapWithState. This call has the 
> right conversion.
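A hedged sketch of the fix's intent: convert scala.Option to a Guava Optional without calling .get on a None. This mirrors what JavaUtils.optionToOptional does, but it is only an illustration, not Spark's internal code.

{code}
import com.google.common.base.Optional

def toOptional[T](option: Option[T]): Optional[T] = option match {
  case Some(value) => Optional.of(value)
  case None        => Optional.absent()   // fromNullable(v.get) would throw here instead
}

toOptional(None: Option[String])   // Optional.absent()
toOptional(Some("state"))          // Optional.of("state")
{code}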



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12988) Can't drop columns that contain dots

2016-02-02 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128902#comment-15128902
 ] 

Dilip Biswal edited comment on SPARK-12988 at 2/2/16 7:56 PM:
--

The subtle difference between column path and column name may not be very 
obvious to a common user of this API. 

val df = Seq((1, 1)).toDF("a_b", "a.b")
df.select("`a.b`")
df.drop("`a.b`") => the fact that one cannot use a backtick here, would it be 
that obvious to the user?

I believe that was the motivation to allow it, but I am not sure of its 
implications.


was (Author: dkbiswal):
The shuttle difference between column path and column name may not be very 
obvious to a common user of this API. 

val df = Seq((1, 1)).toDF("a_b", "a.b")
df.select("`a.b`")
df.drop("`a.b`") => the fact that one can not use back tick here , would it be 
that obvious to the user ?

I believe that was the motivation to allow it but then i am not sure of its 
implications.

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13009) spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json

2016-02-02 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128843#comment-15128843
 ] 

Andrew Davidson commented on SPARK-13009:
-

Hi Sean

I totally agree with you. The Twitter4J people asked me to file an RFE with 
Spark. I agree it is their problem. I am just looking for some sort of 
workaround. My downstream systems will not be able to process the data I am 
capturing.

I guess in the short term I will create the wrapper object and modify the Spark 
Twitter source code.

kind regards

Andy
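A hedged sketch of the short-term wrapper described above: keep the raw JSON alongside the Status via twitter4j's TwitterObjectFactory.getRawJSON, which only returns non-null when jsonStoreEnabled=true is set in the twitter4j configuration. The RawStatus class and the store callback are illustrative, not part of spark-streaming-twitter.

{code}
import twitter4j.{Status, StatusAdapter, TwitterObjectFactory}

case class RawStatus(status: Status, rawJson: String) extends Serializable

class RawJsonListener(store: RawStatus => Unit) extends StatusAdapter {
  override def onStatus(status: Status): Unit = {
    // The raw JSON must be read on the thread that delivered the status, before the next one arrives.
    val json = TwitterObjectFactory.getRawJSON(status)
    store(RawStatus(status, json))
  }
}
{code}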

> spark-streaming-twitter_2.10 does not make it possible to access the raw 
> twitter json
> -
>
> Key: SPARK-13009
> URL: https://issues.apache.org/jira/browse/SPARK-13009
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Andrew Davidson
>Priority: Minor
>
> The streaming-twitter package makes it easy for Java programmers to work with 
> Twitter. The implementation returns the raw Twitter data, originally JSON, as a 
> twitter4j StatusJSONImpl object:
> JavaDStream tweets = TwitterUtils.createStream(ssc, twitterAuth);
> The Status class is different from the raw JSON, i.e. serializing the Status 
> object will not be the same as the original JSON. I have downstream systems 
> that can only process raw tweets, not Twitter4J Status objects. 
> Here is my bug/RFE request made to Twitter4J. 
> They asked that I create a Spark tracking issue.
> On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
> Hi All
> Quick problem summary:
> My system uses the Status objects to do some analysis; however, I need to 
> store the raw JSON. There are other systems that process that data that are 
> not written in Java.
> Currently we are serializing the Status object. That JSON is going to break 
> downstream systems.
> I am using the Apache Spark Streaming spark-streaming-twitter_2.10  
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
> Request For Enhancement:
> I imagine easy access to the raw JSON is a common requirement. Would it be 
> possible to add a member function getRawJson() to StatusJSONImpl? By default 
> the returned value would be null unless jsonStoreEnabled=True is set in the 
> config.
> Alternative implementations:
>  
> It should be possible to modify spark-streaming-twitter_2.10 to provide 
> this support. The solution is not very clean:
> It would require Apache Spark to define its own Status POJO. The current 
> StatusJSONImpl class is marked final.
> The wrapper is not going to work nicely with existing code.
> spark-streaming-twitter_2.10 does not expose all of the Twitter streaming 
> API, so many developers are writing their own implementations of 
> org.apache.spark.streaming.twitter.TwitterInputDStream. This makes maintenance 
> difficult. It's not easy to know when the Spark implementation for Twitter has 
> changed. 
> Code listing for 
> spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
>   @volatile private var twitterStream: TwitterStream = _
>   @volatile private var stopped = false
>   def onStart() {
> try {
>   val newTwitterStream = new 
> TwitterStreamFactory().getInstance(twitterAuth)
>   newTwitterStream.addListener(new StatusListener {
> def onStatus(status: Status): Unit = {
>   store(status)
> }
> Ref: 
> https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html
> What do people think?
> Kind regards
> Andy
> From:  on behalf of Igor Brigadir 
> 
> Reply-To: 
> Date: Tuesday, January 19, 2016 at 5:55 AM
> To: Twitter4J 
> Subject: Re: [Twitter4J] trouble writing unit test
> The main issue is that the JSON object is in the wrong JSON format.
> e.g. "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44 
> +0000 2015", ...
> It looks like the JSON you have was serialized from a Java Status object, 
> which makes the JSON objects different from what you get from the API. 
> TwitterObjectFactory expects JSON from Twitter (I haven't had any problems 
> using TwitterObjectFactory instead of the deprecated DataObjectFactory).
> You could "fix" it by matching the keys & values you have with the correct, 
> twitter API json - it should look like the example here: 
> https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid
> But it might be easier to download the tweets again, but this 

[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots

2016-02-02 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128902#comment-15128902
 ] 

Dilip Biswal commented on SPARK-12988:
--

The subtle difference between column path and column name may not be very 
obvious to a common user of this API. 

val df = Seq((1, 1)).toDF("a_b", "a.b")
df.select("`a.b`")
df.drop("`a.b`") => the fact that one cannot use a backtick here, would it be 
that obvious to the user?

I believe that was the motivation to allow it, but I am not sure of its 
implications.

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13126) History Server page always has horizontal scrollbar

2016-02-02 Thread Zhuo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128927#comment-15128927
 ] 

Zhuo Liu commented on SPARK-13126:
--

Hi Alex, thanks for testing that out.
I came up with the fix for that. Please feel free to test again.
https://github.com/apache/spark/pull/11029

> History Server page always has horizontal scrollbar
> ---
>
> Key: SPARK-13126
> URL: https://issues.apache.org/jira/browse/SPARK-13126
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Priority: Minor
> Attachments: page_width.png
>
>
> The new History Server page table is always wider than the page, no matter how 
> much larger you make the window. Most likely an odd CSS error; it doesn't seem 
> to be a simple fix when manipulating the CSS using the Web Inspector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13126) History Server page always has horizontal scrollbar

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13126:


Assignee: (was: Apache Spark)

> History Server page always has horizontal scrollbar
> ---
>
> Key: SPARK-13126
> URL: https://issues.apache.org/jira/browse/SPARK-13126
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Priority: Minor
> Attachments: page_width.png
>
>
> The new History Server page table is always wider than the page, no matter how 
> much larger you make the window. Most likely an odd CSS error; it doesn't seem 
> to be a simple fix when manipulating the CSS using the Web Inspector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13126) History Server page always has horizontal scrollbar

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13126:


Assignee: Apache Spark

> History Server page always has horizontal scrollbar
> ---
>
> Key: SPARK-13126
> URL: https://issues.apache.org/jira/browse/SPARK-13126
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Assignee: Apache Spark
>Priority: Minor
> Attachments: page_width.png
>
>
> The new History Server page table is always wider than the page, no matter how 
> much larger you make the window. Most likely an odd CSS error; it doesn't seem 
> to be a simple fix when manipulating the CSS using the Web Inspector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13146) API for managing streaming dataframes

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13146:


Assignee: Tathagata Das  (was: Apache Spark)

> API for managing streaming dataframes
> -
>
> Key: SPARK-13146
> URL: https://issues.apache.org/jira/browse/SPARK-13146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13121) java mapWithState mishandles scala Option

2016-02-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-13121.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> java mapWithState mishandles scala Option
> -
>
> Key: SPARK-13121
> URL: https://issues.apache.org/jira/browse/SPARK-13121
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Streaming
>Affects Versions: 1.6.0
>Reporter: Gabriele Nizzoli
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> In Spark Streaming, the Java mapWithState that uses Function3 has a bug in the 
> conversion from a Scala Option to a Java Optional. In the conversion, the 
> code in `StateSpec.scala`, line 222, is
> `Optional.fromNullable(v.get)`. This fails if `v`, an `Option`, is `None`; it 
> is better to use `JavaUtils.optionToOptional(v)` instead.
> A workaround is to use the Function4 call to mapWithState. This call has the 
> right conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13147) improve readability of generated code

2016-02-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13147:
--

 Summary: improve readability of generated code
 Key: SPARK-13147
 URL: https://issues.apache.org/jira/browse/SPARK-13147
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


1. try to avoid the suffix (unique id)
2. remove multiple empty lines in the code formatter (see the sketch below)
3. remove the comment if there is no code generated.
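A hedged sketch of item 2: collapsing runs of blank lines when formatting generated code. It is an illustration, not the actual code-formatter change.

{code}
// Collapse runs of blank lines in generated code down to a single blank line.
def stripExtraBlankLines(code: String): String =
  code.split("\n", -1).foldLeft(List.empty[String]) {
    // Drop a blank line when the previously kept line is already blank.
    case (acc @ prev :: _, line) if prev.trim.isEmpty && line.trim.isEmpty => acc
    case (acc, line) => line :: acc
  }.reverse.mkString("\n")

stripExtraBlankLines("a\n\n\n\nb")   // returns "a\n\nb"
{code}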




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13146) API for managing streaming dataframes

2016-02-02 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-13146:
-

 Summary: API for managing streaming dataframes
 Key: SPARK-13146
 URL: https://issues.apache.org/jira/browse/SPARK-13146
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Tathagata Das
Assignee: Tathagata Das






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7997) Remove the developer api SparkEnv.actorSystem and AkkaUtils

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129019#comment-15129019
 ] 

Apache Spark commented on SPARK-7997:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11031

> Remove the developer api SparkEnv.actorSystem and AkkaUtils
> ---
>
> Key: SPARK-7997
> URL: https://issues.apache.org/jira/browse/SPARK-7997
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13121) java mapWithState mishandles scala Option

2016-02-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13121:
-
Assignee: Gabriele Nizzoli

> java mapWithState mishandles scala Option
> -
>
> Key: SPARK-13121
> URL: https://issues.apache.org/jira/browse/SPARK-13121
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Streaming
>Affects Versions: 1.6.0
>Reporter: Gabriele Nizzoli
>Assignee: Gabriele Nizzoli
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> In Spark Streaming, the Java mapWithState that uses Function3 has a bug in the 
> conversion from a Scala Option to a Java Optional. In the conversion, the 
> code in `StateSpec.scala`, line 222, is
> `Optional.fromNullable(v.get)`. This fails if `v`, an `Option`, is `None`; it 
> is better to use `JavaUtils.optionToOptional(v)` instead.
> A workaround is to use the Function4 call to mapWithState. This call has the 
> right conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13147) improve readability of generated code

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13147:


Assignee: Davies Liu  (was: Apache Spark)

> improve readability of generated code
> -
>
> Key: SPARK-13147
> URL: https://issues.apache.org/jira/browse/SPARK-13147
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> 1. try to avoid the suffix (unique id)
> 2. remove multiple empty lines in the code formatter
> 3. remove the comment if there is no code generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13147) improve readability of generated code

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129108#comment-15129108
 ] 

Apache Spark commented on SPARK-13147:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11032

> improve readability of generated code
> -
>
> Key: SPARK-13147
> URL: https://issues.apache.org/jira/browse/SPARK-13147
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> 1. try to avoid the suffix (unique id)
> 2. remove multiple empty lines in the code formatter
> 3. remove the comment if there is no code generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13147) improve readability of generated code

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13147:


Assignee: Apache Spark  (was: Davies Liu)

> improve readability of generated code
> -
>
> Key: SPARK-13147
> URL: https://issues.apache.org/jira/browse/SPARK-13147
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> 1. try to avoid the suffix (unique id)
> 2. remove multiple empty lines in the code formatter
> 3. remove the comment if there is no code generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13148) support zero-keytab Oozie application launch on a secure cluster

2016-02-02 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-13148:
--

 Summary: support zero-keytab Oozie application launch on a secure 
cluster 
 Key: SPARK-13148
 URL: https://issues.apache.org/jira/browse/SPARK-13148
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Affects Versions: 1.6.0
 Environment: YARN cluster with Kerberos enabled, launched from Oozie 
—where Oozie passes down the delegation tokens
Reporter: Steve Loughran


Oozie can launch Spark instances on insecure clusters, and on a secure cluster 
if Oozie is set up to provide a keytab.

What it cannot currently do is launch a Spark application on a YARN cluster 
without a keytab. In this situation, Oozie collects the delegation tokens it is 
set up to collect (as a superuser in the cluster), saves them to a file, then 
points to the file in the `HADOOP_TOKEN_FILE_LOCATION` environment variable.

These tokens need to be used to launch the application, rather than trying to 
acquire new ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13148) support zero-keytab Oozie application launch on a secure cluster

2016-02-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129136#comment-15129136
 ] 

Steve Loughran commented on SPARK-13148:


Note that Hadoop's UGI class automatically loads the file referenced off 
{{$HADOOP_TOKEN_FILE_LOCATION}} when it inits; this is the mechanism used to 
get tokens in the YARN AM.

Client-side, they become the tokens of the current user. All that is needed is 
for the Yarn client to recognise that the situation has occurred (i.e. the env 
variable is set), add all those credentials to the AM's launch context —and 
skip trying to acquire tokens for filesystems, HBase and Hive.
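
A minimal client-side sketch of that check (editorial, not Spark's actual YARN client 
code; the HADOOP_TOKEN_FILE_LOCATION variable and the Hadoop classes are real, the 
surrounding method is hypothetical):

{code}
import java.nio.ByteBuffer
import org.apache.hadoop.io.DataOutputBuffer
import org.apache.hadoop.security.UserGroupInformation

// If Oozie handed us a token file, UGI has already loaded it on init, so the
// current user's credentials contain the delegation tokens. Serialize them for
// the AM's launch context and skip keytab-based token acquisition.
def oozieProvidedTokens(): Option[ByteBuffer] =
  Option(System.getenv("HADOOP_TOKEN_FILE_LOCATION")).map { _ =>
    val credentials = UserGroupInformation.getCurrentUser.getCredentials
    val buffer = new DataOutputBuffer()
    credentials.writeTokenStorageToStream(buffer)
    ByteBuffer.wrap(buffer.getData, 0, buffer.getLength)
  }
{code}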

> support zero-keytab Oozie application launch on a secure cluster 
> -
>
> Key: SPARK-13148
> URL: https://issues.apache.org/jira/browse/SPARK-13148
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 1.6.0
> Environment: YARN cluster with Kerberos enabled, launched from Oozie 
> —where Oozie passes down the delegation tokens
>Reporter: Steve Loughran
>
> Oozie can launch Spark instances on insecure clusters, and on a secure 
> cluster if Oozie is set up to provide a keytab.
> What it cannot currently do is launch a Spark application on a YARN cluster 
> without a keytab. In this situation, Oozie collects the delegation tokens it 
> is set up to collect (as a superuser in the cluster), saves them to a file, 
> then points to the file in the `HADOOP_TOKEN_FILE_LOCATION` environment 
> variable.
> These tokens need to be used to launch the application, rather than trying to 
> acquire new ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13148) support zero-keytab Oozie application launch on a secure cluster

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129176#comment-15129176
 ] 

Apache Spark commented on SPARK-13148:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/11033

> support zero-keytab Oozie application launch on a secure cluster 
> -
>
> Key: SPARK-13148
> URL: https://issues.apache.org/jira/browse/SPARK-13148
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 1.6.0
> Environment: YARN cluster with Kerberos enabled, launched from Oozie 
> —where Oozie passes down the delegation tokens
>Reporter: Steve Loughran
>
> Oozie can launch Spark instances on insecure clusters, and on a secure 
> cluster if Oozie is set up to provide a keytab.
> What it cannot currently do is launch a Spark application on a YARN cluster 
> without a keytab. In this situation, Oozie collects the delegation tokens it 
> is set up to collect (as a superuser in the cluster), saves them to a file, 
> then points to the file in the `HADOOP_TOKEN_FILE_LOCATION` environment 
> variable.
> These tokens need to be used to launch the application, rather than trying to 
> acquire new ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13148) support zero-keytab Oozie application launch on a secure cluster

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13148:


Assignee: Apache Spark

> support zero-keytab Oozie application launch on a secure cluster 
> -
>
> Key: SPARK-13148
> URL: https://issues.apache.org/jira/browse/SPARK-13148
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 1.6.0
> Environment: YARN cluster with Kerberos enabled, launched from Oozie 
> —where Oozie passes down the delegation tokens
>Reporter: Steve Loughran
>Assignee: Apache Spark
>
> Oozie can launch Spark instances on insecure clusters, and on a secure 
> cluster if Oozie is set up to provide a keytab.
> What it cannot currently do is launch a Spark application on a YARN cluster 
> without a keytab. In this situation, Oozie collects the delegation tokens it 
> is set up to collect (as a superuser in the cluster), saves them to a file, 
> then points to the file in the `HADOOP_TOKEN_FILE_LOCATION` environment 
> variable.
> These tokens need to be used to launch the application, rather than trying to 
> acquire new ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13148) support zero-keytab Oozie application launch on a secure cluster

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13148:


Assignee: (was: Apache Spark)

> support zero-keytab Oozie application launch on a secure cluster 
> -
>
> Key: SPARK-13148
> URL: https://issues.apache.org/jira/browse/SPARK-13148
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 1.6.0
> Environment: YARN cluster with Kerberos enabled, launched from Oozie 
> —where Oozie passes down the delegation tokens
>Reporter: Steve Loughran
>
> Oozie can launch Spark instances on insecure clusters, and on a secure 
> cluster if Oozie is set up to provide a keytab.
> What it cannot currently do is launch a Spark application on a YARN cluster 
> without a keytab. In this situation, Oozie collects the delegation tokens it 
> is set up to collect (as a superuser in the cluster), saves them to a file, 
> then points to the file in the `HADOOP_TOKEN_FILE_LOCATION` environment 
> variable.
> These tokens need to be used to launch the application, rather than trying to 
> acquire new ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13146) API for managing streaming dataframes

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13146:


Assignee: Apache Spark  (was: Tathagata Das)

> API for managing streaming dataframes
> -
>
> Key: SPARK-13146
> URL: https://issues.apache.org/jira/browse/SPARK-13146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13146) API for managing streaming dataframes

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129010#comment-15129010
 ] 

Apache Spark commented on SPARK-13146:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/11030

> API for managing streaming dataframes
> -
>
> Key: SPARK-13146
> URL: https://issues.apache.org/jira/browse/SPARK-13146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13149) Add FileStreamSource and a simple version of FileStreamSink

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13149:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Add FileStreamSource and a simple version of FileStreamSink
> ---
>
> Key: SPARK-13149
> URL: https://issues.apache.org/jira/browse/SPARK-13149
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13149) Add FileStreamSource and a simple version of FileStreamSink

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13149:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Add FileStreamSource and a simple version of FileStreamSink
> ---
>
> Key: SPARK-13149
> URL: https://issues.apache.org/jira/browse/SPARK-13149
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13150:
--

 Summary: Flaky test: 
org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session
 Key: SPARK-13150
 URL: https://issues.apache.org/jira/browse/SPARK-13150
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Davies Liu
Assignee: Cheng Lian


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13101:


Assignee: Apache Spark

> Dataset complex types mapping to DataFrame  (element nullability) mismatch
> --
>
> Key: SPARK-13101
> URL: https://issues.apache.org/jira/browse/SPARK-13101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Deenar Toraskar
>Assignee: Apache Spark
>Priority: Blocker
>
> There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
> default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with 
> nullable element
> {noformat}
>  |-- valuations: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {noformat}
> This could be read back as a Dataset in Spark 1.6.0
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> {code}
> But with Spark 1.6.1 the same fails with
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
> array)' due to data type mismatch: cannot cast 
> ArrayType(DoubleType,true) to ArrayType(DoubleType,false);
> {code}
> Here are the classes I am using
> {code}
> case class Valuation(tradeId : String,
>  counterparty: String,
>  nettingAgreement: String,
>  wrongWay: Boolean,
>  valuations : Seq[Double], /* one per scenario */
>  timeInterval: Int,
>  jobId: String)  /* used for hdfs partitioning */
> val vals : Seq[Valuation] = Seq()
> val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
> valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")
> {code}
> even the following gives the same result
> {code}
> val valsDF = vals.toDS.toDF
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129347#comment-15129347
 ] 

Apache Spark commented on SPARK-13101:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11035

> Dataset complex types mapping to DataFrame  (element nullability) mismatch
> --
>
> Key: SPARK-13101
> URL: https://issues.apache.org/jira/browse/SPARK-13101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Deenar Toraskar
>Priority: Blocker
>
> There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
> default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with 
> nullable element
> {noformat}
>  |-- valuations: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {noformat}
> This could be read back as a Dataset in Spark 1.6.0
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> {code}
> But with Spark 1.6.1 the same fails with
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
> array)' due to data type mismatch: cannot cast 
> ArrayType(DoubleType,true) to ArrayType(DoubleType,false);
> {code}
> Here are the classes I am using
> {code}
> case class Valuation(tradeId : String,
>  counterparty: String,
>  nettingAgreement: String,
>  wrongWay: Boolean,
>  valuations : Seq[Double], /* one per scenario */
>  timeInterval: Int,
>  jobId: String)  /* used for hdfs partitioning */
> val vals : Seq[Valuation] = Seq()
> val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
> valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")
> {code}
> even the following gives the same result
> {code}
> val valsDF = vals.toDS.toDF
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13101:


Assignee: (was: Apache Spark)

> Dataset complex types mapping to DataFrame  (element nullability) mismatch
> --
>
> Key: SPARK-13101
> URL: https://issues.apache.org/jira/browse/SPARK-13101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Deenar Toraskar
>Priority: Blocker
>
> There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
> default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with 
> nullable element
> {noformat}
>  |-- valuations: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {noformat}
> This could be read back as a Dataset in Spark 1.6.0
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> {code}
> But with Spark 1.6.1 the same fails with
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
> array)' due to data type mismatch: cannot cast 
> ArrayType(DoubleType,true) to ArrayType(DoubleType,false);
> {code}
> Here are the classes I am using
> {code}
> case class Valuation(tradeId : String,
>  counterparty: String,
>  nettingAgreement: String,
>  wrongWay: Boolean,
>  valuations : Seq[Double], /* one per scenario */
>  timeInterval: Int,
>  jobId: String)  /* used for hdfs partitioning */
> val vals : Seq[Valuation] = Seq()
> val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
> valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")
> {code}
> even the following gives the same result
> {code}
> val valsDF = vals.toDS.toDF
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129239#comment-15129239
 ] 

Davies Liu commented on SPARK-13150:


This one usually fails together with: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/HiveThriftBinaryServerSuite/SPARK_11595_ADD_JAR_with_input_path_having_URL_scheme/

> Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test 
> single session
> -
>
> Key: SPARK-13150
> URL: https://issues.apache.org/jira/browse/SPARK-13150
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13149) Add FileStreamSource and a simple version of FileStreamSink

2016-02-02 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-13149:


 Summary: Add FileStreamSource and a simple version of 
FileStreamSink
 Key: SPARK-13149
 URL: https://issues.apache.org/jira/browse/SPARK-13149
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13122) Race condition in MemoryStore.unrollSafely() causes memory leak

2016-02-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13122:
--
Assignee: Adam Budde

> Race condition in MemoryStore.unrollSafely() causes memory leak
> ---
>
> Key: SPARK-13122
> URL: https://issues.apache.org/jira/browse/SPARK-13122
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.0
>Reporter: Adam Budde
>Assignee: Adam Budde
>
> The 
> [unrollSafely()|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L249]
>  method in MemoryStore will progressively unroll the contents of a block 
> iterator into memory. It works by reserving an initial chunk of unroll memory 
> and periodically checking if more memory must be reserved as it unrolls the 
> iterator. The memory reserved for performing the unroll is considered 
> "pending" memory and is tracked on a per-task attempt ID bases in a map 
> object named pendingUnrollMemoryMap. When the unrolled block is committed to 
> storage memory in the 
> [tryToPut()|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L362]
>  method, a method named 
> [releasePendingUnrollMemoryForThisTask()|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L521]
>  is invoked and this pending memory is released. tryToPut() then proceeds to 
> allocate the storage memory required for the block.
> The unrollSafely() method computes the amount of pending memory used for the 
> unroll operation by saving the amount of unroll memory reserved for the 
> particular task attempt ID at the start of the method in a variable named 
> previousMemoryReserved and subtracting this value from the unroll memory 
> dedicated to the task at the end of the method. This value is stored as the 
> variable amountToTransferToPending. This amount is then subtracted from the 
> per-task unrollMemoryMap and added to pendingUnrollMemoryMap.
> The amount of unroll memory consumed for the task is obtained from 
> unrollMemoryMap via the currentUnrollMemoryForThisTask method. In order for 
> the semantics of unrollSafely() to work, the value of unrollMemoryMap for the 
> task returned by 
> [currentTaskAttemptId()|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L475]
>  must not be mutated between the computation of previousMemoryReserved and 
> amountToTransferToPending. However, since there is no synchronization in 
> place to ensure that computing both variables and updating the memory maps 
> happens atomically, a race condition can occur when multiple threads for 
> which currentTaskAttemptId() returns the same value are both trying to store 
> blocks. This can lead to a negative value being computed for 
> amountToTransferToPending, corrupting the unrollMemoryMap and 
> pendingUnrollMemoryMap memory maps which in turn can lead to the memory 
> manager leaking unroll memory.
> For example, lets consider how the state of the unrollMemoryMap and 
> pendingUnrollMemoryMap variables might be affected if two threads returning 
> the same value for currentTaskAttemptId() both execute unrollSafely() 
> concurrently:
> ||Thread 1||Thread 2||unrollMemoryMap||pendingUnrollMemoryMap||
> |Enter unrollSafely()|-|0|0|
> |previousMemoryReserved = 0|-|0|0|
> |(perform unroll)|-|2097152 (2 MiB)|0|
> |-|Enter unrollSafely()|2097152 (2 MiB)|0| 
> |-|previousMemoryReserved = 2097152|2097152 (2 MiB)|0|
> |-|(perform unroll)|3145728 (3 MiB)|0|
> |Enter finally { }|-|3145728 (3 MiB)|0| 
> |amtToTransfer =  3145728|-|3145728 (3 MiB)|0|
> |Update memory maps|-|0|3145728 (3 MiB)|
> |Return|Enter finally { }|0|3145728 (3 MiB)|
> |-|amtToTransfer = -2097152|0|3145728 (3 MiB)|
> |-|Update memory maps|-2097152 (2 MiB)|1048576 (1 MiB)|
> In this example, we end up leaking 2 MiB of unroll memory since both Thread 1 
> and Thread 2 think that the task has only 1 MiB of unroll memory allocated to 
> it when it actually has 3 MiB. The negative value stored in unrollMemoryMap 
> will also propagate to future invocations of unrollSafely().
> In our particular case, this behavior manifests since the 
> currentTaskAttemptId() method is returning -1 for each Spark receiver task. 
> This in and of itself could be a bug and is something I'm going to look into. 
> We noticed that blocks would start to spill over to disk when more than 
> enough storage memory was available, so we inserted log statements into 
> MemoryManager's acquireUnrollMemory() and releaseUnrollMemory() in order to 
> collect the number of unroll bytes acquired and released. When we plot the 
> output, it is apparent that unroll memory is 

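As an editorial illustration of the interleaving in the table above (a deliberately 
simplified model, not the real MemoryStore; the maps and the method stand in for 
unrollMemoryMap, pendingUnrollMemoryMap and unrollSafely()):

{code}
import scala.collection.mutable

// Both maps are keyed by task attempt id; -1L mirrors what receiver tasks report.
val unrollMemoryMap        = mutable.Map(-1L -> 0L)
val pendingUnrollMemoryMap = mutable.Map(-1L -> 0L)

def unrollSafely(taskAttemptId: Long, bytesUnrolled: Long): Unit = {
  // Snapshot taken without a lock: stale if another thread with the same id interleaves.
  val previousMemoryReserved = unrollMemoryMap(taskAttemptId)
  unrollMemoryMap(taskAttemptId) += bytesUnrolled   // reserve unroll memory
  // ... a second thread sharing the id may run its own transfer here ...
  val amountToTransferToPending = unrollMemoryMap(taskAttemptId) - previousMemoryReserved
  unrollMemoryMap(taskAttemptId) -= amountToTransferToPending
  pendingUnrollMemoryMap(taskAttemptId) += amountToTransferToPending
}
{code}

Calling unrollSafely concurrently from two threads that share the attempt id reproduces 
the corruption: one thread's transfer is computed against a snapshot the other thread 
has already invalidated.
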
[jira] [Updated] (SPARK-13151) Investigate replacing SynchronizedBuffer as it is deprecated/unreliable

2016-02-02 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-13151:

Component/s: Spark Core

> Investigate replacing SynchronizedBuffer as it is deprecated/unreliable
> ---
>
> Key: SPARK-13151
> URL: https://issues.apache.org/jira/browse/SPARK-13151
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: holdenk
>Priority: Trivial
>
> Building with Scala 2.11 produces the warning "trait SynchronizedBuffer in 
> package mutable is deprecated: Synchronization via traits is deprecated as it 
> is inherently unreliable. Consider 
> java.util.concurrent.ConcurrentLinkedQueue as an alternative" - we should 
> investigate whether this is a reasonable suggestion.
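
A minimal sketch of the suggested replacement (editorial; the buffer is a stand-in for 
the test helpers that currently mix in SynchronizedBuffer):

{code}
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._

// Deprecated pattern that triggers the warning under Scala 2.11:
//   val results = new ArrayBuffer[Int] with SynchronizedBuffer[Int]
//   results += 1

// Suggested alternative: a lock-free queue that is safe to append to from
// multiple threads and can be read back as a Scala collection.
val results = new ConcurrentLinkedQueue[Int]()
results.add(1)
val snapshot: Seq[Int] = results.asScala.toSeq
{code}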



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reopened SPARK-13150:


not fixed yet


> Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test 
> single session
> -
>
> Key: SPARK-13150
> URL: https://issues.apache.org/jira/browse/SPARK-13150
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129556#comment-15129556
 ] 

Cheng Lian commented on SPARK-13150:


It seems that the ADD JAR command in both flaky tests may fail silently, causing the 
subsequent CREATE TEMPORARY FUNCTION command to fail. Still investigating.

> Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test 
> single session
> -
>
> Key: SPARK-13150
> URL: https://issues.apache.org/jira/browse/SPARK-13150
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12991) Establish correspondence between SparkPlan and LogicalPlan nodes

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12991:


Assignee: Apache Spark

> Establish correspondence between SparkPlan and LogicalPlan nodes
> 
>
> Key: SPARK-12991
> URL: https://issues.apache.org/jira/browse/SPARK-12991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Mikhail Bautin
>Assignee: Apache Spark
>
> In order to reuse RDDs across Spark SQL queries (SPARK-11838), we need to 
> know which {{LogicalPlan}} a {{SparkPlan}} node corresponds to. Unfortunately, 
> once a {{SparkPlan}} gets built, it is difficult to go back to 
> {{LogicalPlan}} nodes. Ideally, there would be an optional field of the type 
> {{LogicalPlan}} in {{SparkPlan}} that would get populated as {{SparkPlan}} 
> gets built.
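
A minimal sketch of the proposed shape (editorial; the trait names are stand-ins, not 
the actual Catalyst classes):

{code}
trait LogicalPlan
trait SparkPlan {
  // Populated by the planner as the physical node is built, when known.
  var logicalLink: Option[LogicalPlan] = None
}

def planned(logical: LogicalPlan, physical: SparkPlan): SparkPlan = {
  physical.logicalLink = Some(logical)   // remembered for later fragment matching
  physical
}
{code}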



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12991) Establish correspondence between SparkPlan and LogicalPlan nodes

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12991:


Assignee: (was: Apache Spark)

> Establish correspondence between SparkPlan and LogicalPlan nodes
> 
>
> Key: SPARK-12991
> URL: https://issues.apache.org/jira/browse/SPARK-12991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Mikhail Bautin
>
> In order to reuse RDDs across Spark SQL queries (SPARK-11838), we need to 
> know which {{LogicalPlan}} a {{SparkPlan}} node corresponds to. Unfortunately, 
> once a {{SparkPlan}} gets built, it is difficult to go back to 
> {{LogicalPlan}} nodes. Ideally, there would be an optional field of the type 
> {{LogicalPlan}} in {{SparkPlan}} that would get populated as {{SparkPlan}} 
> gets built.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12991) Establish correspondence between SparkPlan and LogicalPlan nodes

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129429#comment-15129429
 ] 

Apache Spark commented on SPARK-12991:
--

User 'mbautin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11036

> Establish correspondence between SparkPlan and LogicalPlan nodes
> 
>
> Key: SPARK-12991
> URL: https://issues.apache.org/jira/browse/SPARK-12991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Mikhail Bautin
>
> In order to reuse RDDs across Spark SQL queries (SPARK-11838), we need to 
> know which {{LogicalPlan}} a {{SparkPlan}} node corresponds to. Unfortunately, 
> once a {{SparkPlan}} gets built, it is difficult to go back to 
> {{LogicalPlan}} nodes. Ideally, there would be an optional field of the type 
> {{LogicalPlan}} in {{SparkPlan}} that would get populated as {{SparkPlan}} 
> gets built.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13020) fix random generator for map type

2016-02-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-13020.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10930
[https://github.com/apache/spark/pull/10930]

> fix random generator for map type
> -
>
> Key: SPARK-13020
> URL: https://issues.apache.org/jira/browse/SPARK-13020
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11838) Spark SQL query fragment RDD reuse

2016-02-02 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129534#comment-15129534
 ] 

Wenchen Fan commented on SPARK-11838:
-

Can the new `StreamFrame` satisfy this requirement (avoiding re-computation for 
slowly changing tables)? cc [~zsxwing]

> Spark SQL query fragment RDD reuse
> --
>
> Key: SPARK-11838
> URL: https://issues.apache.org/jira/browse/SPARK-11838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Mikhail Bautin
>
> With many analytical Spark SQL workloads against slowly changing tables, 
> successive queries frequently share fragments that produce the same result. 
> Instead of re-computing those fragments for every query, it makes sense to 
> detect similar fragments and substitute RDDs previously created for matching 
> SparkPlan fragments into every new SparkPlan at the execution time whenever 
> possible. Even if no RDDs are persist()-ed to memory/disk/off-heap memory, 
> many stages can still be skipped due to map output files being present on 
> executor nodes.
> The implementation involves the following steps:
> (1) Logical plan "canonicalization". 
> Logical plans mapping to the same "canonical" logical plan should always 
> produce the same results (except for possible output column reordering), 
> although the inverse statement won't always be true. 
>   - Re-mapping expression ids to "canonical expression ids" (successively 
> increasing numbers always starting with 1).
>   - Eliminating alias names that are unimportant after analysis completion. 
> Only the names that are necessary to determine the Hive table columns to be 
> scanned are retained.
>   - Reordering columns in projections, grouping/aggregation expressions, etc. 
> This can be done e.g. by using the string representation as a sort key. Union 
> inputs always have to be reordered the same way.
>   - Tree traversal has to happen starting from leaves and progressing towards 
> the root, because we need to already have identified canonical expression ids 
> for children of a node before we can come up with sort keys that would allow 
> to reorder expressions in a node deterministically. This is a bit more 
> complicated for Union nodes.
>   - Special handling for MetastoreRelations. We replace MetastoreRelation 
> with a special class CanonicalMetastoreRelation that uses attributes and 
> partitionKeys as part of its equals() and hashCode() implementation, but the 
> visible attributes and aprtitionKeys are restricted to expression ids that 
> the rest of the query actually needs from that MetastoreRelation.
> An example of logical plans and corresponding canonical logical plans: 
> https://gist.githubusercontent.com/mbautin/ef1317b341211d9606cf/raw
> (2) Tracking LogicalPlan fragments corresponding to SparkPlan fragments. When 
> generating a SparkPlan, we keep an optional reference to a LogicalPlan 
> instance in every node. This allows us to populate the cache with mappings 
> from canonical logical plans of query fragments to the corresponding RDDs 
> generated as part of query execution. Note that there is no new work 
> necessary to generate the RDDs, we are merely utilizing the RDDs that would 
> have been produced as part of SparkPlan execution anyway.
> (3) SparkPlan fragment substitution. After generating a SparkPlan and before 
> calling prepare() or execute() on it, we check if any of its nodes have an 
> associated LogicalPlan that maps to a canonical logical plan matching a cache 
> entry. If so, we substitute a PhysicalRDD (or a new class UnsafePhysicalRDD 
> wrapping an RDD of UnsafeRow) scanning the previously created RDD instead of 
> the current query fragment. If the expected column order differs from what 
> the current SparkPlan fragment produces, we add a projection to reorder the 
> columns. We also add safe/unsafe row conversions as necessary to match the 
> row type that is expected by the parent of the current SparkPlan fragment.
> (4) The execute() method of SparkPlan also needs to perform the cache lookup 
> and substitution described above before producing a new RDD for the current 
> SparkPlan node. The "loading cache" pattern (e.g. as implemented in Guava) 
> allows to reuse query fragments between simultaneously submitted queries: 
> whichever query runs execute() for a particular fragment's canonical logical 
> plan starts producing an RDD first, and if another query has a fragment with 
> the same canonical logical plan, it waits for the RDD to be produced by the 
> first query and inserts it in its SparkPlan instead.
> This kind of query fragment caching will mostly be useful for slowly-changing 
> or static tables. Even with slowly-changing tables, the cache needs to be 
> invalidated when those data set changes take place. One of the following 
> 
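As an editorial sketch of the expression-id re-mapping in step (1) above (the helper 
is hypothetical, not part of the proposed patch):

{code}
import scala.collection.mutable

// Assign canonical ids 1, 2, 3, ... in first-seen order, so two plans that differ
// only in their original expression ids map to the same canonical form.
def canonicalIds(exprIds: Seq[Long]): Map[Long, Long] = {
  val mapping = mutable.LinkedHashMap.empty[Long, Long]
  exprIds.foreach(id => mapping.getOrElseUpdate(id, mapping.size + 1L))
  mapping.toMap
}

// canonicalIds(Seq(42L, 7L, 42L, 99L)) == Map(42L -> 1L, 7L -> 2L, 99L -> 3L)
{code}
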

[jira] [Assigned] (SPARK-13124) Adding JQuery DataTables messed up the Web UI css and js

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13124:


Assignee: (was: Apache Spark)

> Adding JQuery DataTables messed up the Web UI css and js
> 
>
> Key: SPARK-13124
> URL: https://issues.apache.org/jira/browse/SPARK-13124
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
> Attachments: css_issue.png, js_issue.png
>
>
> With the addition of JQuery DataTables in SPARK-10873 all the old tables are 
> using the new DataTables css instead of the old css. Though we most likely 
> want to switch over completely to DataTables eventually, we should still keep 
> the old tables UI.
> Also, when you open up Web Inspector, all pages in the WebUI throw a 
> jsonFormatter.min.js.map not found error. This file was not included in the 
> update and seems to be required to use Web Inspector on the new js file 
> (Error doesn't affect actual use)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13124) Adding JQuery DataTables messed up the Web UI css and js

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13124:


Assignee: Apache Spark

> Adding JQuery DataTables messed up the Web UI css and js
> 
>
> Key: SPARK-13124
> URL: https://issues.apache.org/jira/browse/SPARK-13124
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Assignee: Apache Spark
> Attachments: css_issue.png, js_issue.png
>
>
> With the addition of JQuery DataTables in SPARK-10873 all the old tables are 
> using the new DataTables css instead of the old css. Though we most likely 
> want to switch over completely to DataTables eventually, we should still keep 
> the old tables UI.
> Also, when you open up Web Inspector, all pages in the WebUI throw a 
> jsonFormatter.min.js.map not found error. This file was not included in the 
> update and seems to be required to use Web Inspector on the new js file 
> (Error doesn't affect actual use)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13124) Adding JQuery DataTables messed up the Web UI css and js

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129533#comment-15129533
 ] 

Apache Spark commented on SPARK-13124:
--

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/11038

> Adding JQuery DataTables messed up the Web UI css and js
> 
>
> Key: SPARK-13124
> URL: https://issues.apache.org/jira/browse/SPARK-13124
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
> Attachments: css_issue.png, js_issue.png
>
>
> With the addition of JQuery DataTables in SPARK-10873 all the old tables are 
> using the new DataTables css instead of the old css. Though we most likely 
> want to switch over completely to DataTables eventually, we should still keep 
> the old tables UI.
> Also, when you open up Web Inspector, all pages in the WebUI throw a 
> jsonFormatter.min.js.map not found error. This file was not included in the 
> update and seems to be required to use Web Inspector on the new js file 
> (Error doesn't affect actual use)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12986) Fix pydoc warnings in mllib/regression.py

2016-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12986:
--
Assignee: Nam Pham  (was: Yu Ishikawa)

> Fix pydoc warnings in mllib/regression.py
> -
>
> Key: SPARK-12986
> URL: https://issues.apache.org/jira/browse/SPARK-12986
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Nam Pham
>Priority: Minor
>
> Got those warnings by running "make html" under "python/docs/":
> {code}
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LinearRegressionWithSGD:3: ERROR: Unexpected 
> indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LinearRegressionWithSGD:4: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.RidgeRegressionWithSGD:3: ERROR: Unexpected 
> indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.RidgeRegressionWithSGD:4: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LassoWithSGD:3: ERROR: Unexpected indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LassoWithSGD:4: WARNING: Block quote ends without a 
> blank line; unexpected unindent.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.IsotonicRegression:7: ERROR: Unexpected indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.IsotonicRegression:12: ERROR: Unexpected indentation.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12986) Fix pydoc warnings in mllib/regression.py

2016-02-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129541#comment-15129541
 ] 

Xiangrui Meng commented on SPARK-12986:
---

[~holdenk] I think it is useful to check the pydoc warnings during builds. ScalaDoc 
is quite noisy, but Python warnings usually point out real problems. 
Could you make a JIRA and ping Josh there?

> Fix pydoc warnings in mllib/regression.py
> -
>
> Key: SPARK-12986
> URL: https://issues.apache.org/jira/browse/SPARK-12986
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Nam Pham
>Priority: Minor
>
> Got those warnings by running "make html" under "python/docs/":
> {code}
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LinearRegressionWithSGD:3: ERROR: Unexpected 
> indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LinearRegressionWithSGD:4: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.RidgeRegressionWithSGD:3: ERROR: Unexpected 
> indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.RidgeRegressionWithSGD:4: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LassoWithSGD:3: ERROR: Unexpected indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LassoWithSGD:4: WARNING: Block quote ends without a 
> blank line; unexpected unindent.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.IsotonicRegression:7: ERROR: Unexpected indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.IsotonicRegression:12: ERROR: Unexpected indentation.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3611) Show number of cores for each executor in application web UI

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3611:
---

Assignee: Apache Spark

> Show number of cores for each executor in application web UI
> 
>
> Key: SPARK-3611
> URL: https://issues.apache.org/jira/browse/SPARK-3611
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Matei Zaharia
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> This number is not always fully known, because e.g. in Mesos your executors 
> can scale up and down in # of CPUs, but it would be nice to show at least the 
> number of cores the machine has in that case, or the # of cores the executor 
> has been configured with if known.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3611) Show number of cores for each executor in application web UI

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129542#comment-15129542
 ] 

Apache Spark commented on SPARK-3611:
-

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/11039

> Show number of cores for each executor in application web UI
> 
>
> Key: SPARK-3611
> URL: https://issues.apache.org/jira/browse/SPARK-3611
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Matei Zaharia
>Priority: Minor
>  Labels: starter
>
> This number is not always fully known, because e.g. in Mesos your executors 
> can scale up and down in # of CPUs, but it would be nice to show at least the 
> number of cores the machine has in that case, or the # of cores the executor 
> has been configured with if known.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3611) Show number of cores for each executor in application web UI

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3611:
---

Assignee: (was: Apache Spark)

> Show number of cores for each executor in application web UI
> 
>
> Key: SPARK-3611
> URL: https://issues.apache.org/jira/browse/SPARK-3611
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Matei Zaharia
>Priority: Minor
>  Labels: starter
>
> This number is not always fully known, because e.g. in Mesos your executors 
> can scale up and down in # of CPUs, but it would be nice to show at least the 
> number of cores the machine has in that case, or the # of cores the executor 
> has been configured with if known.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12291) Support UnsafeRow in BroadcastLeftSemiJoinHash

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129369#comment-15129369
 ] 

Apache Spark commented on SPARK-12291:
--

User 'mbautin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11036

> Support UnsafeRow in BroadcastLeftSemiJoinHash
> --
>
> Key: SPARK-12291
> URL: https://issues.apache.org/jira/browse/SPARK-12291
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129433#comment-15129433
 ] 

Apache Spark commented on SPARK-13150:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11037

> Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test 
> single session
> -
>
> Key: SPARK-13150
> URL: https://issues.apache.org/jira/browse/SPARK-13150
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13150:


Assignee: Apache Spark  (was: Cheng Lian)

> Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test 
> single session
> -
>
> Key: SPARK-13150
> URL: https://issues.apache.org/jira/browse/SPARK-13150
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13150:


Assignee: Cheng Lian  (was: Apache Spark)

> Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test 
> single session
> -
>
> Key: SPARK-13150
> URL: https://issues.apache.org/jira/browse/SPARK-13150
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13020) fix random generator for map type

2016-02-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-13020:
-
Assignee: Wenchen Fan

> fix random generator for map type
> -
>
> Key: SPARK-13020
> URL: https://issues.apache.org/jira/browse/SPARK-13020
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13151) Investigate replacing SynchronizedBuffer as it is deprecated/unreliable

2016-02-02 Thread holdenk (JIRA)
holdenk created SPARK-13151:
---

 Summary: Investigate replacing SynchronizedBuffer as it is 
deprecated/unreliable
 Key: SPARK-13151
 URL: https://issues.apache.org/jira/browse/SPARK-13151
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: holdenk
Priority: Trivial


Building with Scala 2.11 results in the warning "trait SynchronizedBuffer in 
package mutable is deprecated: Synchronization via traits is deprecated as it 
is inherently unreliable. Consider java.util.concurrent.ConcurrentLinkedQueue 
as an alternative" - we should investigate whether this is a reasonable suggestion.
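
For reference, a minimal sketch of the suggested replacement, assuming the buffer is 
only used to collect results from several threads in tests (the class and usage below 
are illustrative and not taken from the Spark test code):

{code}
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._

// Deprecated pattern: new ArrayBuffer[Int] with mutable.SynchronizedBuffer[Int]
// Thread-safe alternative suggested by the compiler warning:
val results = new ConcurrentLinkedQueue[Int]()

val threads = (1 to 4).map { i =>
  new Thread(new Runnable {
    override def run(): Unit = { results.add(i) }  // safe from multiple threads
  })
}
threads.foreach(_.start())
threads.foreach(_.join())

// Drain into an immutable Scala collection for assertions.
val collected = results.asScala.toSeq.sorted  // Seq(1, 2, 3, 4)
{code}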



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12992) Vectorize parquet decoding using ColumnarBatch

2016-02-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12992.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10908
[https://github.com/apache/spark/pull/10908]

> Vectorize parquet decoding using ColumnarBatch
> --
>
> Key: SPARK-12992
> URL: https://issues.apache.org/jira/browse/SPARK-12992
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
> Fix For: 2.0.0
>
>
> Parquet files benefit from vectorized decoding. ColumnarBatches have been 
> designed to support this. This means that a single encoded parquet column is 
> decoded to a single ColumnVector. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13152) Fix task metrics deprecation warning

2016-02-02 Thread holdenk (JIRA)
holdenk created SPARK-13152:
---

 Summary: Fix task metrics deprecation warning
 Key: SPARK-13152
 URL: https://issues.apache.org/jira/browse/SPARK-13152
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: holdenk
Priority: Minor


Right now incBytesRead and incRecordsRead are marked as deprecated and for 
internal use only. We should make private[spark] versions which are not 
deprecated and switch to those internally so as to not clutter up the warning 
messages when building.
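
One way to do this, sketched below with hypothetical member names (the real 
TaskMetrics fields may differ), is to keep the deprecated public setter as a thin 
wrapper over a non-deprecated private[spark] one:

{code}
package org.apache.spark

// Illustrative pattern only -- not the actual TaskMetrics source. The package
// declaration is needed so that private[spark] compiles.
class InputMetricsExample {
  private var bytesRead: Long = 0L

  // Non-deprecated variant for internal callers; call sites inside Spark stop
  // producing deprecation warnings once they switch to this method.
  private[spark] def incBytesReadInternal(v: Long): Unit = { bytesRead += v }

  // Public, deprecated entry point kept for compatibility; delegates to the
  // internal variant.
  @deprecated("for internal use only", "2.0.0")
  def incBytesRead(v: Long): Unit = incBytesReadInternal(v)
}
{code}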



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host

2016-02-02 Thread Hao Xia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129567#comment-15129567
 ] 

Hao Xia commented on SPARK-6476:


I found a workaround by setting the SPARK_LOCAL_IP environment variable on the driver.

> Spark fileserver not started on same IP as using spark.driver.host
> --
>
> Key: SPARK-6476
> URL: https://issues.apache.org/jira/browse/SPARK-6476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Rares Vernica
>
> I initially inquired about this here: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E
> If the Spark driver host has multiple IPs and spark.driver.host is set to one 
> of them, I would expect the fileserver to start on the same IP. I checked 
> HttpServer and the jetty Server is started on the default IP of the machine: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75
> Something like this might work instead:
> {code:title=HttpServer.scala#L75}
> val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 
> 0))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13150:


Assignee: Cheng Lian  (was: Apache Spark)

> Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test 
> single session
> -
>
> Key: SPARK-13150
> URL: https://issues.apache.org/jira/browse/SPARK-13150
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129598#comment-15129598
 ] 

Apache Spark commented on SPARK-13150:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/11040

> Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test 
> single session
> -
>
> Key: SPARK-13150
> URL: https://issues.apache.org/jira/browse/SPARK-13150
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13150:


Assignee: Apache Spark  (was: Cheng Lian)

> Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test 
> single session
> -
>
> Key: SPARK-13150
> URL: https://issues.apache.org/jira/browse/SPARK-13150
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12629) SparkR: DataFrame's saveAsTable method has issues with the signature and HiveContext

2016-02-02 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12629:
--
Fix Version/s: 1.6.1

> SparkR: DataFrame's saveAsTable method has issues with the signature and 
> HiveContext 
> -
>
> Key: SPARK-12629
> URL: https://issues.apache.org/jira/browse/SPARK-12629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Narine Kokhlikyan
> Fix For: 1.6.1, 2.0.0
>
>
> There are several issues with the DataFrame's saveAsTable method in SparkR. 
> Here is a summary of some of them; hopefully this will help to fix the issues.
> 1. According to SparkR's saveAsTable(...) documentation, we can call 
> "saveAsTable(df, "myfile")" in order to store the dataframe. However, this 
> signature isn't working; it seems that "source" and "mode" are forced by the 
> signature.
> 2. Within saveAsTable(...) the method tries to retrieve the SQL context and to 
> create/initialize the source as parquet, but this also fails because, based on 
> the error messages I see, the context has to be a HiveContext.
> 3. In general the method fails when I try to call it with sqlContext.
> 4. Also, it seems that SQL DataFrame.saveAsTable is deprecated; we could use 
> df.write.saveAsTable(...) instead.
> [~shivaram] [~sunrui] [~felixcheung]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13145) checkAnswer should tolerate small float number error

2016-02-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13145:
--

 Summary: checkAnswer should tolerate small float number error
 Key: SPARK-13145
 URL: https://issues.apache.org/jira/browse/SPARK-13145
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


For example, we should compare Float/Double values like this:
{code}
 abs(actual -  expected) < expected * 1e-12
{code}
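
A small self-contained sketch of such a tolerant comparison (illustrative only; 
checkAnswer itself lives in the SQL test utilities and may be structured differently):

{code}
// Relative-tolerance comparison for floating point test results.
def approxEqual(actual: Double, expected: Double, relTol: Double = 1e-12): Boolean = {
  if (expected == 0.0) math.abs(actual) < relTol
  else math.abs(actual - expected) < math.abs(expected) * relTol
}

approxEqual(0.1 + 0.2, 0.3)   // true, even though 0.1 + 0.2 != 0.3 exactly
approxEqual(1.0000001, 1.0)   // false, the error exceeds the tolerance
{code}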



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13138) Add "logical" package prefix for ddl.scala

2016-02-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13138.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Add "logical" package prefix for ddl.scala
> --
>
> Key: SPARK-13138
> URL: https://issues.apache.org/jira/browse/SPARK-13138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> ddl.scala is defined in the execution package, and yet its references to 
> "UnaryNode" and "Command" are to the logical-plan types. This was fairly 
> confusing when I was trying to understand the ddl code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one

2016-02-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129815#comment-15129815
 ] 

Sean Owen commented on SPARK-13156:
---

It just sounds like your data is skewed. Are you sure that's not it?

> JDBC using multiple partitions creates additional tasks but only executes on 
> one
> 
>
> Key: SPARK-13156
> URL: https://issues.apache.org/jira/browse/SPARK-13156
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.0
> Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client
>Reporter: Charles Drotar
>
> I can successfully kick off a query through JDBC to Teradata, and when it 
> runs it creates a task on each executor for every partition. The problem is 
> that all of the tasks except for one complete within a couple seconds and the 
> final task handles the entire dataset.
> Example Code:
> {code}
> private val properties = new java.util.Properties()
> properties.setProperty("driver", "com.teradata.jdbc.TeraDriver")
> properties.setProperty("username", "foo")
> properties.setProperty("password", "bar")
> val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
> val numPartitions = 5
> val dbTableTemp = f"""( SELECT id MOD $numPartitions%d AS modulo, id
>   FROM db.table
> ) AS TEMP_TABLE"""
> val partitionColumn = "modulo"
> val lowerBound = 0.toLong
> val upperBound = (numPartitions - 1).toLong
> val df = sqlContext.read.jdbc(url, dbTableTemp, partitionColumn, lowerBound,
>   upperBound, numPartitions, properties)
> df.write.parquet("/output/path/for/df/")
> {code}
> When I look at the Spark UI I see 5 tasks, but only 1 is actually querying.
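
For reference, a simplified model of how a numeric partition column is turned into 
per-partition WHERE clauses (illustrative only; Spark's JDBCRelation differs in 
detail). It suggests choosing upperBound equal to numPartitions rather than 
numPartitions - 1, so that each modulo value lands in its own partition:

{code}
def whereClauses(column: String, lowerBound: Long, upperBound: Long,
                 numPartitions: Int): Seq[String] = {
  val stride = (upperBound - lowerBound) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lowerBound + i * stride
    val hi = lo + stride
    if (i == 0) s"$column < $hi"
    else if (i == numPartitions - 1) s"$column >= $lo"
    else s"$column >= $lo AND $column < $hi"
  }
}

// whereClauses("modulo", 0, 5, 5) yields one clause per residue 0..4.
// With upperBound = 4 the integer stride collapses to 0, so every row matches
// only the last, unbounded clause and a single task does all the work.
{code}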



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13151) Investigate replacing SynchronizedBuffer as it is deprecated/unreliable

2016-02-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129816#comment-15129816
 ] 

Sean Owen commented on SPARK-13151:
---

+1 to this and your next JIRA

> Investigate replacing SynchronizedBuffer as it is deprecated/unreliable
> ---
>
> Key: SPARK-13151
> URL: https://issues.apache.org/jira/browse/SPARK-13151
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: holdenk
>Priority: Trivial
>
> Building with Scala 2.11 results in the warning "trait SynchronizedBuffer in 
> package mutable is deprecated: Synchronization via traits is deprecated as it 
> is inherently unreliable. Consider 
> java.util.concurrent.ConcurrentLinkedQueue as an alternative" - we should 
> investigate whether this is a reasonable suggestion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13120) Shade protobuf-java

2016-02-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129820#comment-15129820
 ] 

Sean Owen commented on SPARK-13120:
---

Yeah, aren't you also saying this wouldn't help? And it is still not something 
identified as a problem in the thread you linked.

> Shade protobuf-java
> ---
>
> Key: SPARK-13120
> URL: https://issues.apache.org/jira/browse/SPARK-13120
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Ted Yu
>
> See this thread for background information:
> http://search-hadoop.com/m/q3RTtdkUFK11xQhP1/Spark+not+able+to+fetch+events+from+Amazon+Kinesis
> This issue shades com.google.protobuf:protobuf-java as 
> org.spark-project.protobuf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13125) makes the ratio of KafkaRDD partition to kafka topic partition configurable.

2016-02-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13125.
---
  Resolution: Not A Problem
Target Version/s:   (was: 2.0.0)

> makes the ratio of KafkaRDD partition to kafka topic partition  configurable.
> -
>
> Key: SPARK-13125
> URL: https://issues.apache.org/jira/browse/SPARK-13125
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: zhengcanbin
>  Labels: features
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Currently each Kafka topic/partition corresponds to exactly one RDD partition. In 
> some cases it is quite necessary to make this configurable, namely via a ratio 
> configuration of RDD partitions to Kafka topic partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13150) Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test single session

2016-02-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129857#comment-15129857
 ] 

Cheng Lian commented on SPARK-13150:


Please refer to [this PR 
comment|https://github.com/apache/spark/pull/11040#issuecomment-179028394] for 
the cause of the test failure.

> Flaky test: org.apache.spark.sql.hive.thriftserver.SingleSessionSuite.test 
> single session
> -
>
> Key: SPARK-13150
> URL: https://issues.apache.org/jira/browse/SPARK-13150
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50551/testReport/org.apache.spark.sql.hive.thriftserver/SingleSessionSuite/test_single_session/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13125) makes the ratio of KafkaRDD partition to kafka topic partition configurable.

2016-02-02 Thread zhengcanbin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129860#comment-15129860
 ] 

zhengcanbin commented on SPARK-13125:
-

Yes, you are right, but a shuffle will increase network load, and the number of 
partitions is limited by the total number of disks. In strictly real-time 
scenarios, having one topic partition correspond to multiple RDD partitions is 
important for increasing parallelism.

> makes the ratio of KafkaRDD partition to kafka topic partition  configurable.
> -
>
> Key: SPARK-13125
> URL: https://issues.apache.org/jira/browse/SPARK-13125
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: zhengcanbin
>  Labels: features
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Currently each Kafka topic/partition corresponds to exactly one RDD partition. In 
> some cases it is quite necessary to make this configurable, namely via a ratio 
> configuration of RDD partitions to Kafka topic partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13125) makes the ratio of KafkaRDD partition to kafka topic partition configurable.

2016-02-02 Thread zhengcanbin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129867#comment-15129867
 ] 

zhengcanbin commented on SPARK-13125:
-

A shuffle will increase network load, and the number of partitions is limited by 
the total number of disks. In strictly real-time scenarios, having one topic 
partition correspond to multiple RDD partitions is important for increasing 
parallelism. A lot of clients who run our application have raised this problem, 
so I still think it makes sense.

> makes the ratio of KafkaRDD partition to kafka topic partition  configurable.
> -
>
> Key: SPARK-13125
> URL: https://issues.apache.org/jira/browse/SPARK-13125
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: zhengcanbin
>  Labels: features
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Currently each Kafka topic/partition corresponds to exactly one RDD partition. In 
> some cases it is quite necessary to make this configurable, namely via a ratio 
> configuration of RDD partitions to Kafka topic partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13125) makes the ratio of KafkaRDD partition to kafka topic partition configurable.

2016-02-02 Thread zhengcanbin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129867#comment-15129867
 ] 

zhengcanbin edited comment on SPARK-13125 at 2/3/16 6:08 AM:
-

A shuffle will increase network load, and the number of partitions is limited by 
the total number of disks. In strictly real-time scenarios, having one topic 
partition correspond to multiple RDD partitions is important for increasing 
parallelism. A lot of clients who run our application have raised this problem, 
so I still think it makes sense.


was (Author: zhengcanbin):
Shuffle will increase net burden, and number of partitions is limited by total 
number of disk. In strictly real-time scenarios, one topic partition 
corresponds to multiple rdd partitions is important for increasing parallelism. 
A lot of clients who runs our application has raised this problem, so I still 
think it makes sense.

> makes the ratio of KafkaRDD partition to kafka topic partition  configurable.
> -
>
> Key: SPARK-13125
> URL: https://issues.apache.org/jira/browse/SPARK-13125
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: zhengcanbin
>  Labels: features
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Currently each Kafka topic/partition corresponds to exactly one RDD partition. In 
> some cases it is quite necessary to make this configurable, namely via a ratio 
> configuration of RDD partitions to Kafka topic partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13125) makes the ratio of KafkaRDD partition to kafka topic partition configurable.

2016-02-02 Thread zhengcanbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengcanbin reopened SPARK-13125:
-

A shuffle will increase network load, and the number of partitions is limited by 
the total number of disks. In strictly real-time scenarios, having one topic 
partition correspond to multiple RDD partitions is important for increasing 
parallelism. A lot of clients who run our application have raised this problem, 
so I still think it makes sense.

> makes the ratio of KafkaRDD partition to kafka topic partition  configurable.
> -
>
> Key: SPARK-13125
> URL: https://issues.apache.org/jira/browse/SPARK-13125
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: zhengcanbin
>  Labels: features
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Currently each Kafka topic/partition corresponds to exactly one RDD partition. In 
> some cases it is quite necessary to make this configurable, namely via a ratio 
> configuration of RDD partitions to Kafka topic partitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13147) improve readability of generated code

2016-02-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13147.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11032
[https://github.com/apache/spark/pull/11032]

> improve readability of generated code
> -
>
> Key: SPARK-13147
> URL: https://issues.apache.org/jira/browse/SPARK-13147
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> 1. Try to avoid the suffix (unique id).
> 2. Remove multiple empty lines in the code formatter.
> 3. Remove the comment if there is no code generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10820) Initial infrastructure

2016-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10820.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11006
[https://github.com/apache/spark/pull/11006]

> Initial infrastructure
> --
>
> Key: SPARK-10820
> URL: https://issues.apache.org/jira/browse/SPARK-10820
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13121) java mapWithState mishandles scala Option

2016-02-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13121:
-
Fix Version/s: 1.6.1

> java mapWithState mishandles scala Option
> -
>
> Key: SPARK-13121
> URL: https://issues.apache.org/jira/browse/SPARK-13121
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Streaming
>Affects Versions: 1.6.0
>Reporter: Gabriele Nizzoli
>Priority: Critical
> Fix For: 1.6.1
>
>
> In Spark Streaming, the Java mapWithState variant that uses Function3 has a bug 
> in the conversion from a Scala Option to a Java Optional. In the conversion, the 
> code in `StateSpec.scala`, line 222 is
> `Optional.fromNullable(v.get)`. This fails if `v`, an `Option`, is `None`; it is 
> better to use `JavaUtils.optionToOptional(v)` instead.
> A workaround is to use the Function4 call to mapWithState. This call has the 
> right conversion.
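
A minimal plain-Scala sketch of the failure mode described above (no Spark or Guava 
imports; the actual conversion lives in StateSpec.scala):

{code}
val some: Option[String] = Some("state")
val none: Option[String] = None

some.get      // "state"
// none.get would throw NoSuchElementException -- the same failure that
// Optional.fromNullable(v.get) hits when the state is empty.
none.orNull   // null -- a null-safe value to hand to a fromNullable-style factory
{code}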



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13126) History Server page always has horizontal scrollbar

2016-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128928#comment-15128928
 ] 

Apache Spark commented on SPARK-13126:
--

User 'zhuoliu' has created a pull request for this issue:
https://github.com/apache/spark/pull/11029

> History Server page always has horizontal scrollbar
> ---
>
> Key: SPARK-13126
> URL: https://issues.apache.org/jira/browse/SPARK-13126
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Priority: Minor
> Attachments: page_width.png
>
>
> The new History Server page table is always wider than the page no matter how 
> much larger you make the window. Most likely an odd CSS error; it doesn't seem 
> to be a simple fix when manipulating the CSS using the Web Inspector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12811) Estimator interface for generalized linear models (GLMs)

2016-02-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12811:

Comment: was deleted

(was: Should we put it under a new folder named "ml/glm"?)

> Estimator interface for generalized linear models (GLMs)
> 
>
> Key: SPARK-12811
> URL: https://issues.apache.org/jira/browse/SPARK-12811
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>Priority: Critical
>
> In Spark 1.6, MLlib provides logistic regression and linear regression with 
> L1/L2/elastic-net regularization. We want to expand the support of 
> generalized linear models (GLMs) in 2.0, e.g., Poisson/Gamma families and 
> more link functions. SPARK-9835 implements a GLM solver for the case when the 
> number of features is small. We also need to design an interface for GLMs.
> In SparkR, we can simply follow glm or glmnet. On the Python/Scala/Java side, 
> the interface should be consistent with LinearRegression and 
> LogisticRegression, e.g.,
> {code}
> val glm = new GeneralizedLinearModel()
>   .setFamily("poisson")
>   .setSolver("irls")
> {code}
> It would be great if LinearRegression and LogisticRegression could reuse code 
> from GeneralizedLinearModel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12986) Fix pydoc warnings in mllib/regression.py

2016-02-02 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129760#comment-15129760
 ] 

holdenk commented on SPARK-12986:
-

Sure :)

> Fix pydoc warnings in mllib/regression.py
> -
>
> Key: SPARK-12986
> URL: https://issues.apache.org/jira/browse/SPARK-12986
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Nam Pham
>Priority: Minor
>
> Got those warnings by running "make html" under "python/docs/":
> {code}
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LinearRegressionWithSGD:3: ERROR: Unexpected 
> indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LinearRegressionWithSGD:4: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.RidgeRegressionWithSGD:3: ERROR: Unexpected 
> indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.RidgeRegressionWithSGD:4: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LassoWithSGD:3: ERROR: Unexpected indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.LassoWithSGD:4: WARNING: Block quote ends without a 
> blank line; unexpected unindent.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.IsotonicRegression:7: ERROR: Unexpected indentation.
> /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
> pyspark.mllib.regression.IsotonicRegression:12: ERROR: Unexpected indentation.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13154) Add pydoc lint for docs

2016-02-02 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-13154:

Component/s: PySpark

> Add pydoc lint for docs
> ---
>
> Key: SPARK-13154
> URL: https://issues.apache.org/jira/browse/SPARK-13154
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Trivial
>
> As we fixed in SPARK-12986 it would be useful to have a lint rule to catch 
> this automatically.
> cc [~mengxr] & [~josephkb]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13154) Add pydoc lint for docs

2016-02-02 Thread holdenk (JIRA)
holdenk created SPARK-13154:
---

 Summary: Add pydoc lint for docs
 Key: SPARK-13154
 URL: https://issues.apache.org/jira/browse/SPARK-13154
 Project: Spark
  Issue Type: Improvement
Reporter: holdenk
Priority: Trivial


As we fixed in SPARK-12986 it would be useful to have a lint rule to catch this 
automatically.

cc [~mengxr] & [~josephkb]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12732) Fix LinearRegression.train for the case when label is constant and fitIntercept=false

2016-02-02 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-12732.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10702
[https://github.com/apache/spark/pull/10702]

> Fix LinearRegression.train for the case when label is constant and 
> fitIntercept=false
> -
>
> Key: SPARK-12732
> URL: https://issues.apache.org/jira/browse/SPARK-12732
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Imran Younus
>Assignee: Imran Younus
>Priority: Minor
> Fix For: 2.0.0
>
>
> If the target variable is constant, then the linear regression must check 
> whether fitIntercept is true or false, and handle these two cases separately.
> If fitIntercept is true, then there is no training needed and we set the 
> intercept equal to the mean of y.
> But if fitIntercept is false, then the model should still train.
> Currently, LinearRegression handles both cases in the same way. It doesn't 
> train the model and sets the intercept equal to the mean of y. This means 
> that it returns a non-zero intercept even when the user forces the regression 
> through the origin.
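
A tiny numeric illustration of the distinction, using a closed-form one-feature 
least-squares fit (purely illustrative; this is not MLlib code):

{code}
val x = Array(1.0, 2.0, 3.0)
val y = Array(3.0, 3.0, 3.0)      // constant label

// fitIntercept = true: no training needed, intercept = mean(y), all weights 0.
val intercept = y.sum / y.length  // 3.0

// fitIntercept = false: the model must still be trained through the origin,
// e.g. the one-dimensional closed form w = sum(x * y) / sum(x * x).
val w = x.zip(y).map { case (a, b) => a * b }.sum / x.map(a => a * a).sum
// w ~= 1.286; returning intercept = 3.0 in this case would be wrong.
{code}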



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13155) add runtime null check when convert catalyst array to external array

2016-02-02 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13155:
---

 Summary: add runtime null check when convert catalyst array to 
external array
 Key: SPARK-13155
 URL: https://issues.apache.org/jira/browse/SPARK-13155
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan


{code}
> scala> Seq(("a", Seq(null, new Integer(1)))).toDS().as[(String, Array[Int])].collect()
res5: Array[(String, Array[Int])] = Array((a,Array(0, 1)))
{code}

> This is wrong; we should throw an exception in this case.
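
A plain-Scala analogue of the silent default value (the catalyst conversion path is 
different, but the net effect reported above is the same kind of null-to-zero 
coercion that a runtime null check would reject):

{code}
// Casting null to a primitive Int quietly produces the default value
// rather than failing.
val x: Any = null
x.asInstanceOf[Int]   // 0, not an exception
{code}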



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13069) ActorHelper is not throttled by rate limiter

2016-02-02 Thread sachin aggarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129787#comment-15129787
 ] 

sachin aggarwal commented on SPARK-13069:
-

As of Spark 2.0 (not yet released), Spark does not use Akka any more. 

See https://issues.apache.org/jira/browse/SPARK-5293

Can you check with the latest 2.0 build to see if a similar problem exists?


> ActorHelper is not throttled by rate limiter
> 
>
> Key: SPARK-13069
> URL: https://issues.apache.org/jira/browse/SPARK-13069
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Lin Zhao
>
> The rate at which an actor receiver sends data to Spark is not limited by maxRate 
> or back pressure. Spark controls how fast it writes the data to the block manager, 
> but the receiver actor sends events asynchronously and can fill up the Akka 
> mailbox with millions of events until memory runs out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


