date:20161014


[ 
https://issues.apache.org/jira/browse/SPARK-16002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577198#comment-15577198
 ] 

Apache Spark commented on SPARK-16002:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15497

> Sleep when no new data arrives to avoid 100% CPU usage
> --
>
> Key: SPARK-16002
> URL: https://issues.apache.org/jira/browse/SPARK-16002
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Right now if the trigger is ProcessTrigger(0), StreamExecution will keep 
> polling new data even if there is no data. Then the CPU usage will be 100%. 
> We should add a minimum polling delay when no new data arrives.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-16980) Load only catalog table partition metadata required to answer a query


 [ 
https://issues.apache.org/jira/browse/SPARK-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16980.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Load only catalog table partition metadata required to answer a query
> -
>
> Key: SPARK-16980
> URL: https://issues.apache.org/jira/browse/SPARK-16980
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>Assignee: Michael Allman
> Fix For: 2.1.0
>
>
> Currently, when a user reads from a partitioned Hive table whose metadata are 
> not cached (and for which Hive table conversion is enabled and supported), 
> all partition metadata are fetched from the metastore:
> https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260
> However, if the user's query includes partition pruning predicates then we 
> only need the subset of these metadata which satisfy those predicates.
> This issue tracks work to modify the current query planning scheme so that 
> unnecessary partition metadata are not loaded.
> I've prototyped two possible approaches. The first extends 
> {{o.a.s.s.c.catalog.ExternalCatalog}} and as such is more generally 
> applicable. It requires some new abstractions and refactoring of 
> {{HadoopFsRelation}} and {{FileCatalog}}, among others. It places a greater 
> burden on other implementations of {{ExternalCatalog}}. Currently the only 
> other implementation of {{ExternalCatalog}} is {{InMemoryCatalog}}, and my 
> prototype throws an {{UnsupportedOperationException}} on that implementation.
> The second prototype is simpler and only touches code in the {{hive}} 
> project. Basically, conversion of a partitioned {{MetastoreRelation}} to 
> {{HadoopFsRelation}} is deferred to physical planning. During physical 
> planning, the partition pruning filters in the query plan are used to 
> identify the required partition metadata and a {{HadoopFsRelation}} is built 
> from those. The new query plan is then re-injected into the physical planner 
> and proceeds as normal for a {{HadoopFsRelation}}.
> On the Spark dev mailing list, [~ekhliang] expressed a preference for the 
> approach I took in my first POC. (See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html)
>  Based on that, I'm going to open a PR with that patch as a starting point 
> for an architectural/design review. It will not be a complete patch ready for 
> integration into Spark master. Rather, I would like to get early feedback on 
> the implementation details so I can shape the PR before committing a large 
> amount of time on a finished product. I will open another PR for the second 
> approach for comparison if requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-17946) Python crossJoin API similar to Scala


 [ 
https://issues.apache.org/jira/browse/SPARK-17946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17946.
-
   Resolution: Fixed
 Assignee: Srinath
Fix Version/s: 2.1.0

> Python crossJoin API similar to Scala
> -
>
> Key: SPARK-17946
> URL: https://issues.apache.org/jira/browse/SPARK-17946
> Project: Spark
>  Issue Type: Bug
>Reporter: Srinath
>Assignee: Srinath
> Fix For: 2.1.0
>
>
> https://github.com/apache/spark/pull/14866
> added an explicit cross join to the dataset api in scala, requiring crossJoin 
> to be used when there is no join condition. 
> (JIRA: https://issues.apache.org/jira/browse/SPARK-17298)
> The "join" API in python was implemented using cross join in that patch.
> Add an explicit crossJoin to python as well so the API behavior is similar to 
> Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17951) BlockFetch with multiple threads slows down after spark 1.6

2016-10-14 Thread ding (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ding updated SPARK-17951:
-
Description: 
The following code demonstrates the issue:
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName(s"BMTest")
val size = 3344570
val sc = new SparkContext(conf)

val data = sc.parallelize(1 to 100, 8)
var accum = sc.accumulator(0.0, "get remote bytes")
var i = 0
while(i < 91) {
  accum = sc.accumulator(0.0, "get remote bytes")
  val test = data.mapPartitionsWithIndex { (pid, iter) =>
val N = size
val bm = SparkEnv.get.blockManager
val blockId = TaskResultBlockId(10*i + pid)
val test = new Array[Byte](N)
Random.nextBytes(test)
val buffer = ByteBuffer.allocate(N)
buffer.limit(N)
buffer.put(test)
bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER)
Iterator(1)
  }.count()
  
  data.mapPartitionsWithIndex { (pid, iter) =>
val before = System.nanoTime()

val bm = SparkEnv.get.blockManager
(0 to 7).map(s => {
  Future {
val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s))
  }
}).map(Await.result(_, Duration.Inf))

accum.add((System.nanoTime() - before) / 1e9)
Iterator(1)
  }.count()
  println("get remote bytes take: " + accum.value/8)
  i += 1
}
  }

In spark1.6.2, average of "getting remote bytes" time is: 0.19 s while
in spark 1.5.1 average of "getting remote bytes" time is: 0.09 s

However if fetch block in single thread, the gap is much smaller.
spark1.6.2  get remote bytes: 0.21 s
spark1.5.1  get remote bytes: 0.20 s



  was:
The following code demonstrates the issue:
 def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName(s"BMTest")
val size = 3344570
val sc = new SparkContext(conf)

val data = sc.parallelize(1 to 100, 8)
var accum = sc.accumulator(0.0, "get remote bytes")
var i = 0
while(i < 91) {
  accum = sc.accumulator(0.0, "get remote bytes")
  val test = data.mapPartitionsWithIndex { (pid, iter) =>
val N = size
val bm = SparkEnv.get.blockManager
val blockId = TaskResultBlockId(10*i + pid)
val test = new Array[Byte](N)
Random.nextBytes(test)
val buffer = ByteBuffer.allocate(N)
buffer.limit(N)
buffer.put(test)
bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER)
Iterator(1)
  }.count()
  
  data.mapPartitionsWithIndex { (pid, iter) =>
val before = System.nanoTime()
val bm = SparkEnv.get.blockManager
(1 to 8).map(s => {
  Future {
val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s))
  }
}).map(Await.result(_, Duration.Inf))

accum.add((System.nanoTime() - before) / 1e9)
Iterator(1)
  }.count()
  println("get remote bytes take: " + accum.value/8)
  i += 1
}
  }

In spark1.6.2, average of "getting remote bytes" time is: 0.16s while
in spark 1.5.1 average of "getting remote bytes" time is: 0.07s

However if fetch block in single thread, the gap is much smaller.
spark1.6.2  get remote bytes: 0.191421s
spark1.5.1  get remote bytes: 0.181312s




> BlockFetch with multiple threads slows down after spark 1.6
> ---
>
> Key: SPARK-17951
> URL: https://issues.apache.org/jira/browse/SPARK-17951
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.2
> Environment: cluster with 8 node, each node has 28 cores. 10Gb network
>Reporter: ding
>
> The following code demonstrates the issue:
> def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName(s"BMTest")
> val size = 3344570
> val sc = new SparkContext(conf)
> val data = sc.parallelize(1 to 100, 8)
> var accum = sc.accumulator(0.0, "get remote bytes")
> var i = 0
> while(i < 91) {
>   accum = sc.accumulator(0.0, "get remote bytes")
>   val test = data.mapPartitionsWithIndex { (pid, iter) =>
> val N = size
> val bm = SparkEnv.get.blockManager
> val blockId = TaskResultBlockId(10*i + pid)
> val test = new Array[Byte](N)
> Random.nextBytes(test)
> val buffer = ByteBuffer.allocate(N)
> buffer.limit(N)
> buffer.put(test)
> bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER)
> Iterator(1)
>   }.count()
>   
>   data.mapPartitionsWithIndex { (pid, iter) =>
> val before = System.nanoTime()
> 
> val bm = SparkEnv.get.blockManager

[jira] [Commented] (SPARK-17813) Maximum data per trigger


[ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577037#comment-15577037
 ] 

Cody Koeninger commented on SPARK-17813:


To be clear, the current direct stream (and as a result structured stream) 
straight up will not work with compacted topics currently, because of the 
assumption that offset ranges are contiguous.  There's a ticket for it 
SPARK-17147 with a prototype solution, waiting for feedback from a user on it.

So for global maxOffsetsPerTrigger are you saying a spark configuration?  Is 
there a reason not to make that a maxRowsPerTrigger (or messages, or whatever 
name) so that it can potentially be reused by other sources?  I think for this 
a proportional distribution of offsets shouldn't be too hard.  I can pick this 
up once the assign stuff is stabilized.

> Maximum data per trigger
> 
>
> Key: SPARK-17813
> URL: https://issues.apache.org/jira/browse/SPARK-17813
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> At any given point in a streaming query execution, we process all available 
> data.  This maximizes throughput at the cost of latency.  We should add 
> something similar to the {{maxFilesPerTrigger}} option available for files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-17951) BlockFetch with multiple threads slows down after spark 1.6

2016-10-14 Thread ding (JIRA)

ding created SPARK-17951:


 Summary: BlockFetch with multiple threads slows down after spark 
1.6
 Key: SPARK-17951
 URL: https://issues.apache.org/jira/browse/SPARK-17951
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 1.6.2
 Environment: cluster with 8 node, each node has 28 cores. 10Gb network
Reporter: ding


The following code demonstrates the issue:
 def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName(s"BMTest")
val size = 3344570
val sc = new SparkContext(conf)

val data = sc.parallelize(1 to 100, 8)
var accum = sc.accumulator(0.0, "get remote bytes")
var i = 0
while(i < 91) {
  accum = sc.accumulator(0.0, "get remote bytes")
  val test = data.mapPartitionsWithIndex { (pid, iter) =>
val N = size
val bm = SparkEnv.get.blockManager
val blockId = TaskResultBlockId(10*i + pid)
val test = new Array[Byte](N)
Random.nextBytes(test)
val buffer = ByteBuffer.allocate(N)
buffer.limit(N)
buffer.put(test)
bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER)
Iterator(1)
  }.count()
  
  data.mapPartitionsWithIndex { (pid, iter) =>
val before = System.nanoTime()
val bm = SparkEnv.get.blockManager
(1 to 8).map(s => {
  Future {
val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s))
  }
}).map(Await.result(_, Duration.Inf))

accum.add((System.nanoTime() - before) / 1e9)
Iterator(1)
  }.count()
  println("get remote bytes take: " + accum.value/8)
  i += 1
}
  }

In spark1.6.2, average of "getting remote bytes" time is: 0.16s while
in spark 1.5.1 average of "getting remote bytes" time is: 0.07s

However if fetch block in single thread, the gap is much smaller.
spark1.6.2  get remote bytes: 0.191421s
spark1.5.1  get remote bytes: 0.181312s





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17812) More granular control of starting offsets (assign)


 [ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-17812:
---
Description: 
Right now you can only run a Streaming Query starting from either the earliest 
or latests offsets available at the moment the query is started.  Sometimes 
this is a lot of data.  It would be nice to be able to do the following:
 - seek to user specified offsets for manually specified topicpartitions

currently agreed on plan:

Mutually exclusive subscription options (only assign is new to this ticket)
{noformat}
.option("subscribe","topicFoo,topicBar")
.option("subscribePattern","topic.*")
.option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
{noformat}

where assign can only be specified that way, no inline offsets

Single starting position option with three mutually exclusive types of value
{noformat}
.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, 
"1": -2}, "topicBar":{"0": -1}}""")
{noformat}

startingOffsets with json fails if any topicpartition in the assignments 
doesn't have an offset.


  was:
Right now you can only run a Streaming Query starting from either the earliest 
or latests offsets available at the moment the query is started.  Sometimes 
this is a lot of data.  It would be nice to be able to do the following:
 - seek to user specified offsets for manually specified topicpartitions


> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions
> currently agreed on plan:
> Mutually exclusive subscription options (only assign is new to this ticket)
> {noformat}
> .option("subscribe","topicFoo,topicBar")
> .option("subscribePattern","topic.*")
> .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
> {noformat}
> where assign can only be specified that way, no inline offsets
> Single starting position option with three mutually exclusive types of value
> {noformat}
> .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 
> 1234, "1": -2}, "topicBar":{"0": -1}}""")
> {noformat}
> startingOffsets with json fails if any topicpartition in the assignments 
> doesn't have an offset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)


[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577022#comment-15577022
 ] 

Cody Koeninger commented on SPARK-17812:


Assign is useful, otherwise you have no way of consuming only particular 
partitions of a topic.

Yeah, I just ended up using jackson tree model directly, as you said the 
catalyst stuff isn't really applicable.

Branch with initial implementation is is at 
https://github.com/koeninger/spark-1/tree/SPARK-17812 , will send a PR once I 
have some tests... trying to figure out if there's a reasonable way of unit 
testing offset out of range, but may just give up on that if it seems flaky.

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17950) Match SparseVector behavior with DenseVector


 [ 
https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17950:


Assignee: Apache Spark

> Match SparseVector behavior with DenseVector
> 
>
> Key: SPARK-17950
> URL: https://issues.apache.org/jira/browse/SPARK-17950
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.1
>Reporter: AbderRahman Sobh
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Simply added the `__getattr__` to SparseVector that DenseVector has, but 
> calls self.toArray() instead of storing a vector all the time in self.array
> This allows for use of numpy functions on the values of a SparseVector in the 
> same direct way that users interact with DenseVectors.
>  i.e. you can simply call SparseVector.mean() to average the values in the 
> entire vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17950) Match SparseVector behavior with DenseVector


[ 
https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576991#comment-15576991
 ] 

Apache Spark commented on SPARK-17950:
--

User 'itg-abby' has created a pull request for this issue:
https://github.com/apache/spark/pull/15496

> Match SparseVector behavior with DenseVector
> 
>
> Key: SPARK-17950
> URL: https://issues.apache.org/jira/browse/SPARK-17950
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.1
>Reporter: AbderRahman Sobh
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Simply added the `__getattr__` to SparseVector that DenseVector has, but 
> calls self.toArray() instead of storing a vector all the time in self.array
> This allows for use of numpy functions on the values of a SparseVector in the 
> same direct way that users interact with DenseVectors.
>  i.e. you can simply call SparseVector.mean() to average the values in the 
> entire vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17950) Match SparseVector behavior with DenseVector


 [ 
https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17950:


Assignee: (was: Apache Spark)

> Match SparseVector behavior with DenseVector
> 
>
> Key: SPARK-17950
> URL: https://issues.apache.org/jira/browse/SPARK-17950
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.1
>Reporter: AbderRahman Sobh
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Simply added the `__getattr__` to SparseVector that DenseVector has, but 
> calls self.toArray() instead of storing a vector all the time in self.array
> This allows for use of numpy functions on the values of a SparseVector in the 
> same direct way that users interact with DenseVectors.
>  i.e. you can simply call SparseVector.mean() to average the values in the 
> entire vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-17950) Match SparseVector behavior with DenseVector

2016-10-14 Thread AbderRahman Sobh (JIRA)

AbderRahman Sobh created SPARK-17950:


 Summary: Match SparseVector behavior with DenseVector
 Key: SPARK-17950
 URL: https://issues.apache.org/jira/browse/SPARK-17950
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 2.0.1
Reporter: AbderRahman Sobh
Priority: Minor


Simply added the `__getattr__` to SparseVector that DenseVector has, but calls 
self.toArray() instead of storing a vector all the time in self.array

This allows for use of numpy functions on the values of a SparseVector in the 
same direct way that users interact with DenseVectors.
 i.e. you can simply call SparseVector.mean() to average the values in the 
entire vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17949) Introduce a JVM object based aggregate operator


 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-17949:
--

Assignee: Cheng Lian

> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> 1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> 2. For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce an JVM object based hash aggregate operator 
> that can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceed a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-17949) Introduce a JVM object based aggregate operator

Reynold Xin created SPARK-17949:
---

 Summary: Introduce a JVM object based aggregate operator
 Key: SPARK-17949
 URL: https://issues.apache.org/jira/browse/SPARK-17949
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


The new Tungsten execution engine has very robust memory management and speed 
for simple data types. It does, however, suffer from the following:

1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is 
fairly expensive to fit into the Tungsten internal format.

2. For aggregate functions that require complex intermediate data structures, 
Unsafe (on raw bytes) is not a good programming abstraction due to the lack of 
structs.

The idea here is to introduce an JVM object based hash aggregate operator that 
can support the aforementioned use cases. This operator, however, should limit 
its memory usage to avoid putting too much pressure on GC, e.g. falling back to 
sort-based aggregate as soon the number of objects exceed a very low threshold.

Internally at Databricks we prototyped a version of this for a customer POC and 
have observed substantial speed-ups over existing Spark.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-17948) WARN CodeGenerator: Error calculating stats of compiled class

2016-10-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17948.

Resolution: Duplicate

>  WARN CodeGenerator: Error calculating stats of compiled class
> --
>
> Key: SPARK-17948
> URL: https://issues.apache.org/jira/browse/SPARK-17948
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>
> I am getting below error in 2.0.2 snapshot. I use pyspark
> 16/10/14 22:33:25 WARN CodeGenerator: Error calculating stats of compiled 
> class.
> java.lang.IndexOutOfBoundsException: Index: 2659, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:457)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:469)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1387)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:555)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:518)
>   at org.codehaus.janino.util.ClassFile.(ClassFile.java:185)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:919)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:916)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:916)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:888)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:950)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:947)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:841)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:140)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:369)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4$$anonfun$5.apply(HashAggregateExec.scala:110)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4$$anonfun$5.apply(HashAggregateExec.scala:109)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:179)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:198)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:92)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:103)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:94)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:785)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:785)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at

[jira] [Created] (SPARK-17948) WARN CodeGenerator: Error calculating stats of compiled class

Harish created SPARK-17948:
--

 Summary:  WARN CodeGenerator: Error calculating stats of compiled 
class
 Key: SPARK-17948
 URL: https://issues.apache.org/jira/browse/SPARK-17948
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.1
Reporter: Harish


I am getting below error in 2.0.2 snapshot. I use pyspark


16/10/14 22:33:25 WARN CodeGenerator: Error calculating stats of compiled class.
java.lang.IndexOutOfBoundsException: Index: 2659, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at 
org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:457)
at 
org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:469)
at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1387)
at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:555)
at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:518)
at org.codehaus.janino.util.ClassFile.(ClassFile.java:185)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:919)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:916)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:916)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:888)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:950)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:947)
at 
org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at 
org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at 
org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at 
org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
at 
org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:841)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:140)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44)
at 
org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:369)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4$$anonfun$5.apply(HashAggregateExec.scala:110)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4$$anonfun$5.apply(HashAggregateExec.scala:109)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:179)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:198)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:92)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:103)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1$$anonfun$4.apply(HashAggregateExec.scala:94)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:785)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:785)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at

[jira] [Resolved] (SPARK-17900) Mark the following Spark SQL APIs as stable


 [ 
https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-17900.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15469
[https://github.com/apache/spark/pull/15469]

> Mark the following Spark SQL APIs as stable
> ---
>
> Key: SPARK-17900
> URL: https://issues.apache.org/jira/browse/SPARK-17900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> Mark the following stable:
> Dataset/DataFrame
> - functions, since 1.3
> - ColumnName, since 1.3
> - DataFrameNaFunctions, since 1.3.1
> - DataFrameStatFunctions, since 1.4
> - UserDefinedFunction, since 1.3
> - UserDefinedAggregateFunction, since 1.5
> - Window and WindowSpec, since 1.4
> Data sources:
> - DataSourceRegister, since 1.5
> - RelationProvider, since 1.3
> - SchemaRelationProvider, since 1.3
> - CreatableRelationProvider, since 1.3
> - BaseRelation, since 1.3
> - TableScan, since 1.3
> - PrunedScan, since 1.3
> - PrunedFilteredScan, since 1.3
> - InsertableRelation, since 1.3
> Keep the following experimental / evolving:
> Data sources:
> - CatalystScan (tied to internal logical plans so it is not stable by 
> definition)
> Structured streaming:
> - all classes (introduced new in 2.0 and will likely change)
> Dataset typed operations (introduced in 1.6 and 2.0 and might change, 
> although probability is low)
> - all typed methods on Dataset
> - KeyValueGroupedDataset
> - o.a.s.sql.expressions.javalang.typed
> - o.a.s.sql.expressions.scalalang.typed
> - methods that return typed Dataset in SparkSession



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)


[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576840#comment-15576840
 ] 

Michael Armbrust commented on SPARK-17812:
--

That sounds pretty good to me, with one question:  Is {{assign}} useful here?  
It seems you know the list of topicpartitions as they are all passed to 
{{startingOffsets}}.  If we get rid of {{assign}}, and keep the offset log 
format consistent with {{startingOffsets}}, then you could resume a query where 
another left off, simply by copying the last batch.  However, if we keep 
{{assign}}, you'll have to type that out manually and I'm not sure what you are 
gaining.

I would use jackson for the JSON stuff, but I would probably not use 
catalyst/encoders since those require code generation thats not going to buy us 
much.

Thanks for working on this!

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=


 [ 
https://issues.apache.org/jira/browse/SPARK-17942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harish resolved SPARK-17942.

Resolution: Works for Me

> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> -
>
> Key: SPARK-17942
> URL: https://issues.apache.org/jira/browse/SPARK-17942
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>Priority: Minor
>
> My code snipped is  in below location. In that  snippet i had put only few 
> columns, but in my test case i have data with 10M rows and 10,000 columns.
> http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark
> I see below message in spark 2.0.2 snapshot
> # Stderr of the node
> OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been 
> disabled.
> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> # stdout of the node
> CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb
>  bounds [0x7f32c500, 0x7f32d400, 0x7f32d400]
>  total_blobs=41388 nmethods=40792 adapters=501
>  compilation: disabled (not enough contiguous free space left)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17813) Maximum data per trigger


[ 
https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576826#comment-15576826
 ] 

Michael Armbrust commented on SPARK-17813:
--

I think its okay to ignore compacted topics, at least initially.  You would 
still respect the "maximum" nature of the configuration, though would waste 
some effort scheduling tasks smaller than the max.

I would probably start simple and just have a global {{maxOffsetsPerTrigger}} 
that bounds the total number of records in each batch and is distributed 
amongst the topic partitions.  topicpartitions that are skewed too small will 
not have enough offsets available and we can spill that over to the ones that 
are skewed large.  We can always add something more complicated in the future.

An alternative proposal would be to spread out the max to each partition 
proportional to the total number of offsets available when planning.

Regarding [SPARK-17510], I would make this configuration an option to the 
DataStreamReader, you'd be able to configure it perstream instead of globally.  
So, I think we are good.

> Maximum data per trigger
> 
>
> Key: SPARK-17813
> URL: https://issues.apache.org/jira/browse/SPARK-17813
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> At any given point in a streaming query execution, we process all available 
> data.  This maximizes throughput at the cost of latency.  We should add 
> something similar to the {{maxFilesPerTrigger}} option available for files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-11775) Allow PySpark to register Java UDF


 [ 
https://issues.apache.org/jira/browse/SPARK-11775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11775.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 9766
[https://github.com/apache/spark/pull/9766]

> Allow PySpark to register Java UDF
> --
>
> Key: SPARK-11775
> URL: https://issues.apache.org/jira/browse/SPARK-11775
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Reporter: Jeff Zhang
> Fix For: 2.1.0
>
>
> Currently pyspark can only call the builtin java UDF, but can not call custom 
> java UDF. It would be better to allow that. 2 benefits:
> * Leverage the power of rich third party java library
> * Improve the performance. Because if we use python UDF, python daemons will 
> be started on worker which will affect the performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde


[ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576787#comment-15576787
 ] 

Apache Spark commented on SPARK-17620:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/15495

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Dilip Biswal
>Priority: Minor
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties


 [ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17748:


Assignee: Apache Spark  (was: Seth Hendrickson)

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elasticnet penalties by solving the equations locally using OWL-QN solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for solving singular covariance matrices by also adding 
> L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties


[ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576770#comment-15576770
 ] 

Apache Spark commented on SPARK-17748:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/15394

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elasticnet penalties by solving the equations locally using OWL-QN solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for solving singular covariance matrices by also adding 
> L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties


 [ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17748:


Assignee: Seth Hendrickson  (was: Apache Spark)

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elasticnet penalties by solving the equations locally using OWL-QN solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for solving singular covariance matrices by also adding 
> L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error


[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576763#comment-15576763
 ] 

Xiao Li commented on SPARK-17709:
-

That is what I said above. The deduplication is not triggered. It looks weird 
to me. Please try the 2.0.1. We fixed a lot of bugs in 2.0.1

Thanks!

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12776) Implement Python API for Datasets


[ 
https://issues.apache.org/jira/browse/SPARK-12776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576739#comment-15576739
 ] 

Michael Armbrust commented on SPARK-12776:
--

I would love to see better support here, but I don't think anyone has taken the 
time to flesh out the API.  Some suggestions I've heard are too support 
arbitrary objects in Datasets that are pickled and stored as 
{{ArrayType(BytesType)}}.  If you want schema, you could also tell us that, 
either with a schema string (where we use the schema to extract columns out of 
the object) or by using something like a named tuple.  This is all very rough 
though and I knew when we did a prototype we ran into issues caused by the 
batching we use when talking to the JVM (essentially things like {{df.count()}} 
broke).  If someone wants to flesh out these proposals that would be great.

> Implement Python API for Datasets
> -
>
> Key: SPARK-12776
> URL: https://issues.apache.org/jira/browse/SPARK-12776
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Kevin Cox
>Priority: Minor
>
> Now that the Dataset API is in Scala and Java it would be awesome to see it 
> show up in PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-16063) Add storageLevel to Dataset


 [ 
https://issues.apache.org/jira/browse/SPARK-16063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-16063.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13780
[https://github.com/apache/spark/pull/13780]

> Add storageLevel to Dataset
> ---
>
> Key: SPARK-16063
> URL: https://issues.apache.org/jira/browse/SPARK-16063
> Project: Spark
>  Issue Type: Improvement
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.1.0
>
>
> SPARK-11905 added {{cache}}/{{persist}} to {{Dataset}}. We should add 
> {{Dataset.storageLevel}}, analogous to {{RDD.getStorageLevel}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong


[ 
https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576623#comment-15576623
 ] 

Cheng Lian commented on SPARK-10954:


[~hyukjin.kwon], yes, confirmed. Thanks!

> Parquet version in the "created_by" metadata field of Parquet files written 
> by Spark 1.5 and 1.6 is wrong
> -
>
> Key: SPARK-10954
> URL: https://issues.apache.org/jira/browse/SPARK-10954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Cheng Lian
>Assignee: Gayathri Murali
>Priority: Minor
>
> We've upgraded to parquet-mr 1.7.0 in Spark 1.5, but the {{created_by}} field 
> still says 1.6.0. This issue can be reproduced by generating any Parquet file 
> with Spark 1.5, and then check the metadata with {{parquet-meta}} CLI tool:
> {noformat}
> $ parquet-meta /tmp/parquet/dec
> file:
> file:/tmp/parquet/dec/part-r-0-f210e968-1be5-40bc-bcbc-007f935e6dc7.gz.parquet
> creator: parquet-mr version 1.6.0
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"dec","type":"decimal(20,2)","nullable":true,"metadata":{}}]}
> file schema: spark_schema
> -
> dec: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1: RC:10 TS:140 OFFSET:4
> -
> dec:  FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:4 SZ:99/140/1.41 VC:10 
> ENC:PLAIN,BIT_PACKED,RLE
> {noformat}
> Note that this field is written by parquet-mr rather than Spark. However, 
> writing Parquet files using parquet-mr 1.7.0 directly without Spark 1.5 only 
> shows {{parquet-mr}} without any version number. Files written by parquet-mr 
> 1.8.1 without Spark look fine though.
> Currently this isn't a big issue. But parquet-mr 1.8 checks for this field to 
> workaround PARQUET-251.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17863) SELECT distinct does not work if there is a order by clause


 [ 
https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17863:
-
Assignee: Davies Liu

> SELECT distinct does not work if there is a order by clause
> ---
>
> Key: SPARK-17863
> URL: https://issues.apache.org/jira/browse/SPARK-17863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.0.2, 2.1.0
>
>
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by struct.a, struct.b
> {code}
> This query generates
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  1|  2|
> +---+---+
> {code}
> The plan is wrong because the analyze somehow added {{struct#21805}} to the 
> project list, which changes the semantic of the distinct (basically, the 
> query is changed to {{select distinct struct.a, struct.b, struct}} from 
> {{select distinct struct.a, struct.b}}).
> {code}
> == Parsed Logical Plan ==
> 'Sort ['struct.a ASC, 'struct.b ASC], true
> +- 'Distinct
>+- 'Project ['struct.a, 'struct.b]
>   +- 'SubqueryAlias tmp
>  +- 'Union
> :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
> :  +- OneRowRelation$
> +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>+- OneRowRelation$
> == Analyzed Logical Plan ==
> a: int, b: int
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Distinct
>   +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
> struct#21805]
>  +- SubqueryAlias tmp
> +- Union
>:- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
>:  +- OneRowRelation$
>+- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>   +- OneRowRelation$
> == Optimized Logical Plan ==
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
> struct#21805]
>   +- Union
>  :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
>  :  +- OneRowRelation$
>  +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
> +- OneRowRelation$
> == Physical Plan ==
> *Project [a#21819, b#21820]
> +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
>+- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
>   +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
> output=[a#21819, b#21820, struct#21805])
>  +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
> +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 
> functions=[], output=[a#21819, b#21820, struct#21805])
>+- Union
>   :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS 
> struct#21805]
>   :  +- Scan OneRowRelation[]
>   +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS 
> struct#21806]
>  +- Scan OneRowRelation[]
> {code}
> If you use the following query, you will get the correct result
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by a, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-17863) SELECT distinct does not work if there is a order by clause


 [ 
https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17863.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

Issue resolved by pull request 15489
[https://github.com/apache/spark/pull/15489]

> SELECT distinct does not work if there is a order by clause
> ---
>
> Key: SPARK-17863
> URL: https://issues.apache.org/jira/browse/SPARK-17863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.0.2, 2.1.0
>
>
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by struct.a, struct.b
> {code}
> This query generates
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  1|  2|
> +---+---+
> {code}
> The plan is wrong because the analyze somehow added {{struct#21805}} to the 
> project list, which changes the semantic of the distinct (basically, the 
> query is changed to {{select distinct struct.a, struct.b, struct}} from 
> {{select distinct struct.a, struct.b}}).
> {code}
> == Parsed Logical Plan ==
> 'Sort ['struct.a ASC, 'struct.b ASC], true
> +- 'Distinct
>+- 'Project ['struct.a, 'struct.b]
>   +- 'SubqueryAlias tmp
>  +- 'Union
> :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
> :  +- OneRowRelation$
> +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>+- OneRowRelation$
> == Analyzed Logical Plan ==
> a: int, b: int
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Distinct
>   +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
> struct#21805]
>  +- SubqueryAlias tmp
> +- Union
>:- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
>:  +- OneRowRelation$
>+- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>   +- OneRowRelation$
> == Optimized Logical Plan ==
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
> struct#21805]
>   +- Union
>  :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
>  :  +- OneRowRelation$
>  +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
> +- OneRowRelation$
> == Physical Plan ==
> *Project [a#21819, b#21820]
> +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
>+- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
>   +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
> output=[a#21819, b#21820, struct#21805])
>  +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
> +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 
> functions=[], output=[a#21819, b#21820, struct#21805])
>+- Union
>   :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS 
> struct#21805]
>   :  +- Scan OneRowRelation[]
>   +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS 
> struct#21806]
>  +- Scan OneRowRelation[]
> {code}
> If you use the following query, you will get the correct result
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by a, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17947) Document the impact of `spark.sql.debug`


 [ 
https://issues.apache.org/jira/browse/SPARK-17947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17947:


Assignee: (was: Apache Spark)

> Document the impact of `spark.sql.debug`
> 
>
> Key: SPARK-17947
> URL: https://issues.apache.org/jira/browse/SPARK-17947
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> Just document the impact of `spark.sql.debug`
>  When enabling the debug, Spark SQL internal table properties are not 
> filtered out; however, some related DDL commands (e.g., Analyze Table and 
> CREATE TABLE LIKE) might not work properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17947) Document the impact of `spark.sql.debug`


 [ 
https://issues.apache.org/jira/browse/SPARK-17947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17947:


Assignee: Apache Spark

> Document the impact of `spark.sql.debug`
> 
>
> Key: SPARK-17947
> URL: https://issues.apache.org/jira/browse/SPARK-17947
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Just document the impact of `spark.sql.debug`
>  When enabling the debug, Spark SQL internal table properties are not 
> filtered out; however, some related DDL commands (e.g., Analyze Table and 
> CREATE TABLE LIKE) might not work properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17947) Document the impact of `spark.sql.debug`


[ 
https://issues.apache.org/jira/browse/SPARK-17947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576587#comment-15576587
 ] 

Apache Spark commented on SPARK-17947:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15494

> Document the impact of `spark.sql.debug`
> 
>
> Key: SPARK-17947
> URL: https://issues.apache.org/jira/browse/SPARK-17947
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> Just document the impact of `spark.sql.debug`
>  When enabling the debug, Spark SQL internal table properties are not 
> filtered out; however, some related DDL commands (e.g., Analyze Table and 
> CREATE TABLE LIKE) might not work properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-17947) Just document the impact of `spark.sql.debug`

Xiao Li created SPARK-17947:
---

 Summary: Just document the impact of `spark.sql.debug`
 Key: SPARK-17947
 URL: https://issues.apache.org/jira/browse/SPARK-17947
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li


Just document the impact of `spark.sql.debug`

 When enabling the debug, Spark SQL internal table properties are not filtered 
out; however, some related DDL commands (e.g., Analyze Table and CREATE TABLE 
LIKE) might not work properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17947) Document the impact of `spark.sql.debug`


 [ 
https://issues.apache.org/jira/browse/SPARK-17947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17947:

Summary: Document the impact of `spark.sql.debug`  (was: Just document the 
impact of `spark.sql.debug`)

> Document the impact of `spark.sql.debug`
> 
>
> Key: SPARK-17947
> URL: https://issues.apache.org/jira/browse/SPARK-17947
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> Just document the impact of `spark.sql.debug`
>  When enabling the debug, Spark SQL internal table properties are not 
> filtered out; however, some related DDL commands (e.g., Analyze Table and 
> CREATE TABLE LIKE) might not work properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17944) sbin/start-* scripts use of `hostname -f` fail with Solaris

2016-10-14 Thread Erik O'Shaughnessy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik O'Shaughnessy updated SPARK-17944:
---
Summary: sbin/start-* scripts use of `hostname -f` fail with Solaris   
(was: sbin/start-* scripts use of `hostname -f` fail for Solaris )

> sbin/start-* scripts use of `hostname -f` fail with Solaris 
> 
>
> Key: SPARK-17944
> URL: https://issues.apache.org/jira/browse/SPARK-17944
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Solaris 10, Solaris 11
>Reporter: Erik O'Shaughnessy
>Priority: Trivial
>
> {{$SPARK_HOME/sbin/start-master.sh}} fails:
> {noformat}
> $ ./start-master.sh 
> usage: hostname [[-t] system_name]
>hostname [-D]
> starting org.apache.spark.deploy.master.Master, logging to 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> failed to launch org.apache.spark.deploy.master.Master:
> --properties-file FILE Path to a custom Spark properties file.
>Default is conf/spark-defaults.conf.
> full log in 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> {noformat}
> I found SPARK-17546 which changed the invocation of hostname in 
> sbin/start-master.sh, sbin/start-slaves.sh and sbin/start-mesos-dispatcher.sh 
> to include the flag {{-f}}, which is not a valid command line option for the 
> Solaris hostname implementation. 
> As a workaround, Solaris users can substitute:
> {noformat}
> `/usr/sbin/check-hostname | awk '{print $NF}'`
> {noformat}
> Admittedly not an obvious fix, but it provides equivalent functionality. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call


 [ 
https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian closed SPARK-9783.
-
Resolution: Not A Problem

This issue is no longer a problem since we re-implemented the JSON data source 
using the new {{FileFormat}} facilities in Spark 2.0.0.

> Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
> -
>
> Key: SPARK-9783
> URL: https://issues.apache.org/jira/browse/SPARK-9783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> PR #8035 made a quick fix for SPARK-9743 by introducing an extra 
> {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts 
> performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and 
> override {{listStatus()}} to inject cached {{FileStatus}} instances, similar 
> as what we did in {{ParquetRelation}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call


[ 
https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576523#comment-15576523
 ] 

Cheng Lian commented on SPARK-9783:
---

Yes, I'm closing this. Thanks!

> Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
> -
>
> Key: SPARK-9783
> URL: https://issues.apache.org/jira/browse/SPARK-9783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> PR #8035 made a quick fix for SPARK-9743 by introducing an extra 
> {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts 
> performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and 
> override {{listStatus()}} to inject cached {{FileStatus}} instances, similar 
> as what we did in {{ParquetRelation}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17636) Parquet filter push down doesn't handle struct fields


[ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576513#comment-15576513
 ] 

Cheng Lian commented on SPARK-17636:


[~MasterDDT], yes, just as what [~hyukjin.kwon] explained previously, it's not 
implemented and is expected behavior.

> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Mitesh
>Priority: Minor
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17636) Parquet filter push down doesn't handle struct fields


 [ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17636:
---
Description: 
There's a *PushedFilters* for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{noformat}
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")

res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan

res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan

res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file
{noformat}

  was:
Theres a *PushedFilters* for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{quote} 
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")

res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan

res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan

res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file
{quote} 



> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Mitesh
>Priority: Minor
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17946) Python crossJoin API similar to Scala


[ 
https://issues.apache.org/jira/browse/SPARK-17946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576495#comment-15576495
 ] 

Apache Spark commented on SPARK-17946:
--

User 'srinathshankar' has created a pull request for this issue:
https://github.com/apache/spark/pull/15493

> Python crossJoin API similar to Scala
> -
>
> Key: SPARK-17946
> URL: https://issues.apache.org/jira/browse/SPARK-17946
> Project: Spark
>  Issue Type: Bug
>Reporter: Srinath
>
> https://github.com/apache/spark/pull/14866
> added an explicit cross join to the dataset api in scala, requiring crossJoin 
> to be used when there is no join condition. 
> (JIRA: https://issues.apache.org/jira/browse/SPARK-17298)
> The "join" API in python was implemented using cross join in that patch.
> Add an explicit crossJoin to python as well so the API behavior is similar to 
> Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17946) Python crossJoin API similar to Scala


 [ 
https://issues.apache.org/jira/browse/SPARK-17946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17946:


Assignee: Apache Spark

> Python crossJoin API similar to Scala
> -
>
> Key: SPARK-17946
> URL: https://issues.apache.org/jira/browse/SPARK-17946
> Project: Spark
>  Issue Type: Bug
>Reporter: Srinath
>Assignee: Apache Spark
>
> https://github.com/apache/spark/pull/14866
> added an explicit cross join to the dataset api in scala, requiring crossJoin 
> to be used when there is no join condition. 
> (JIRA: https://issues.apache.org/jira/browse/SPARK-17298)
> The "join" API in python was implemented using cross join in that patch.
> Add an explicit crossJoin to python as well so the API behavior is similar to 
> Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17946) Python crossJoin API similar to Scala


 [ 
https://issues.apache.org/jira/browse/SPARK-17946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17946:


Assignee: (was: Apache Spark)

> Python crossJoin API similar to Scala
> -
>
> Key: SPARK-17946
> URL: https://issues.apache.org/jira/browse/SPARK-17946
> Project: Spark
>  Issue Type: Bug
>Reporter: Srinath
>
> https://github.com/apache/spark/pull/14866
> added an explicit cross join to the dataset api in scala, requiring crossJoin 
> to be used when there is no join condition. 
> (JIRA: https://issues.apache.org/jira/browse/SPARK-17298)
> The "join" API in python was implemented using cross join in that patch.
> Add an explicit crossJoin to python as well so the API behavior is similar to 
> Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde


 [ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17620:


Assignee: Dilip Biswal  (was: Apache Spark)

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Dilip Biswal
>Priority: Minor
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde


 [ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17620:
-
Fix Version/s: (was: 2.1.0)

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Dilip Biswal
>Priority: Minor
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Reopened] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde


 [ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reopened SPARK-17620:
--

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Dilip Biswal
>Priority: Minor
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde


[ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576492#comment-15576492
 ] 

Yin Huai commented on SPARK-17620:
--

The PR somehow breaks the build and it has been reverted.

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Dilip Biswal
>Priority: Minor
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde


 [ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17620:


Assignee: Apache Spark  (was: Dilip Biswal)

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Apache Spark
>Priority: Minor
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-17946) Python crossJoin API similar to Scala

2016-10-14 Thread Srinath (JIRA)

Srinath created SPARK-17946:
---

 Summary: Python crossJoin API similar to Scala
 Key: SPARK-17946
 URL: https://issues.apache.org/jira/browse/SPARK-17946
 Project: Spark
  Issue Type: Bug
Reporter: Srinath


https://github.com/apache/spark/pull/14866
added an explicit cross join to the dataset api in scala, requiring crossJoin 
to be used when there is no join condition. 
(JIRA: https://issues.apache.org/jira/browse/SPARK-17298)
The "join" API in python was implemented using cross join in that patch.
Add an explicit crossJoin to python as well so the API behavior is similar to 
Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-17945) Writing to S3 should allow setting object metadata

2016-10-14 Thread Jeff Schobelock (JIRA)

Jeff Schobelock created SPARK-17945:
---

 Summary: Writing to S3 should allow setting object metadata
 Key: SPARK-17945
 URL: https://issues.apache.org/jira/browse/SPARK-17945
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Jeff Schobelock
Priority: Minor


I can't find any possible way to use Spark to write to S3 and set user object 
metadata. This seems like such a simple thing that I feel I must be missing 
somewhere how to do itbut I have yet to find anything.

I don't know what all work adding this would entail. My idea would be that 
there is something like:

rdd.saveAsTextFile(s3://testbucket/file).withMetadata(Map data).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet

2016-10-14 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai closed SPARK-17941.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Logistic regression test suites should use weights when comparing to glmnet
> ---
>
> Key: SPARK-17941
> URL: https://issues.apache.org/jira/browse/SPARK-17941
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.1.0
>
>
> Logistic regression suite currently has many test cases comparing to R's 
> glmnet. Both libraries support weights, and to make the testing of weights in 
> Spark LOR more robust, we should add weights to all the test cases. The 
> current weight testing is quite minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet

2016-10-14 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-17941:

Assignee: Seth Hendrickson

> Logistic regression test suites should use weights when comparing to glmnet
> ---
>
> Key: SPARK-17941
> URL: https://issues.apache.org/jira/browse/SPARK-17941
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.1.0
>
>
> Logistic regression suite currently has many test cases comparing to R's 
> glmnet. Both libraries support weights, and to make the testing of weights in 
> Spark LOR more robust, we should add weights to all the test cases. The 
> current weight testing is quite minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde


 [ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17620:

Assignee: Dilip Biswal

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 2.1.0
>
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde


 [ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17620.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15190
[https://github.com/apache/spark/pull/15190]

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Priority: Minor
> Fix For: 2.1.0
>
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=


[ 
https://issues.apache.org/jira/browse/SPARK-17942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576375#comment-15576375
 ] 

Harish edited comment on SPARK-17942 at 10/14/16 8:20 PM:
--

--conf "spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=600m" -- will 
work. thanks

Please close this.


was (Author: harishk15):
--conf "spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=600m" -- will 
work. thanks

> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> -
>
> Key: SPARK-17942
> URL: https://issues.apache.org/jira/browse/SPARK-17942
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>Priority: Minor
>
> My code snipped is  in below location. In that  snippet i had put only few 
> columns, but in my test case i have data with 10M rows and 10,000 columns.
> http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark
> I see below message in spark 2.0.2 snapshot
> # Stderr of the node
> OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been 
> disabled.
> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> # stdout of the node
> CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb
>  bounds [0x7f32c500, 0x7f32d400, 0x7f32d400]
>  total_blobs=41388 nmethods=40792 adapters=501
>  compilation: disabled (not enough contiguous free space left)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=


[ 
https://issues.apache.org/jira/browse/SPARK-17942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576375#comment-15576375
 ] 

Harish commented on SPARK-17942:


--conf "spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=600m" -- will 
work. thanks

> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> -
>
> Key: SPARK-17942
> URL: https://issues.apache.org/jira/browse/SPARK-17942
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>Priority: Minor
>
> My code snipped is  in below location. In that  snippet i had put only few 
> columns, but in my test case i have data with 10M rows and 10,000 columns.
> http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark
> I see below message in spark 2.0.2 snapshot
> # Stderr of the node
> OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been 
> disabled.
> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> # stdout of the node
> CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb
>  bounds [0x7f32c500, 0x7f32d400, 0x7f32d400]
>  total_blobs=41388 nmethods=40792 adapters=501
>  compilation: disabled (not enough contiguous free space left)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17944) sbin/start-* scripts use of `hostname -f` fail for Solaris

2016-10-14 Thread Erik O'Shaughnessy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576367#comment-15576367
 ] 

Erik O'Shaughnessy commented on SPARK-17944:


I'm sure there are situations where Linux and OS X disagree with how to get 
things done, I was hoping for more inclusivity despite Solaris being a less 
favored development platform.

A simple change might be to add a file similar to {{sbin/spark-config.sh}}, 
call it {{sbin/spark-funcs.sh}}, and populate it with a set of bash functions 
which can attempt OS specific invocations until it finds one that is successful 
or returns an error after exhausting all options. This could provide a 
framework for resolving future conflicts between OS implementation details in 
addition to addressing this particular problem.

> sbin/start-* scripts use of `hostname -f` fail for Solaris 
> ---
>
> Key: SPARK-17944
> URL: https://issues.apache.org/jira/browse/SPARK-17944
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Solaris 10, Solaris 11
>Reporter: Erik O'Shaughnessy
>Priority: Trivial
>
> {{$SPARK_HOME/sbin/start-master.sh}} fails:
> {noformat}
> $ ./start-master.sh 
> usage: hostname [[-t] system_name]
>hostname [-D]
> starting org.apache.spark.deploy.master.Master, logging to 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> failed to launch org.apache.spark.deploy.master.Master:
> --properties-file FILE Path to a custom Spark properties file.
>Default is conf/spark-defaults.conf.
> full log in 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> {noformat}
> I found SPARK-17546 which changed the invocation of hostname in 
> sbin/start-master.sh, sbin/start-slaves.sh and sbin/start-mesos-dispatcher.sh 
> to include the flag {{-f}}, which is not a valid command line option for the 
> Solaris hostname implementation. 
> As a workaround, Solaris users can substitute:
> {noformat}
> `/usr/sbin/check-hostname | awk '{print $NF}'`
> {noformat}
> Admittedly not an obvious fix, but it provides equivalent functionality. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17944) sbin/start-* scripts use of `hostname -f` fail for Solaris

2016-10-14 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576323#comment-15576323
 ] 

Sean Owen commented on SPARK-17944:
---

Yeah, I think Solaris is the odd man out here then. Linux and OS X support 
this. I'm not sure Solaris is something Spark intends to support in general, 
for reasons like this. If there's a simple change that makes it work, OK, but 
otherwise I'd close this.

> sbin/start-* scripts use of `hostname -f` fail for Solaris 
> ---
>
> Key: SPARK-17944
> URL: https://issues.apache.org/jira/browse/SPARK-17944
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Solaris 10, Solaris 11
>Reporter: Erik O'Shaughnessy
>Priority: Trivial
>
> {{$SPARK_HOME/sbin/start-master.sh}} fails:
> {noformat}
> $ ./start-master.sh 
> usage: hostname [[-t] system_name]
>hostname [-D]
> starting org.apache.spark.deploy.master.Master, logging to 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> failed to launch org.apache.spark.deploy.master.Master:
> --properties-file FILE Path to a custom Spark properties file.
>Default is conf/spark-defaults.conf.
> full log in 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> {noformat}
> I found SPARK-17546 which changed the invocation of hostname in 
> sbin/start-master.sh, sbin/start-slaves.sh and sbin/start-mesos-dispatcher.sh 
> to include the flag {{-f}}, which is not a valid command line option for the 
> Solaris hostname implementation. 
> As a workaround, Solaris users can substitute:
> {noformat}
> `/usr/sbin/check-hostname | awk '{print $NF}'`
> {noformat}
> Admittedly not an obvious fix, but it provides equivalent functionality. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

[
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Cody Koeninger updated SPARK-17937:
---
Description:
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost). It's possible to separate
this into offset too small and offset too large, but I'm not sure it matters
for us.

Possible sources of offsets:
# *Earliest* position in log
# *Latest* position in log
# *Fail* and kill the query
# *Checkpoint* position
# *User specified* per topicpartition
# *Kafka commit log*. Currently unsupported. This means users who want to
migrate from existing kafka jobs need to jump through hoops. Even if we never
want to support it, as soon as we take on SPARK-17815 we need to make sure
Kafka commit log state is clearly documented and handled.
# *Timestamp*. Currently unsupported. This could be supported with old,
inaccurate Kafka time api, or upcoming time index
# *X offsets* before or after latest / earliest position. Currently
unsupported. I think the semantics of this are super unclear by comparison
with timestamp, given that Kafka doesn't have a single range of offsets.

Currently allowed pre-query configuration, all "ORs" are exclusive:
# startingOffsets: *earliest* OR *latest* OR *User specified* json per
topicpartition (SPARK-17812)
# failOnDataLoss: true (which implies *Fail* above) OR false (which implies
*Earliest* above) In general, I see no reason this couldn't specify Latest as
an option.

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is *Earliest* or *Latest*, use that. If
startingOffsets is *User specified* perTopicpartition, and the new partition
isn't in the map, *Fail*. Note that this is effectively undistinguishable from
new parititon during query, because partitions may have changed in between
pre-query configuration and query start, but we treat it differently, and users
in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this
case yet. Could use the value of failOnDataLoss, but it's possible people may
want to know at startup that something was wrong, even if they're ok with
earliest for a during-query out of range
#* Offset out of range on executor: seems like it should be *Fail* or
*Earliest*, based on failOnDataLoss. but it looks like this setting is
currently ignored, and the executor will just fail...
# During query
#* New partition: *Earliest*, only. This seems to be by fiat, I see no reason
this can't be configurable.
#* Offset out of range on driver: this _probably_ doesn't happen, because
we're doing explicit seeks to the latest position
#* Offset out of range on executor: ?
# At query restart
#* New partition: *Checkpoint*, fall back to *Earliest*. Again, no reason this
couldn't be configurable fall back to Latest
#* Offset out of range on driver: this _probably_ doesn't happen, because
we're doing explicit seeks to the specified position
#* Offset out of range on executor: ?

I've probably missed something, chime in.

was:
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost). It's possible to separate
this into offset too small and offset too large, but I'm not sure it matters
for us.

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

[
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is *Earliest* or *Latest*, use that. If
startingOffsets is *User specified* perTopicpartition, and the new partition
isn't in the map, *Fail*. Note that this is effectively undistinguishable from
new parititon during query, because partitions may have changed in between
pre-query configuration and query start, but we treat it differently, and users
in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this
case yet. Could use the value of failOnDataLoss, but it's possible people may
want to know at startup that something was wrong, even if they're ok with
earliest for a during-query out of range
#* Offset out of range on executor: seems like it should be*Fail* or
*Earliest*, based on failOnDataLoss. but it looks like this setting is
currently ignored, and the executor will just fail...
# During query
#* New partition: *Earliest*, only. This seems to be by fiat, I see no reason
this can't be configurable.
#* Offset out of range on driver: this _probably_ doesn't happen, because
we're doing explicit seeks to the latest position
#* Offset out of range on executor: ?
# At query restart
#* New partition: *Checkpoint*, fall back to *Earliest*. Again, no reason this
couldn't be configurable fall back to Latest
#* Offset out of range on driver: this _probably_ doesn't happen, because
we're doing explicit seeks to the specified position
#* Offset out of range on executor: ?

I've probably missed something, chime in.

[jira] [Created] (SPARK-17944) sbin/start-* scripts use of `hostname -f` fail for Solaris

2016-10-14 Thread Erik O'Shaughnessy (JIRA)

Erik O'Shaughnessy created SPARK-17944:
--

 Summary: sbin/start-* scripts use of `hostname -f` fail for 
Solaris 
 Key: SPARK-17944
 URL: https://issues.apache.org/jira/browse/SPARK-17944
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1
 Environment: Solaris 10, Solaris 11
Reporter: Erik O'Shaughnessy
Priority: Trivial


{{$SPARK_HOME/sbin/start-master.sh}} fails:

{noformat}
$ ./start-master.sh 
usage: hostname [[-t] system_name]
   hostname [-D]
starting org.apache.spark.deploy.master.Master, logging to 
/home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
failed to launch org.apache.spark.deploy.master.Master:
--properties-file FILE Path to a custom Spark properties file.
   Default is conf/spark-defaults.conf.
full log in 
/home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
{noformat}

I found SPARK-17546 which changed the invocation of hostname in 
sbin/start-master.sh, sbin/start-slaves.sh and sbin/start-mesos-dispatcher.sh 
to include the flag {{-f}}, which is not a valid command line option for the 
Solaris hostname implementation. 

As a workaround, Solaris users can substitute:
{noformat}
`/usr/sbin/check-hostname | awk '{print $NF}'`
{noformat}

Admittedly not an obvious fix, but it provides equivalent functionality. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-10541) Allow ApplicationHistoryProviders to provide their own text when there aren't any complete apps


 [ 
https://issues.apache.org/jira/browse/SPARK-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10541:


Assignee: Apache Spark

> Allow ApplicationHistoryProviders to provide their own text when there aren't 
> any complete apps
> ---
>
> Key: SPARK-10541
> URL: https://issues.apache.org/jira/browse/SPARK-10541
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Minor
>
> The current History Server text when there are no complete apps is hard coded 
> for the FS provider and talks about log directories. As it is static text, it 
> doesn't get any recent information from that FS provider itself, such as the 
> directory it is looking at.
> For other ApplicationHistoryProviders, there will be different reasons
> for the list being empty, including, in the timeline implementation: failure 
> to talk to the server, authentication failure, or simply no events being 
> present.
> Even {{FsHistoryProvider}} could help users by actually printing what the log 
> directory was.
> If a method {{emptyListingText()}} `was supported, each implementation could 
> provide a live explanation of the current state



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-10541) Allow ApplicationHistoryProviders to provide their own text when there aren't any complete apps


[ 
https://issues.apache.org/jira/browse/SPARK-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576264#comment-15576264
 ] 

Apache Spark commented on SPARK-10541:
--

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/15490

> Allow ApplicationHistoryProviders to provide their own text when there aren't 
> any complete apps
> ---
>
> Key: SPARK-10541
> URL: https://issues.apache.org/jira/browse/SPARK-10541
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Steve Loughran
>Priority: Minor
>
> The current History Server text when there are no complete apps is hard coded 
> for the FS provider and talks about log directories. As it is static text, it 
> doesn't get any recent information from that FS provider itself, such as the 
> directory it is looking at.
> For other ApplicationHistoryProviders, there will be different reasons
> for the list being empty, including, in the timeline implementation: failure 
> to talk to the server, authentication failure, or simply no events being 
> present.
> Even {{FsHistoryProvider}} could help users by actually printing what the log 
> directory was.
> If a method {{emptyListingText()}} `was supported, each implementation could 
> provide a live explanation of the current state



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-10541) Allow ApplicationHistoryProviders to provide their own text when there aren't any complete apps


 [ 
https://issues.apache.org/jira/browse/SPARK-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10541:


Assignee: (was: Apache Spark)

> Allow ApplicationHistoryProviders to provide their own text when there aren't 
> any complete apps
> ---
>
> Key: SPARK-10541
> URL: https://issues.apache.org/jira/browse/SPARK-10541
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Steve Loughran
>Priority: Minor
>
> The current History Server text when there are no complete apps is hard coded 
> for the FS provider and talks about log directories. As it is static text, it 
> doesn't get any recent information from that FS provider itself, such as the 
> directory it is looking at.
> For other ApplicationHistoryProviders, there will be different reasons
> for the list being empty, including, in the timeline implementation: failure 
> to talk to the server, authentication failure, or simply no events being 
> present.
> Even {{FsHistoryProvider}} could help users by actually printing what the log 
> directory was.
> If a method {{emptyListingText()}} `was supported, each implementation could 
> provide a live explanation of the current state



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

[
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is *Earliest* or *Latest*, use that. If
startingOffsets is *User specified* perTopicpartition, and the new partition
isn't in the map, *Fail*. Note that this is effectively undistinguishable from
new parititon during query, because partitions may have changed in between
pre-query configuration and query start, but we treat it differently, and users
in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this
case yet. Could use the value of failOnDataLoss, but it's possible people may
want to know at startup that something was wrong, even if they're ok with
earliest for a during-query out of range
#* Offset out of range on executor: *Fail* or *Earliest*, based on
failOnDataLoss.
# During query
#* New partition: *Earliest*, only. This seems to be by fiat, I see no reason
this can't be configurable.
#* Offset out of range on driver: this _probably_ doesn't happen, because
we're doing explicit seeks to the latest position
#* Offset out of range on executor: *Fail* or *Earliest*, based on
failOnDataLoss
# At query restart
#* New partition: *Checkpoint*, fall back to *Earliest*. Again, no reason this
couldn't be configurable fall back to Latest
#* Offset out of range on driver: this _probably_ doesn't happen, because
we're doing explicit seeks to the specified position
#* Offset out of range on executor: *Fail* or *Earliest*, based on
failOnDataLoss

I've probably missed something, chime in.

was:
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

[jira] [Assigned] (SPARK-17863) SELECT distinct does not work if there is a order by clause


 [ 
https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17863:


Assignee: Apache Spark

> SELECT distinct does not work if there is a order by clause
> ---
>
> Key: SPARK-17863
> URL: https://issues.apache.org/jira/browse/SPARK-17863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
>
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by struct.a, struct.b
> {code}
> This query generates
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  1|  2|
> +---+---+
> {code}
> The plan is wrong because the analyze somehow added {{struct#21805}} to the 
> project list, which changes the semantic of the distinct (basically, the 
> query is changed to {{select distinct struct.a, struct.b, struct}} from 
> {{select distinct struct.a, struct.b}}).
> {code}
> == Parsed Logical Plan ==
> 'Sort ['struct.a ASC, 'struct.b ASC], true
> +- 'Distinct
>+- 'Project ['struct.a, 'struct.b]
>   +- 'SubqueryAlias tmp
>  +- 'Union
> :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
> :  +- OneRowRelation$
> +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>+- OneRowRelation$
> == Analyzed Logical Plan ==
> a: int, b: int
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Distinct
>   +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
> struct#21805]
>  +- SubqueryAlias tmp
> +- Union
>:- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
>:  +- OneRowRelation$
>+- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>   +- OneRowRelation$
> == Optimized Logical Plan ==
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
> struct#21805]
>   +- Union
>  :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
>  :  +- OneRowRelation$
>  +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
> +- OneRowRelation$
> == Physical Plan ==
> *Project [a#21819, b#21820]
> +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
>+- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
>   +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
> output=[a#21819, b#21820, struct#21805])
>  +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
> +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 
> functions=[], output=[a#21819, b#21820, struct#21805])
>+- Union
>   :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS 
> struct#21805]
>   :  +- Scan OneRowRelation[]
>   +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS 
> struct#21806]
>  +- Scan OneRowRelation[]
> {code}
> If you use the following query, you will get the correct result
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by a, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17863) SELECT distinct does not work if there is a order by clause


 [ 
https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17863:


Assignee: (was: Apache Spark)

> SELECT distinct does not work if there is a order by clause
> ---
>
> Key: SPARK-17863
> URL: https://issues.apache.org/jira/browse/SPARK-17863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>  Labels: correctness
>
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by struct.a, struct.b
> {code}
> This query generates
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  1|  2|
> +---+---+
> {code}
> The plan is wrong because the analyze somehow added {{struct#21805}} to the 
> project list, which changes the semantic of the distinct (basically, the 
> query is changed to {{select distinct struct.a, struct.b, struct}} from 
> {{select distinct struct.a, struct.b}}).
> {code}
> == Parsed Logical Plan ==
> 'Sort ['struct.a ASC, 'struct.b ASC], true
> +- 'Distinct
>+- 'Project ['struct.a, 'struct.b]
>   +- 'SubqueryAlias tmp
>  +- 'Union
> :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
> :  +- OneRowRelation$
> +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>+- OneRowRelation$
> == Analyzed Logical Plan ==
> a: int, b: int
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Distinct
>   +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
> struct#21805]
>  +- SubqueryAlias tmp
> +- Union
>:- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
>:  +- OneRowRelation$
>+- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>   +- OneRowRelation$
> == Optimized Logical Plan ==
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
> struct#21805]
>   +- Union
>  :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
>  :  +- OneRowRelation$
>  +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
> +- OneRowRelation$
> == Physical Plan ==
> *Project [a#21819, b#21820]
> +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
>+- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
>   +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
> output=[a#21819, b#21820, struct#21805])
>  +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
> +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 
> functions=[], output=[a#21819, b#21820, struct#21805])
>+- Union
>   :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS 
> struct#21805]
>   :  +- Scan OneRowRelation[]
>   +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS 
> struct#21806]
>  +- Scan OneRowRelation[]
> {code}
> If you use the following query, you will get the correct result
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by a, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17863) SELECT distinct does not work if there is a order by clause


[ 
https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576236#comment-15576236
 ] 

Apache Spark commented on SPARK-17863:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/15489

> SELECT distinct does not work if there is a order by clause
> ---
>
> Key: SPARK-17863
> URL: https://issues.apache.org/jira/browse/SPARK-17863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>  Labels: correctness
>
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by struct.a, struct.b
> {code}
> This query generates
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  1|  2|
> +---+---+
> {code}
> The plan is wrong because the analyze somehow added {{struct#21805}} to the 
> project list, which changes the semantic of the distinct (basically, the 
> query is changed to {{select distinct struct.a, struct.b, struct}} from 
> {{select distinct struct.a, struct.b}}).
> {code}
> == Parsed Logical Plan ==
> 'Sort ['struct.a ASC, 'struct.b ASC], true
> +- 'Distinct
>+- 'Project ['struct.a, 'struct.b]
>   +- 'SubqueryAlias tmp
>  +- 'Union
> :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
> :  +- OneRowRelation$
> +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>+- OneRowRelation$
> == Analyzed Logical Plan ==
> a: int, b: int
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Distinct
>   +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
> struct#21805]
>  +- SubqueryAlias tmp
> +- Union
>:- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
>:  +- OneRowRelation$
>+- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>   +- OneRowRelation$
> == Optimized Logical Plan ==
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
> struct#21805]
>   +- Union
>  :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
>  :  +- OneRowRelation$
>  +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
> +- OneRowRelation$
> == Physical Plan ==
> *Project [a#21819, b#21820]
> +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
>+- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
>   +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
> output=[a#21819, b#21820, struct#21805])
>  +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
> +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 
> functions=[], output=[a#21819, b#21820, struct#21805])
>+- Union
>   :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS 
> struct#21805]
>   :  +- Scan OneRowRelation[]
>   +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS 
> struct#21806]
>  +- Scan OneRowRelation[]
> {code}
> If you use the following query, you will get the correct result
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by a, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-17943) Change Memoized to Memorized

2016-10-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17943.

Resolution: Not A Problem

https://en.wikipedia.org/wiki/Memoization

> Change Memoized to Memorized
> 
>
> Key: SPARK-17943
> URL: https://issues.apache.org/jira/browse/SPARK-17943
> Project: Spark
>  Issue Type: Documentation
>Affects Versions: 1.3.0
> Environment: Spark 1.3, 1.4 and more
>Reporter: Sunil Sabat
>Priority: Minor
>
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html
>  has 
> rdd
> public RDD rdd()
> Represents the content of the DataFrame as an RDD of Rows. Note that the RDD 
> is memoized. Once called, it won't change even if you change any query 
> planning related Spark SQL configurations (e.g. spark.sql.shuffle.partitions).
> Returns:
> (undocumented)
> Since:
> 1.3.0
> Here, memoized should be corrected  as "memorized". RDD is memorized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17943) Change Memoized to Memorized

2016-10-14 Thread Sunil Sabat (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Sabat updated SPARK-17943:

Priority: Minor  (was: Major)

> Change Memoized to Memorized
> 
>
> Key: SPARK-17943
> URL: https://issues.apache.org/jira/browse/SPARK-17943
> Project: Spark
>  Issue Type: Documentation
>Affects Versions: 1.3.0
> Environment: Spark 1.3, 1.4 and more
>Reporter: Sunil Sabat
>Priority: Minor
>
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html
>  has 
> rdd
> public RDD rdd()
> Represents the content of the DataFrame as an RDD of Rows. Note that the RDD 
> is memoized. Once called, it won't change even if you change any query 
> planning related Spark SQL configurations (e.g. spark.sql.shuffle.partitions).
> Returns:
> (undocumented)
> Since:
> 1.3.0
> Here, memoized should be corrected  as "memorized". RDD is memorized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-17943) Change Memoized to Memorized

2016-10-14 Thread Sunil Sabat (JIRA)

Sunil Sabat created SPARK-17943:
---

 Summary: Change Memoized to Memorized
 Key: SPARK-17943
 URL: https://issues.apache.org/jira/browse/SPARK-17943
 Project: Spark
  Issue Type: Documentation
Affects Versions: 1.3.0
 Environment: Spark 1.3, 1.4 and more
Reporter: Sunil Sabat


https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html
 has 

rdd
public RDD rdd()
Represents the content of the DataFrame as an RDD of Rows. Note that the RDD is 
memoized. Once called, it won't change even if you change any query planning 
related Spark SQL configurations (e.g. spark.sql.shuffle.partitions).
Returns:
(undocumented)
Since:
1.3.0

Here, memoized should be corrected  as "memorized". RDD is memorized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

[
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Cody Koeninger updated SPARK-17937:
---
Description:
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

Possible lifecycle times in which an offset-related event may happen:
# At initial query start
#* New partition: if startingOffsets is *Earliest* or *Latest*, use that. If
startingOffsets is *User specified* perTopicpartition, and the new partition
isn't in the map, *Fail*. Note that this is effectively undistinguishable from
new parititon during query, because partitions may have changed in between
pre-query configuration and query start, but we treat it differently, and users
in this case are SOL
#* Offset out of range on driver: We don't technically have behavior for this
case yet. Could use the value of failOnDataLoss, but it's possible people may
want to know at startup that something was wrong, even if they're ok with
earliest for a during-query out of range
#* Offset out of range on executor: *Fail* or *Earliest*, based on
failOnDataLoss.
# During query
#* New partition: *Earliest*, only. This seems to be by fiat, I see no reason
this can't be configurable.
#* Offset out of range on driver: this _probably_ doesn't happen, because
we're doing explicit seeks to the latest position
#* Offset out of range on executor: *Fail* or *Earliest*, based on
failOnDataLoss
# At query restart
#* New partition: *Checkpoint*, fall back to *Earliest*. Again, no reason this
couldn't be configurable fall back to Latest
#* Offset out of range on driver: this _probably_ doesn't happen, because
we're doing explicit seeks to the specified position
#* Offset out of range on executor: *Fail* or *Earliest*, based on
failOnDataLoss

I've probably missed something, chime in.

was:
Possible events for which offsets are needed:
# New partition is discovered
# Offset out of range (aka, data has been lost)

[jira] [Comment Edited] (SPARK-17709) spark 2.0 join - column resolution error


[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576141#comment-15576141
 ] 

Ashish Shrowty edited comment on SPARK-17709 at 10/14/16 6:41 PM:
--

There is a slight difference, in my case the IDs generated are the same for 
e.g. companyid#121 in 
both aggregates, whereas in your plan the ids are difference companyid#5 and 
companyid#46. This is probably causing the resolution error?

I will try in the 2.0.1 branch later today.


was (Author: ashrowty):
There is a slight difference, in my case the IDs generated are the same for 
e.g. companyid#121 in both aggregates, whereas in your plan the ids are 
difference companyid#5 and companyid#46. This is probably causing the 
resolution error?

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Issue Comment Deleted] (SPARK-17709) spark 2.0 join - column resolution error


 [ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish Shrowty updated SPARK-17709:
---
Comment: was deleted

(was: There is a slight difference .. in my case its companyid#121 in both 
relations whereas in yours its different. Perhaps that is causing the 
resolution error?)

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error


[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576143#comment-15576143
 ] 

Ashish Shrowty commented on SPARK-17709:


There is a slight difference .. in my case its companyid#121 in both relations 
whereas in yours its different. Perhaps that is causing the resolution error?

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error


[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576141#comment-15576141
 ] 

Ashish Shrowty commented on SPARK-17709:


There is a slight difference, in my case the IDs generated are the same for 
e.g. companyid#121 in both aggregates, whereas in your plan the ids are 
difference companyid#5 and companyid#46. This is probably causing the 
resolution error?

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17863) SELECT distinct does not work if there is a order by clause


[ 
https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576133#comment-15576133
 ] 

Yin Huai commented on SPARK-17863:
--

Seems it is introduced by https://github.com/apache/spark/pull/11153/files. 
Let's see if we can actually fix it. Another option is to make it throw an 
exception and the error message provides the instruction on how to rewrite the 
query.

> SELECT distinct does not work if there is a order by clause
> ---
>
> Key: SPARK-17863
> URL: https://issues.apache.org/jira/browse/SPARK-17863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>  Labels: correctness
>
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by struct.a, struct.b
> {code}
> This query generates
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  1|  2|
> +---+---+
> {code}
> The plan is wrong because the analyze somehow added {{struct#21805}} to the 
> project list, which changes the semantic of the distinct (basically, the 
> query is changed to {{select distinct struct.a, struct.b, struct}} from 
> {{select distinct struct.a, struct.b}}).
> {code}
> == Parsed Logical Plan ==
> 'Sort ['struct.a ASC, 'struct.b ASC], true
> +- 'Distinct
>+- 'Project ['struct.a, 'struct.b]
>   +- 'SubqueryAlias tmp
>  +- 'Union
> :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
> :  +- OneRowRelation$
> +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>+- OneRowRelation$
> == Analyzed Logical Plan ==
> a: int, b: int
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Distinct
>   +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
> struct#21805]
>  +- SubqueryAlias tmp
> +- Union
>:- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
>:  +- OneRowRelation$
>+- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>   +- OneRowRelation$
> == Optimized Logical Plan ==
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
> struct#21805]
>   +- Union
>  :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
>  :  +- OneRowRelation$
>  +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
> +- OneRowRelation$
> == Physical Plan ==
> *Project [a#21819, b#21820]
> +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
>+- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
>   +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
> output=[a#21819, b#21820, struct#21805])
>  +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
> +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 
> functions=[], output=[a#21819, b#21820, struct#21805])
>+- Union
>   :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS 
> struct#21805]
>   :  +- Scan OneRowRelation[]
>   +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS 
> struct#21806]
>  +- Scan OneRowRelation[]
> {code}
> If you use the following query, you will get the correct result
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by a, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17863) SELECT distinct does not work if there is a order by clause


[ 
https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576135#comment-15576135
 ] 

Yin Huai commented on SPARK-17863:
--

cc [~davies]

> SELECT distinct does not work if there is a order by clause
> ---
>
> Key: SPARK-17863
> URL: https://issues.apache.org/jira/browse/SPARK-17863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>  Labels: correctness
>
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by struct.a, struct.b
> {code}
> This query generates
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> |  1|  2|
> +---+---+
> {code}
> The plan is wrong because the analyze somehow added {{struct#21805}} to the 
> project list, which changes the semantic of the distinct (basically, the 
> query is changed to {{select distinct struct.a, struct.b, struct}} from 
> {{select distinct struct.a, struct.b}}).
> {code}
> == Parsed Logical Plan ==
> 'Sort ['struct.a ASC, 'struct.b ASC], true
> +- 'Distinct
>+- 'Project ['struct.a, 'struct.b]
>   +- 'SubqueryAlias tmp
>  +- 'Union
> :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
> :  +- OneRowRelation$
> +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>+- OneRowRelation$
> == Analyzed Logical Plan ==
> a: int, b: int
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Distinct
>   +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
> struct#21805]
>  +- SubqueryAlias tmp
> +- Union
>:- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
>:  +- OneRowRelation$
>+- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
>   +- OneRowRelation$
> == Optimized Logical Plan ==
> Project [a#21819, b#21820]
> +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
>+- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
> struct#21805]
>   +- Union
>  :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
>  :  +- OneRowRelation$
>  +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
> +- OneRowRelation$
> == Physical Plan ==
> *Project [a#21819, b#21820]
> +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
>+- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
>   +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
> output=[a#21819, b#21820, struct#21805])
>  +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
> +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 
> functions=[], output=[a#21819, b#21820, struct#21805])
>+- Union
>   :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS 
> struct#21805]
>   :  +- Scan OneRowRelation[]
>   +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS 
> struct#21806]
>  +- Scan OneRowRelation[]
> {code}
> If you use the following query, you will get the correct result
> {code}
> select distinct struct.a, struct.b
> from (
>   select named_struct('a', 1, 'b', 2, 'c', 3) as struct
>   union all
>   select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
> order by a, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17863) SELECT distinct does not work if there is a order by clause


 [ 
https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17863:
-
Description: 
{code}
select distinct struct.a, struct.b
from (
  select named_struct('a', 1, 'b', 2, 'c', 3) as struct
  union all
  select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
order by struct.a, struct.b
{code}
This query generates
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  2|
+---+---+
{code}
The plan is wrong because the analyze somehow added {{struct#21805}} to the 
project list, which changes the semantic of the distinct (basically, the query 
is changed to {{select distinct struct.a, struct.b, struct}} from {{select 
distinct struct.a, struct.b}}).
{code}
== Parsed Logical Plan ==
'Sort ['struct.a ASC, 'struct.b ASC], true
+- 'Distinct
   +- 'Project ['struct.a, 'struct.b]
  +- 'SubqueryAlias tmp
 +- 'Union
:- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
:  +- OneRowRelation$
+- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
   +- OneRowRelation$

== Analyzed Logical Plan ==
a: int, b: int
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Distinct
  +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
struct#21805]
 +- SubqueryAlias tmp
+- Union
   :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
   :  +- OneRowRelation$
   +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
  +- OneRowRelation$

== Optimized Logical Plan ==
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
struct#21805]
  +- Union
 :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
 :  +- OneRowRelation$
 +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
+- OneRowRelation$

== Physical Plan ==
*Project [a#21819, b#21820]
+- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
   +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
  +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
output=[a#21819, b#21820, struct#21805])
 +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
+- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 
functions=[], output=[a#21819, b#21820, struct#21805])
   +- Union
  :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS 
struct#21805]
  :  +- Scan OneRowRelation[]
  +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS 
struct#21806]
 +- Scan OneRowRelation[]
{code}

If you use the following query, you will get the correct result
{code}
select distinct struct.a, struct.b
from (
  select named_struct('a', 1, 'b', 2, 'c', 3) as struct
  union all
  select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
order by a, b
{code}

  was:
{code}
select distinct struct.a, struct.b
from (
  select named_struct('a', 1, 'b', 2, 'c', 3) as struct
  union all
  select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
order by struct.a, struct.b
{code}
This query generates
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  2|
+---+---+
{code}
The plan is wrong because the analyze somehow added {{struct#21805}} to the 
project list, which changes the semantic of the distinct (basically, the query 
is changed to {{select distinct struct.a, struct.b, struct}} from {{select 
distinct struct.a, struct.b}}).
{code}
== Parsed Logical Plan ==
'Sort ['struct.a ASC, 'struct.b ASC], true
+- 'Distinct
   +- 'Project ['struct.a, 'struct.b]
  +- 'SubqueryAlias tmp
 +- 'Union
:- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
:  +- OneRowRelation$
+- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
   +- OneRowRelation$

== Analyzed Logical Plan ==
a: int, b: int
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Distinct
  +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
struct#21805]
 +- SubqueryAlias tmp
+- Union
   :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
   :  +- OneRowRelation$
   +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
  +- OneRowRelation$

== Optimized Logical Plan ==
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
struct#21805]
  +- Union
 :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
 :  +- OneRowRelation$
 +- Project [1 AS a#21819, 2 AS b#21820,

[jira] [Updated] (SPARK-17884) In the cast expression, casting from empty string to interval type throws NullPointerException


 [ 
https://issues.apache.org/jira/browse/SPARK-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17884:

Fix Version/s: 1.6.3

> In the cast expression, casting from empty string to interval type throws 
> NullPointerException
> --
>
> Key: SPARK-17884
> URL: https://issues.apache.org/jira/browse/SPARK-17884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Priyanka Garg
>Assignee: Priyanka Garg
> Fix For: 1.6.3, 2.0.2, 2.1.0
>
>
> When the cast expression is applied on empty string "" to cast it to interval 
> type it throws Null pointer exception..
> Getting the same exception when I tried reproducing the same through test case
> checkEvaluation(Cast(Literal(""), CalendarIntervalType), null)
> Exception i am getting is:
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:254)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper$class.checkEvalutionWithUnsafeProjection(ExpressionEvalHelper.scala:181)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite.checkEvalutionWithUnsafeProjection(CastSuite.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper$class.checkEvaluation(ExpressionEvalHelper.scala:64)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite.checkEvaluation(CastSuite.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply$mcV$sp(CastSuite.scala:770)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply(CastSuite.scala:767)
>   at 
> org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply(CastSuite.scala:767)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:29)
>   at

[jira] [Updated] (SPARK-17709) spark 2.0 join - column resolution error


 [ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17709:

Priority: Critical  (was: Major)

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17709) spark 2.0 join - column resolution error


 [ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17709:

Component/s: SQL

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17709) spark 2.0 join - column resolution error


 [ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17709:

Labels:   (was: easyfix)

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error


[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575999#comment-15575999
 ] 

Xiao Li commented on SPARK-17709:
-

Below is the statements I used to recreate the problem

{noformat}
sql("CREATE TABLE testext2(companyid int, productid int, price int, count 
int) using parquet")
sql("insert into testext2 values (1, 1, 1, 1)")
val d1 = spark.sql("select * from testext2")
val df1 = d1.groupBy("companyid","productid").agg(sum("price").as("price"))
val df2 = d1.groupBy("companyid","productid").agg(sum("count").as("count"))
df1.join(df2, Seq("companyid", "productid")).show
{noformat}

Can you try it?

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error


[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575995#comment-15575995
 ] 

Xiao Li commented on SPARK-17709:
-

Still works well in 2.0.1

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-14 Thread Justin Miller (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575945#comment-15575945
 ] 

Justin Miller commented on SPARK-17936:
---

Hey Sean,

I did a bit more digging this morning looking at SpecificUnsafeProjection and 
saw this commit: 
https://github.com/apache/spark/commit/b1b47274bfeba17a9e4e9acebd7385289f31f6c8

I thought I'd try running w/2.1.0-SNAPSHOT and see how things went and it 
appears to work great now!

[Stage 1:> (0 + 8) / 8]11:28:33.237 INFO  c.p.o.ObservationPersister - 
(ObservationPersister) - Thrift Parse Success: 0 / Thrift Parse Errors: 0
[Stage 3:> (0 + 8) / 8]11:29:03.236 INFO  c.p.o.ObservationPersister - 
(ObservationPersister) - Thrift Parse Success: 89 / Thrift Parse Errors: 0
[Stage 5:> (4 + 4) / 8]11:29:33.237 INFO  c.p.o.ObservationPersister - 
(ObservationPersister) - Thrift Parse Success: 205 / Thrift Parse Errors: 0

Since we're still testing this out that snapshot works great for now. Do you 
know when 2.1.0 might be available generally?

Best,
Justin


> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>
> Greetings. I'm currently in the process of migrating a project I'm working on 
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
> structs coming from Kafka into Parquet files stored in S3. This conversion 
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll 
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>   at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>   at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I 
> have a hunch that this is happening at write time as the error occurs after 
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
> case Success(row) =>
>   thriftParseSuccess += 1
>   Some(row)
> case Failure(ex) =>
>   thriftParseErrors += 1
>   logger.error("Error during deserialization: ", ex)
>   None
>   }.foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.context)
> transformer(sqlContext.createDataFrame(rdd, converter.schema))
>   .coalesce(coalesceSize)
>   .write
>   .mode(Append)
>   .partitionBy(partitioning: _*)
>   .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can 
> do to help.
> Best,
> Justin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error


[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575938#comment-15575938
 ] 

Xiao Li commented on SPARK-17709:
-

I can get an exactly same plan in the master branch, but my job can pass. 

{noformat}
'Join UsingJoin(Inner,List('companyid, 'productid)) 
   
:- Aggregate [companyid#5, productid#6], [companyid#5, productid#6, 
sum(cast(price#7 as bigint)) AS price#30L] 
:  +- Project [companyid#5, productid#6, price#7, count#8]  
   
: +- SubqueryAlias testext2 
   
:+- Relation[companyid#5,productid#6,price#7,count#8] parquet   
   
+- Aggregate [companyid#46, productid#47], [companyid#46, productid#47, 
sum(cast(count#49 as bigint)) AS count#41L]
   +- Project [companyid#46, productid#47, price#48, count#49]  
   
  +- SubqueryAlias testext2 
   
 +- Relation[companyid#46,productid#47,price#48,count#49] parquet  
{noformat}

The only difference is yours does not trigger deduplication of expression ids. 
Let me try it in the 2.0.1 branch. 

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17606) New batches are not created when there are 1000 created after restarting streaming from checkpoint.

2016-10-14 Thread etienne (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575927#comment-15575927
 ] 

etienne commented on SPARK-17606:
-

I'm not able to reproduce in local mode. either because the JobGenerator 
managed to restart either because the streaming resulted to a OOM before the 
restarting of the JobGenerator.

After heapdump it appears the memory is full of MapPartitionRDD

I give you here the result of my tries.
batch interval : 500ms , real batch duration > 2s (I had to reduce the batch 
interval to generate batches faster)

I have proceeded simply as : Start the streaming without checkpoint wait until 
checkpoint, stop it, wait a period and restart from the checkpoint.

||stoping time||restarting||batch during down time|| batch pending|| batch to 
reschedule|| starting of JobGenerator || last time* ||
|14:23:01|14:33:29|1320 - [14:22:30-14:33:29.500]|88 - 
[14:22:00-14:22:44]|1379-[14:22:00.500-14:33:29.500]|14:55:00 for time 
14:33:30|14:25:11|
|15:08:25|15:31:20|2777 - [15:08:18.500-15:31:26.500]|22 - 
[15:08:11-15:08:21.500]|2792-[15:08:11-15:31:26.500]|OOM| |
|09:30:01|09:47:01|2338 - [09:27:33-09:47:01.500]|298 - 
[09:26:03-09:28:31.500]|2518 - [09:26:03-09:47:01.500]|OOM| |
|12:52:49|12:46:01|1838- [12:49:11.500-13:04:30]|116 - 
[12:49:18.500-12:50:16]|1838- [12:45:11.500-13:04:30]|OOM| |



\* last time to reschedule found in log that was executed before the restarting 
of job generator (strangely there 8 minutes are not missing in UI)

All these OOM make me think there is something that is not cleaned correctly.

The JobGenerator is not started directly after the beginning (I have looked 
into the src and I didn't find what is blocking) and may induce a lag in batch 
generation. 


> New batches are not created when there are 1000 created after restarting 
> streaming from checkpoint.
> ---
>
> Key: SPARK-17606
> URL: https://issues.apache.org/jira/browse/SPARK-17606
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: etienne
>
> When spark restarts from a checkpoint after being down for a while.
> It recreates missing batch since the down time.
> When there are few missing batches, spark creates new incoming batch every 
> batchTime, but when there is enough missing time to create 1000 batches no 
> new batch is created.
> So when all these batch are completed the stream is idle ...
> I think there is a rigid limit set somewhere.
> I was expecting that spark continue to recreate missed batches, maybe not all 
> at once ( because it's look like it's cause driver memory problem ), and then 
> recreate batches each batchTime.
> Another solution would be to not create missing batches but still restart the 
> direct input.
> Right know for me the only solution to restart a stream after a long break it 
> to remove the checkpoint to allow the creation of a new stream. But losing 
> all my states.
> ps : I'm speaking about direct Kafka input because it's the source I'm 
> currently using, I don't know what happens with other sources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2016-10-14 Thread Guo-Xun Yuan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575865#comment-15575865
 ] 

Guo-Xun Yuan commented on SPARK-12664:
--

Thank you, [~yanboliang]! So, just to confirm, will your PR just cover a method 
that exposes raw prediction scores in MultilayerPerceptronClassificationModel? 
Or it will also cover the fix where MultilayerPerceptronClassificationModel is 
derived from ClassificationModel.

Thanks!

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Yanbo Liang
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mplModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2016-10-14 Thread Thomas Dunne (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575825#comment-15575825
 ] 

Thomas Dunne edited comment on SPARK-13802 at 10/14/16 4:45 PM:


This is especially troublesome when combined with creating a DataFrame, while 
using your own schema.

The data I am working on can contain a lot of empty fields, which makes the 
schema inference potentially have to scan every row to determine their type. 
Providing our own schema should fix this, right?

Nope... Rather than matching up the keys of the Row, with the field names of 
the provided schema, lets just change the order of one (the Row), and naively 
use zip(row, schema.fields). This means that even keeping both schema field 
order, and Row key value is not enough, due to Rows sorting keys, we need to 
manually sort schema fields too.

Doesn't seem consistent or desirable behavior at all.


was (Author: thomas9):
This is especially troublesome when combined with creating a DataFrame, while 
using your own schema.

The data I am working on can contain a lot of empty fields, which makes the 
schema inference potentially have to scan every row to determine their type. 
Providing our own schema should fix this, right?

Nope... Rather than matching up the keys of the Row, with the field names of 
the provided schema, lets just change the order of one (the Row), and naively 
use zip(row, schema.fields). This means that even keeping both schema field 
order, and Row key value is not enough, due to Rows sorting keys, we need to 
manually sort schema fields too.

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +--+--+
> |id|first_name|
> +--+--+
> |Szymon|39|
> +--+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2016-10-14 Thread Thomas Dunne (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575825#comment-15575825
 ] 

Thomas Dunne commented on SPARK-13802:
--

This is especially troublesome when combined with creating a DataFrame, while 
using your own schema.

The data I am working on can contain a lot of empty fields, which makes the 
schema inference potentially have to scan every row to determine their type. 
Providing our own schema should fix this, right?

Nope... Rather than matching up the keys of the Row, with the field names of 
the provided schema, lets just change the order of one (the Row), and naively 
use zip(row, schema.fields). This means that even keeping both schema field 
order, and Row key value is not enough, due to Rows sorting keys, we need to 
manually sort schema fields too.

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +--+--+
> |id|first_name|
> +--+--+
> |Szymon|39|
> +--+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17940) Typo in LAST function error message

2016-10-14 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17940:
--
Priority: Trivial  (was: Minor)

> Typo in LAST function error message
> ---
>
> Key: SPARK-17940
> URL: https://issues.apache.org/jira/browse/SPARK-17940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shuai Lin
>Priority: Trivial
>
> https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40
> {code}
>   throw new AnalysisException("The second argument of First should be a 
> boolean literal.")
> {code} 
> "First" should be "Last".
> Also the usage string can be improved to match the FIRST function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=

2016-10-14 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575759#comment-15575759
 ] 

Sean Owen commented on SPARK-17942:
---

You probably need to increase this value -- is there more to it?

> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> -
>
> Key: SPARK-17942
> URL: https://issues.apache.org/jira/browse/SPARK-17942
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>Priority: Minor
>
> My code snipped is  in below location. In that  snippet i had put only few 
> columns, but in my test case i have data with 10M rows and 10,000 columns.
> http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark
> I see below message in spark 2.0.2 snapshot
> # Stderr of the node
> OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been 
> disabled.
> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> # stdout of the node
> CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb
>  bounds [0x7f32c500, 0x7f32d400, 0x7f32d400]
>  total_blobs=41388 nmethods=40792 adapters=501
>  compilation: disabled (not enough contiguous free space left)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=


 [ 
https://issues.apache.org/jira/browse/SPARK-17942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harish updated SPARK-17942:
---
Priority: Minor  (was: Major)

> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> -
>
> Key: SPARK-17942
> URL: https://issues.apache.org/jira/browse/SPARK-17942
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Harish
>Priority: Minor
>
> My code snipped is  in below location. In that  snippet i had put only few 
> columns, but in my test case i have data with 10M rows and 10,000 columns.
> http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark
> I see below message in spark 2.0.2 snapshot
> # Stderr of the node
> OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been 
> disabled.
> OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
> -XX:ReservedCodeCacheSize=
> # stdout of the node
> CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb
>  bounds [0x7f32c500, 0x7f32d400, 0x7f32d400]
>  total_blobs=41388 nmethods=40792 adapters=501
>  compilation: disabled (not enough contiguous free space left)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-17942) OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=

Harish created SPARK-17942:
--

 Summary: OpenJDK 64-Bit Server VM warning: Try increasing the code 
cache size using -XX:ReservedCodeCacheSize=
 Key: SPARK-17942
 URL: https://issues.apache.org/jira/browse/SPARK-17942
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.1
Reporter: Harish


My code snipped is  in below location. In that  snippet i had put only few 
columns, but in my test case i have data with 10M rows and 10,000 columns.
http://stackoverflow.com/questions/39602596/convert-groupbykey-to-reducebykey-pyspark

I see below message in spark 2.0.2 snapshot
# Stderr of the node
OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using 
-XX:ReservedCodeCacheSize=

# stdout of the node
CodeCache: size=245760Kb used=242680Kb max_used=242689Kb free=3079Kb
 bounds [0x7f32c500, 0x7f32d400, 0x7f32d400]
 total_blobs=41388 nmethods=40792 adapters=501
 compilation: disabled (not enough contiguous free space left)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet


 [ 
https://issues.apache.org/jira/browse/SPARK-17941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17941:


Assignee: (was: Apache Spark)

> Logistic regression test suites should use weights when comparing to glmnet
> ---
>
> Key: SPARK-17941
> URL: https://issues.apache.org/jira/browse/SPARK-17941
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Logistic regression suite currently has many test cases comparing to R's 
> glmnet. Both libraries support weights, and to make the testing of weights in 
> Spark LOR more robust, we should add weights to all the test cases. The 
> current weight testing is quite minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet


[ 
https://issues.apache.org/jira/browse/SPARK-17941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575716#comment-15575716
 ] 

Apache Spark commented on SPARK-17941:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/15488

> Logistic regression test suites should use weights when comparing to glmnet
> ---
>
> Key: SPARK-17941
> URL: https://issues.apache.org/jira/browse/SPARK-17941
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Logistic regression suite currently has many test cases comparing to R's 
> glmnet. Both libraries support weights, and to make the testing of weights in 
> Spark LOR more robust, we should add weights to all the test cases. The 
> current weight testing is quite minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-17941) Logistic regression test suites should use weights when comparing to glmnet