[jira] [Resolved] (SPARK-17672) Spark 2.0 history server web UI takes too long for a single application
[ https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-17672.
---
Resolution: Fixed
Fix Version/s: 2.0.1
Target Version/s: 2.0.1

> Spark 2.0 history server web UI takes too long for a single application
> -----------------------------------------------------------------------
>
> Key: SPARK-17672
> URL: https://issues.apache.org/jira/browse/SPARK-17672
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Reporter: Gang Wu
> Fix For: 2.0.1
>
> When there are 10K application histories in the history server back end, it can
> take a very long time to load even a single application history page. After
> some investigation, I found the root cause was the following piece of code:
> {code:title=OneApplicationResource.scala|borderStyle=solid}
> @Produces(Array(MediaType.APPLICATION_JSON))
> private[v1] class OneApplicationResource(uiRoot: UIRoot) {
>
>   @GET
>   def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
>     val apps = uiRoot.getApplicationInfoList.find { _.id == appId }
>     apps.getOrElse(throw new NotFoundException("unknown app: " + appId))
>   }
>
> }
> {code}
> Although all application history infos are stored in a LinkedHashMap, the
> code here transforms the map into an iterator and then uses the find() API,
> which is O(n), instead of the O(1) map.get() operation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
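The O(n)-vs-O(1) lookup difference described in the report can be sketched outside Spark. This is illustrative Python, not the Spark code itself; the names mimic the Scala snippet above:

```python
# Sketch of the lookup-cost difference: scanning all entries to find one id
# is O(n) per request; a direct dict/map lookup is O(1). With 10K histories
# this is the difference the report describes.

apps = {f"app-{i}": {"id": f"app-{i}", "name": f"job {i}"} for i in range(10_000)}

def find_linear(app_id):
    # Mimics uiRoot.getApplicationInfoList.find { _.id == appId }:
    # walks every entry until the id matches.
    return next((info for info in apps.values() if info["id"] == app_id), None)

def find_direct(app_id):
    # Mimics map.get(appId) on the LinkedHashMap: one hash lookup.
    return apps.get(app_id)

# Same result, very different cost per page load.
assert find_linear("app-9999") == find_direct("app-9999")
```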
[jira] [Resolved] (SPARK-17648) TaskSchedulerImpl.resourceOffers should take an IndexedSeq, not a Seq
[ https://issues.apache.org/jira/browse/SPARK-17648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-17648.
---
Resolution: Fixed
Fix Version/s: 2.1.0
Target Version/s: 2.1.0

> TaskSchedulerImpl.resourceOffers should take an IndexedSeq, not a Seq
> ---------------------------------------------------------------------
>
> Key: SPARK-17648
> URL: https://issues.apache.org/jira/browse/SPARK-17648
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Affects Versions: 2.0.0
> Reporter: Imran Rashid
> Assignee: Imran Rashid
> Priority: Minor
> Fix For: 2.1.0
>
> {{TaskSchedulerImpl.resourceOffers}} takes in a {{Seq[WorkerOffer]}}.
> However, later on it indexes into this by position. If you don't pass in an
> {{IndexedSeq}}, this turns an O(n) operation into an O(n^2) operation.
> In practice this isn't an issue, since, just by chance, in the important
> places this is called the data structures happen to already be
> {{IndexedSeq}}s. But we ought to tighten up the types to make this clearer.
> I ran into this while doing some performance tests on the scheduler:
> performance was terrible when I passed in a {{Seq}}, and even a few hundred
> offers were scheduled very slowly.
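The cost difference behind the ticket can be illustrated with Python stand-ins (this is not Spark code): positional access into a linked list costs O(i) per lookup, so indexing every element in a loop becomes O(n^2), while an array-backed sequence gives O(1) access:

```python
# Illustrative analog of Seq vs IndexedSeq indexing. A cons/linked list
# (like Scala's default List) must walk i nodes to reach index i; an
# array-backed sequence (like IndexedSeq) jumps there directly.

class Node:
    def __init__(self, value, nxt=None):
        self.value, self.next = value, nxt

def nth(head, i):
    # O(i) walk: the hidden cost of indexing a non-indexed Seq by position.
    while i > 0:
        head, i = head.next, i - 1
    return head.value

offers = list(range(1000))        # IndexedSeq analog: O(1) positional access
head = None
for v in reversed(offers):        # build the linked-list (Seq) analog
    head = Node(v, head)

# Same answer either way; doing this for every position is where O(n)
# silently becomes O(n^2).
assert nth(head, 500) == offers[500]
```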
[jira] [Resolved] (SPARK-17623) Failed tasks end reason is always a TaskFailedReason, types should reflect this
[ https://issues.apache.org/jira/browse/SPARK-17623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-17623.
---
Resolution: Fixed
Fix Version/s: 2.1.0

> Failed tasks end reason is always a TaskFailedReason, types should reflect
> this
> --------------------------------------------------------------------------
>
> Key: SPARK-17623
> URL: https://issues.apache.org/jira/browse/SPARK-17623
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Affects Versions: 2.0.0
> Reporter: Imran Rashid
> Assignee: Imran Rashid
> Priority: Minor
> Fix For: 2.1.0
>
> Minor code cleanup. In TaskResultGetter, enqueueFailedTask currently
> deserializes the result as a TaskEndReason, but the type is actually more
> specific: it's a TaskFailedReason. This just leads to more blind casting
> later on. It would be clearer if the message were cast to the right type
> immediately, so method parameter types could be tightened.
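The cleanup idea generalizes: narrow the type once at the boundary so downstream signatures can demand the specific type and drop the blind casts. A hedged Python sketch (illustrative class names mirroring the ticket, not the actual Spark code):

```python
# Narrow once at the entry point; everything after works with the specific
# type, so parameter types can be tightened and blind casts disappear.

class TaskEndReason: ...
class TaskFailedReason(TaskEndReason):
    def __init__(self, message):
        self.message = message

def handle_failure(reason: TaskFailedReason) -> str:
    # Tightened parameter type: no cast needed here.
    return reason.message

def enqueue_failed_task(result: TaskEndReason) -> str:
    # The one place we check/narrow; a failed task's reason is always
    # the more specific TaskFailedReason.
    if not isinstance(result, TaskFailedReason):
        raise TypeError("failed tasks must carry a TaskFailedReason")
    return handle_failure(result)
```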
[jira] [Resolved] (SPARK-17438) Master UI should show the correct core limit when `ApplicationInfo.executorLimit` is set
[ https://issues.apache.org/jira/browse/SPARK-17438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-17438.
---
Resolution: Fixed
Fix Version/s: 2.1.0, 2.0.1

> Master UI should show the correct core limit when
> `ApplicationInfo.executorLimit` is set
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-17438
> URL: https://issues.apache.org/jira/browse/SPARK-17438
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Fix For: 2.0.1, 2.1.0
>
> The core info of an application in the Master UI doesn't take
> `ApplicationInfo.executorLimit` into account. It's pretty confusing that the
> UI says "Unlimited" when `executorLimit` is set.
[jira] [Commented] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494591#comment-15494591 ] Andrew Ray commented on SPARK-17458:
[~hvanhovell]: My JIRA username is a1ray.

> Alias specified for aggregates in a pivot are not honored
> ---------------------------------------------------------
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Ravi Somepalli
> Assignee: Herman van Hovell
> Fix For: 2.1.0
>
> When using pivot with multiple aggregations we need an alias to avoid special
> characters, but the alias does not help, because
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> produces:
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small| 5.5| two| 2.3335| two|
> |large| 5.5| two| 2.0| one|
> Expected output:
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small| 5.5| two| 2.3335| two|
> |large| 5.5| two| 2.0| one|
> One approach to fix this issue is to change
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
> and change the outputName method in
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> to:
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>     case n: NamedExpression => n.name
>     case _ => aggregate.sql
>   }
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version 2.0.0 currently has:
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}
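The naming change proposed in the report can be sketched in plain Python (illustrative only; `output_name` and its parameters mirror the Scala `outputName` snippet above and are not a real API): prefer the aggregate's alias, if it has one, over its raw SQL text when building pivot column names.

```python
# Hedged sketch of the proposed outputName fix: use the alias of a named
# aggregate as the suffix; fall back to the aggregate's SQL text otherwise.

def output_name(value, agg_name=None, agg_sql=None, single_agg=False):
    # agg_name: the alias when the aggregate is a NamedExpression, else None
    # agg_sql:  the raw SQL text of the aggregate (the 2.0.0 behavior)
    suffix = agg_name if agg_name is not None else agg_sql
    return str(value) if single_agg else f"{value}_{suffix}"

# With the fix, columns read bar_COLD instead of bar_avg(`D`) AS `COLD`.
assert output_name("bar", agg_name="COLD", agg_sql="avg(`D`) AS `COLD`") == "bar_COLD"
# Unaliased aggregates keep the old SQL-text naming.
assert output_name("bar", agg_sql="avg(`D`)") == "bar_avg(`D`)"
```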
[jira] [Issue Comment Deleted] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ray updated SPARK-17458:
---
Comment: was deleted (was: [~hvanhovell] It's a1ray)

> Alias specified for aggregates in a pivot are not honored
> ---------------------------------------------------------
[jira] [Comment Edited] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494361#comment-15494361 ] Andrew Ray edited comment on SPARK-17458 at 9/15/16 8:09 PM:
[~hvanhovell] It's a1ray

was (Author: a1ray): It's a1ray

> Alias specified for aggregates in a pivot are not honored
> ---------------------------------------------------------
[jira] [Commented] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494361#comment-15494361 ] Andrew Ray commented on SPARK-17458:
It's a1ray

> Alias specified for aggregates in a pivot are not honored
> ---------------------------------------------------------
[jira] [Commented] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property
[ https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15490663#comment-15490663 ] Andrew Or commented on SPARK-15917:
---
By the way, is there a pull request?

> Define the number of executors in standalone mode with an easy-to-use property
> ------------------------------------------------------------------------------
>
> Key: SPARK-15917
> URL: https://issues.apache.org/jira/browse/SPARK-15917
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, Spark Shell, Spark Submit
> Affects Versions: 1.6.1
> Reporter: Jonathan Taws
> Priority: Minor
>
> After stumbling across a few StackOverflow posts around the issue of using a
> fixed number of executors in standalone mode (non-YARN), I was wondering if
> we could not add an easier way to set this parameter than having to resort to
> calculations based on the number of cores and the memory available on your
> worker.
> For example, let's say I have 8 cores and 30GB of memory available:
> - If no option is passed, one executor will be spawned with 8 cores and 1GB
> of memory allocated.
> - However, if I want to have only *2* executors, each using 2 cores and 10GB
> of memory, I will end up with *3* executors (as the available memory will
> limit the number of executors) instead of the 2 I was hoping for.
> Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I
> want, but would it not be easier to add a {{--num-executors}}-like option to
> standalone mode to really fine-tune the configuration? This option is
> already available in YARN mode.
> From my understanding, I don't see any other option lying around that can
> help achieve this.
> This seems slightly disturbing for newcomers, and standalone mode is
> probably the first thing anyone will use to just try out Spark or test some
> configuration.
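The surprise in the reporter's example comes down to simple arithmetic. A back-of-the-envelope sketch (assumption: standalone mode caps executors per worker by both cores and memory, with the lower cap winning; this is not the actual scheduler code):

```python
# Why asking for 2-core/10GB executors on an 8-core/30GB worker yields 3
# executors: each resource imposes its own cap, and the lower one wins.

def executors_spawned(worker_cores, worker_mem_gb, exec_cores, exec_mem_gb):
    by_cores = worker_cores // exec_cores      # how many fit by CPU
    by_memory = worker_mem_gb // exec_mem_gb   # how many fit by memory
    return min(by_cores, by_memory)

# Cores would allow 4 executors, memory allows 3 -> 3, not the desired 2.
assert executors_spawned(8, 30, 2, 10) == 3
```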
[jira] [Commented] (SPARK-17310) Disable Parquet's record-by-record filter in normal parquet reader and do it in Spark-side
[ https://issues.apache.org/jira/browse/SPARK-17310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15448672#comment-15448672 ] Andrew Duffy commented on SPARK-17310:
---
+1 to this; see the comments on https://github.com/apache/spark/pull/14671, particularly rdblue's comment. We need to wait for the next release of Parquet to be able to set the {{parquet.filter.record-level.enabled}} config.

> Disable Parquet's record-by-record filter in normal parquet reader and do it
> in Spark-side
> ------------------------------------------------------------------------------------------
>
> Key: SPARK-17310
> URL: https://issues.apache.org/jira/browse/SPARK-17310
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Hyukjin Kwon
>
> Currently, we push filters down to the normal Parquet reader, which also
> filters record-by-record.
> It seems Spark-side codegen row-by-row filtering might be faster than
> Parquet's in general, due to the type-boxing and virtual function calls that
> Spark's version tries to avoid.
> Maybe we should perform a benchmark and disable this. This ticket came from
> https://github.com/apache/spark/pull/14671
> Please refer to the discussion in the PR.
[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv
[ https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435539#comment-15435539 ] Andrew Ash commented on SPARK-17227:
---
Rob and I work together, and we've seen datasets in mostly-CSV format that have
non-standard record delimiters (the '\0' character, for instance).
For some broader context: we've created our own CSV text parser and use it in
all our various internal products that use Spark, but we would like to
contribute this additional flexibility back to the Spark community at large
and, in the process, eliminate the need for our internal CSV datasource.
Here are the tickets Rob just opened that we would require in order to
eliminate our internal CSV datasource: SPARK-17222, SPARK-17224, SPARK-17225,
SPARK-17226, SPARK-17227.
The basic question, then, is: would the Spark community accept patches that
extend Spark's CSV parser to cover these features? We're willing to write the
code and get the patches through code review, but we would rather know up front
if these changes would never be accepted into mainline Spark due to
philosophical disagreements about what Spark's CSV datasource should be.

> Allow configuring record delimiter in csv
> -----------------------------------------
>
> Key: SPARK-17227
> URL: https://issues.apache.org/jira/browse/SPARK-17227
> Project: Spark
> Issue Type: Improvement
> Reporter: Robert Kruszewski
> Priority: Minor
>
> Instead of the hard-coded "\n".
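The requested flexibility amounts to parameterizing record splitting on a delimiter instead of hard-coding "\n". A minimal sketch in plain Python (an assumption-laden stand-in, not the Spark CSV datasource API):

```python
# Split a byte stream into records on a configurable delimiter, e.g. the
# '\0' character mentioned in the comment instead of the hard-coded "\n".

def split_records(data: bytes, delimiter: bytes = b"\n"):
    records = data.split(delimiter)
    # A trailing delimiter produces an empty final record; drop empties.
    return [r for r in records if r]

blob = b"1,alice\x002,bob\x00"
assert split_records(blob, b"\x00") == [b"1,alice", b"2,bob"]
```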
[jira] [Updated] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken
[ https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Duffy updated SPARK-17213:
---
Description:

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -----------------------------------------------------
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Andrew Duffy
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays,
> which compare bytes as unsigned integers. Currently, however, Parquet does
> not respect this ordering. This is in the process of being fixed in Parquet
> (JIRA and PR links below), but currently all filters over strings are
> broken, with an actual correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from an in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows as {{[a, é]}}, but Parquet's
> string comparison is based on signed byte array comparison, so it will
> actually create 1 row group with statistics {{min=é,max=a}}, and the row
> group will be dropped by the query.
> Based on the way Parquet pushes down Eq, that case will not affect
> correctness, but it will force you to read row groups you should be able to
> skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362
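The signed-vs-unsigned mismatch at the heart of the bug is easy to demonstrate outside Parquet. An illustrative Python sketch (not Parquet or Spark code): UTF-8 for "é" is the bytes 0xC3 0xA9, and 0xC3 is 195 unsigned but -61 as a signed byte, which flips the ordering relative to "a" (0x61 = 97):

```python
# Spark orders strings by comparing UTF-8 bytes as unsigned integers;
# Parquet's broken statistics compared them as signed bytes. For "a" vs "é"
# the two orderings disagree, which is how a row group ends up with
# min=é, max=a and gets wrongly skipped by "name > 'a'".

def unsigned_cmp(a: bytes, b: bytes) -> int:
    # Python bytes already compare lexicographically as unsigned ints.
    return (a > b) - (a < b)

def signed_cmp(a: bytes, b: bytes) -> int:
    # Reinterpret each byte as a signed 8-bit value, then compare.
    sa = [x - 256 if x > 127 else x for x in a]
    sb = [x - 256 if x > 127 else x for x in b]
    return (sa > sb) - (sa < sb)

a, e = "a".encode("utf-8"), "é".encode("utf-8")
assert unsigned_cmp(a, e) == -1   # Spark's view: "a" sorts before "é"
assert signed_cmp(a, e) == 1      # signed view: "é" sorts before "a"
```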
[jira] [Updated] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken
[ https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Duffy updated SPARK-17213:
---
Description:

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -----------------------------------------------------
>
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362
[jira] [Created] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken
Andrew Duffy created SPARK-17213:
------------------------------------
Summary: Parquet String Pushdown for Non-Eq Comparisons Broken
Key: SPARK-17213
URL: https://issues.apache.org/jira/browse/SPARK-17213
Project: Spark
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Andrew Duffy

Spark defines ordering over strings based on comparison of UTF8 byte arrays,
which compare bytes as unsigned integers. Currently, however, Parquet does not
respect this ordering. This is in the process of being fixed in Parquet (JIRA
and PR link below), but currently all filters over strings are broken, with an
actual correctness issue for {{>}} and {{<}}.

*Repro:*

Querying directly from an in-memory DataFrame:
{code}
> Seq("a", "é").toDF("name").where("name > 'a'").count
1
{code}

Querying from a parquet dataset:
{code}
> Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> spark.read.parquet("/tmp/bad").where("name > 'a'").count
0
{code}

This happens because Spark sorts the rows as {{[a, é]}}, but Parquet's string
comparison is based on signed byte array comparison, so it will actually
create 1 row group with statistics {{min=é,max=a}}, and the row group will be
dropped by the query.
[jira] [Commented] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431409#comment-15431409 ] Andrew Davidson commented on SPARK-17172:
---
Hi Sean
I forgot about that older JIRA issue. I never resolved it.
I am using Jupyter. I believe each notebook gets its own Spark context. I
googled around and found some old issues that seem to suggest that a Hive
context and a SQL context were being created. I have not figured out how to
either use a different database for the Hive context or prevent the original
Spark context from being created.

> pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred
> while calling None.org.apache.spark.sql.hive.HiveContext.
> ------------------------------------------------------------------------
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22)
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
> Reporter: Andrew Davidson
> Attachments: hiveUDFBug.html, hiveUDFBug.ipynb
>
> {code}
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
>
> # Define udf
> from pyspark.sql.functions import udf
>
> def scoreToCategory(score):
>     if score >= 80: return 'A'
>     elif score >= 60: return 'B'
>     elif score >= 35: return 'C'
>     else: return 'D'
>
> udfScoreToCategory = udf(scoreToCategory, StringType())
> {code}
> throws exception
> {code}
> Py4JJavaError: An error occurred while calling
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> {code}
[jira] [Commented] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431371#comment-15431371 ] Andrew Davidson commented on SPARK-17172:
---
Hi Sean
The data center was created using spark-ec2 from spark-1.6.1-bin-hadoop2.6.
[ec2-user@ip-172-31-22-140 root]$ cat /root/spark/RELEASE
Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0
Build flags: -Psparkr -Phadoop-1 -Phive -Phive-thriftserver -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DzincPort=3032
[ec2-user@ip-172-31-22-140 root]$

> pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred
> while calling None.org.apache.spark.sql.hive.HiveContext.
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431018#comment-15431018 ] Andrew Davidson commented on SPARK-17172: - Hi Sean, It should be very easy to use the attached notebook to reproduce the Hive bug. I got the code example from a blog; the original code worked in Spark 1.5.x. I also attached an HTML version of the notebook so you can see the entire stack trace without having to start Jupyter. thanks Andy > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430004#comment-15430004 ] Andrew Davidson commented on SPARK-17172: - Hi Sean, I do not think it is the same error. In the related bug, I could not create a UDF using sqlContext; the workaround was to change the permissions on hdfs:///tmp. The error message actually mentioned a problem with /tmp (I had thought the message referred to file:///tmp). I am not sure how the permissions got messed up; maybe someone deleted the directory by accident and Spark does not recreate it if it is missing? So I am now able to create a UDF using sqlContext, but hiveContext does not work. Given that I fixed the hdfs:/// permission problem, I think it is probably something else. Hopefully the attached notebook makes it easy to reproduce. thanks Andy > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. 
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
[jira] [Updated] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17172: Attachment: hiveUDFBug.ipynb hiveUDFBug.html > pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429465#comment-15429465 ] Andrew Davidson commented on SPARK-17172: - Attached a notebook that demonstrates the bug. Also attached an HTML version of the notebook. > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429463#comment-15429463 ] Andrew Davidson commented on SPARK-17172: - related bug report : https://issues.apache.org/jira/browse/SPARK-17143 > pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
Andrew Davidson created SPARK-17172: --- Summary: pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. Key: SPARK-17172 URL: https://issues.apache.org/jira/browse/SPARK-17172 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.2 Environment: spark version: 1.6.2 python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] Reporter: Andrew Davidson from pyspark.sql import HiveContext sqlContext = HiveContext(sc) # Define udf from pyspark.sql.functions import udf def scoreToCategory(score): if score >= 80: return 'A' elif score >= 60: return 'B' elif score >= 35: return 'C' else: return 'D' udfScoreToCategory=udf(scoreToCategory, StringType()) throws exception Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
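Incidentally, the snippet in this report would fail before Hive is even involved: it uses `StringType` without importing it. The grading logic itself is plain Python and can be checked without Spark; a minimal sketch (the commented-out lines show the extra imports a working UDF registration would need):

```python
def score_to_category(score):
    """Map a numeric score to a letter grade (logic from the bug report)."""
    if score >= 80:
        return 'A'
    elif score >= 60:
        return 'B'
    elif score >= 35:
        return 'C'
    return 'D'

# Registering it as a Spark UDF additionally requires (missing from the report):
#   from pyspark.sql.types import StringType
#   from pyspark.sql.functions import udf
#   score_udf = udf(score_to_category, StringType())

print([score_to_category(s) for s in (95, 70, 40, 10)])  # ['A', 'B', 'C', 'D']
```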
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427394#comment-15427394 ] Andrew Davidson commented on SPARK-17143: - See the email from the user's group below. I was able to find a workaround. I am not sure how hdfs:///tmp/ got created or how the permissions got messed up. ## NICE CATCH!!! Many thanks. I spent all day on this bug. The error message reports /tmp; I did not think to look on HDFS. [ec2-user@ip-172-31-22-140 notebooks]$ hadoop fs -ls hdfs:///tmp/ Found 1 items -rw-r--r-- 3 ec2-user supergroup 418 2016-04-13 22:49 hdfs:///tmp [ec2-user@ip-172-31-22-140 notebooks]$ I have no idea how hdfs:///tmp got created. I deleted it; this caused a bunch of exceptions, but those exceptions had useful messages. I was able to fix the problem as follows: $ hadoop fs -rmr hdfs:///tmp Now when I run the notebook it creates hdfs:///tmp/hive, but the permissions are wrong: $ hadoop fs -chmod 777 hdfs:///tmp/hive From: Felix Cheung Date: Thursday, August 18, 2016 at 3:37 PM To: Andrew Davidson , "user @spark" Subject: Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp Do you have a file called tmp at / on HDFS? > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. 
> : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427278#comment-15427278 ] Andrew Davidson commented on SPARK-17143: - given the exception metioned an issue with /tmp I decide to track how /tmp changed when run my cell # no spark jobs are running [ec2-user@ip-172-31-22-140 notebooks]$ !ls ls /tmp/ hsperfdata_ec2-user hsperfdata_root pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # start notebook server $ nohup startIPythonNotebook.sh > startIPythonNotebook.sh.out & [ec2-user@ip-172-31-22-140 notebooks]$ !ls ls /tmp/ hsperfdata_ec2-user hsperfdata_root pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # start the udfBug notebook [ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/ hsperfdata_ec2-user hsperfdata_root libnetty-transport-native-epoll818283657820702.so pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # execute cell that define UDF [ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/ hsperfdata_ec2-user hsperfdata_root libnetty-transport-native-epoll818283657820702.so pip_build_ec2-user spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9 [ec2-user@ip-172-31-22-140 notebooks]$ [ec2-user@ip-172-31-22-140 notebooks]$ find /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/ /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/ /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/db.lck /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log.ctrl /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/README_DO_NOT_TOUCH_FILES.txt /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/logmirror.ctrl /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/service.properties /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/README_DO_NOT_TOUCH_FILES.txt 
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0 /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c230.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c4b0.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c241.dat [... several dozen more Derby metastore seg0/*.dat files; listing truncated in the original message ...]
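The comment above tracks /tmp by running `ls` before and after each notebook step; the same diff can be made explicit with a few lines of plain Python (illustrative only, using a throwaway temporary directory rather than the real /tmp):

```python
import os
import tempfile

def snapshot(path):
    """Return the set of entries currently present in a directory."""
    return set(os.listdir(path))

workdir = tempfile.mkdtemp()  # stand-in for /tmp in the transcript above
before = snapshot(workdir)
# simulate a step (e.g. defining the UDF) dropping files into the directory
open(os.path.join(workdir, "spark-metastore.lock"), "w").close()
after = snapshot(workdir)
print(sorted(after - before))  # ['spark-metastore.lock']
```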
[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17143: Attachment: udfBug.html This html version of the notebook shows the output when run in my data center > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. 
hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062 answer = self._gateway_client.send_command(command) >1063 return_value = get_return_value( > ->
[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17143: Attachment: udfBug.ipynb The attached notebook demonstrated the reported bug. Note it includes the output when run on my mac book pro. The bug report contains the stack trace when the same code is run in my data center > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. 
hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062
[jira] [Created] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
Andrew Davidson created SPARK-17143: --- Summary: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp Key: SPARK-17143 URL: https://issues.apache.org/jira/browse/SPARK-17143 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.1 Environment: spark version: 1.6.1 python version: 3.4.3 (default, Apr 1 2015, 18:10:40) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] Reporter: Andrew Davidson For unknown reason I can not create UDF when I run the attached notebook on my cluster. I get the following error Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp The notebook runs fine on my Mac In general I am able to run non UDF spark code with out any trouble I start the notebook server as the user “ec2-user" and uses master URL spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 I found the following message in the notebook server log file. I have log level set to warn 16/08/18 21:38:45 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException The cluster was originally created using spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 #from pyspark.sql import SQLContext, HiveContext #sqlContext = SQLContext(sc) #from pyspark.sql import DataFrame #from pyspark.sql import functions from pyspark.sql.types import StringType from pyspark.sql.functions import udf print("spark version: {}".format(sc.version)) import sys print("python version: {}".format(sys.version)) spark version: 1.6.1 python version: 3.4.3 (default, Apr 1 2015, 18:10:40) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] # functions.lower() raises # py4j.Py4JException: Method lower([class java.lang.String]) does not exist # work around define a UDF toLowerUDFRetType = StringType() #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) toLowerUDF = udf(lambda s : s.lower(), StringType()) You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt assembly

{code}
Py4JJavaErrorTraceback (most recent call last)
<ipython-input> in <module>()
      4 toLowerUDFRetType = StringType()
      5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
----> 6 toLowerUDF = udf(lambda s : s.lower(), StringType())

/root/spark/python/pyspark/sql/functions.py in udf(f, returnType)
   1595     [Row(slen=5), Row(slen=3)]
   1596     """
-> 1597     return UserDefinedFunction(f, returnType)
   1598
   1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']

/root/spark/python/pyspark/sql/functions.py in __init__(self, func, returnType, name)
   1556         self.returnType = returnType
   1557         self._broadcast = None
-> 1558         self._judf = self._create_judf(name)
   1559
   1560     def _create_judf(self, name):

/root/spark/python/pyspark/sql/functions.py in _create_judf(self, name)
   1567         pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command, self)
   1568         ctx = SQLContext.getOrCreate(sc)
-> 1569         jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
   1570         if name is None:
   1571             name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__

/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
    681         try:
    682             if not hasattr(self, '_scala_HiveContext'):
--> 683                 self._scala_HiveContext = self._get_hive_ctx()
    684             return self._scala_HiveContext
    685         except Py4JError as e:

/root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
    690
    691     def _get_hive_ctx(self):
--> 692         return self._jvm.HiveContext(self._jsc.sc())
    693
    694     def refreshTable(self, tableName):

/root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1062         answer = self._gateway_client.send_command(command)
   1063         return_value = get_return_value(
-> 1064             answer, self._gateway_client, None, self._fqn)
   1065
   1066         for temp_arg in temp_args:

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/root/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306             raise Py4JJavaError(
    307                 "An error occurred
{code}
[jira] [Created] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq
Andrew Duffy created SPARK-17091: Summary: ParquetFilters rewrite IN to OR of Eq Key: SPARK-17091 URL: https://issues.apache.org/jira/browse/SPARK-17091 Project: Spark Issue Type: Bug Reporter: Andrew Duffy Past attempts at pushing down the InSet operation for Parquet relied on user-defined predicates. It would be simpler to rewrite an IN clause into the corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
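The rewrite proposed above is mechanical: an IN predicate over a literal set becomes a disjunction of equality predicates, which Parquet's filter API already understands. A small Python sketch over a toy predicate AST (the `Eq`/`Or`/`In` node classes here are illustrative stand-ins, not Spark's actual classes):

```python
# Toy predicate AST: rewrite In(col, [v1, v2, ...]) into
# Eq(col, v1) OR Eq(col, v2) OR ... as a left-deep OR chain.
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)
class Eq:
    column: str
    value: object

@dataclass(frozen=True)
class Or:
    left: object
    right: object

@dataclass(frozen=True)
class In:
    column: str
    values: tuple

def rewrite_in(pred):
    """Rewrite an In predicate; leave every other predicate untouched."""
    if isinstance(pred, In):
        eqs = [Eq(pred.column, v) for v in pred.values]
        return reduce(Or, eqs)  # fold the equality tests into an OR chain
    return pred

print(rewrite_in(In("k", (1, 2, 3))))
```

The same fold is what a pushdown layer would emit for Parquet, where row groups can then be skipped per equality condition.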
[jira] [Created] (SPARK-17059) Allow FileFormat to specify partition pruning strategy
Andrew Duffy created SPARK-17059: Summary: Allow FileFormat to specify partition pruning strategy Key: SPARK-17059 URL: https://issues.apache.org/jira/browse/SPARK-17059 Project: Spark Issue Type: Bug Reporter: Andrew Duffy Allow Spark to have pluggable pruning of input files for FileSourceScanExec by allowing FileFormat implementations to specify a format-specific filterPartitions method. This is especially useful for Parquet as Spark does not currently make use of the summary metadata, instead reading the footer of all part files for a Parquet data source. This can lead to massive speedups when reading a filtered chunk of a dataset, especially when using remote storage (S3). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
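The payoff described above comes from skipping part files whose summary statistics rule them out before any footers are opened. A toy Python sketch of such a filterPartitions-style hook (the file names and the per-file min/max stats format are invented for illustration; real Parquet stats live in the footer or summary metadata):

```python
# Sketch: prune part files whose [min, max] value range cannot satisfy
# a range filter, so only the surviving files are ever opened.

def filter_partitions(files, lower, upper):
    """Keep files whose value range overlaps the filter range [lower, upper]."""
    return [name for name, (fmin, fmax) in files.items()
            if fmax >= lower and fmin <= upper]

files = {
    "part-00000": (0, 99),
    "part-00001": (100, 199),
    "part-00002": (200, 299),
}

# A filter like `value BETWEEN 150 AND 180` only needs one of the three files.
print(filter_partitions(files, 150, 180))
```

On remote storage this matters because every footer read is a round trip; pruning by summary stats turns O(files) reads into O(matching files).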
[jira] [Commented] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself
[ https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418149#comment-15418149 ] Andrew Ash commented on SPARK-17029: Note RDD form usage from https://issues.apache.org/jira/browse/SPARK-10705 > Dataset toJSON goes through RDD form instead of transforming dataset itself > --- > > Key: SPARK-17029 > URL: https://issues.apache.org/jira/browse/SPARK-17029 > Project: Spark > Issue Type: Bug >Reporter: Robert Kruszewski > > No longer necessary and can be optimized with datasets -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central
[ https://issues.apache.org/jira/browse/SPARK-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew B closed SPARK-16780. > spark-streaming-kafka_2.10 version 2.0.0 not on maven central > - > > Key: SPARK-16780 > URL: https://issues.apache.org/jira/browse/SPARK-16780 > Project: Spark > Issue Type: Bug >Reporter: Andrew B > > I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven > central. Has this been released? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central
[ https://issues.apache.org/jira/browse/SPARK-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398105#comment-15398105 ] Andrew B commented on SPARK-16780: -- How are the new artifacts used with the example below? https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java#L3 The example contains a reference to the KafkaUtils class, which contains a createStream() method. However, org.apache.spark.streaming.kafka010.KafkaUtils, which is in spark-streaming-kafka-0-10_2.10, has switched over to the Direct Stream API, so it does not have a createStream method. > spark-streaming-kafka_2.10 version 2.0.0 not on maven central > - > > Key: SPARK-16780 > URL: https://issues.apache.org/jira/browse/SPARK-16780 > Project: Spark > Issue Type: Bug >Reporter: Andrew B > > I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven > central. Has this been released? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central
Andrew B created SPARK-16780: Summary: spark-streaming-kafka_2.10 version 2.0.0 not on maven central Key: SPARK-16780 URL: https://issues.apache.org/jira/browse/SPARK-16780 Project: Spark Issue Type: Bug Reporter: Andrew B I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven central. Has this been released? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16665) python import pyspark fails in context.py
[ https://issues.apache.org/jira/browse/SPARK-16665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389135#comment-15389135 ] Andrew Jefferson commented on SPARK-16665: -- This was the result of a previous failed import in python > python import pyspark fails in context.py > -- > > Key: SPARK-16665 > URL: https://issues.apache.org/jira/browse/SPARK-16665 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Andrew Jefferson >Priority: Critical > > Using 2.0.0 Release Candidate 5 > python > import pyspark > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "pyspark/__init__.py", line 44, in <module> > from pyspark.context import SparkContext > File "pyspark/context.py", line 28, in <module> > from pyspark import accumulators > ImportError: cannot import name accumulators -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16665) python import pyspark fails in context.py
[ https://issues.apache.org/jira/browse/SPARK-16665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jefferson resolved SPARK-16665. -- Resolution: Cannot Reproduce > python import pyspark fails in context.py > -- > > Key: SPARK-16665 > URL: https://issues.apache.org/jira/browse/SPARK-16665 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Andrew Jefferson >Priority: Critical > > Using 2.0.0 Release Candidate 5 > python > import pyspark > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "pyspark/__init__.py", line 44, in <module> > from pyspark.context import SparkContext > File "pyspark/context.py", line 28, in <module> > from pyspark import accumulators > ImportError: cannot import name accumulators -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16665) python import pyspark fails in context.py
[ https://issues.apache.org/jira/browse/SPARK-16665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387908#comment-15387908 ] Andrew Jefferson commented on SPARK-16665: -- Pull Request here: https://github.com/apache/spark/pull/14303 > python import pyspark fails in context.py > -- > > Key: SPARK-16665 > URL: https://issues.apache.org/jira/browse/SPARK-16665 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Andrew Jefferson >Priority: Critical > > Using 2.0.0 Release Candidate 5 > python > import pyspark > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "pyspark/__init__.py", line 44, in <module> > from pyspark.context import SparkContext > File "pyspark/context.py", line 28, in <module> > from pyspark import accumulators > ImportError: cannot import name accumulators -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16665) python import pyspark fails in context.py
Andrew Jefferson created SPARK-16665: Summary: python import pyspark fails in context.py Key: SPARK-16665 URL: https://issues.apache.org/jira/browse/SPARK-16665 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Andrew Jefferson Priority: Critical Using 2.0.0 Release Candidate 5 python import pyspark Traceback (most recent call last): File "<stdin>", line 1, in <module> File "pyspark/__init__.py", line 44, in <module> from pyspark.context import SparkContext File "pyspark/context.py", line 28, in <module> from pyspark import accumulators ImportError: cannot import name accumulators -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16265) Add option to SparkSubmit to ship driver JRE to YARN
[ https://issues.apache.org/jira/browse/SPARK-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365998#comment-15365998 ] Andrew Duffy commented on SPARK-16265: -- Hi Sean, yeah I can see where you're coming from, but I feel like this change is simple and targeted enough (meant to be used with the {{SparkLauncher}} API) that it can actually be useful without adding much (if any) maintenance load. If anything I would argue it at least deserves consideration as an experimental feature, as users who write programs that use SparkLauncher are going to have to split Java versions for the code that launches and interacts with the Spark app and the Spark app itself if the application is e.g. written for one environment and then deployed in another uncontrolled customer environment where the cluster does not have Java 8 installed. > Add option to SparkSubmit to ship driver JRE to YARN > > > Key: SPARK-16265 > URL: https://issues.apache.org/jira/browse/SPARK-16265 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.2 >Reporter: Andrew Duffy > > Add an option to {{SparkSubmit}} to allow the driver to package up its > version of the JRE to be shipped to a YARN cluster. This allows deploying > Spark applications whose required Java version need not match one of the > versions already installed on the YARN cluster, useful in situations in > which the Spark Application developer does not have administrative access > over the YARN cluster (e.g. school or corporate environment) but still > wants to use certain language features in their code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359277#comment-15359277 ] Andrew Davidson commented on SPARK-15829: - Hi Sean you mention the ec2 script is not supported anymore? What was the last release it was supported in? It's still part of the 1.6.x documentation Is there a replacement or alternative? thanks Andy > spark master webpage links to application UI broke when running in cluster > mode > --- > > Key: SPARK-15829 > URL: https://issues.apache.org/jira/browse/SPARK-15829 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.6.1 > Environment: AWS ec2 cluster >Reporter: Andrew Davidson >Priority: Critical > > Hi > I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > I use the standalone cluster manager. I have a streaming app running in > cluster mode. I notice the master webpage links to the application UI page > are incorrect > It does not look like jira will let me upload images. I'll try and describe > the web pages and the bug > My master is running on > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/ > It has a section marked "applications". If I click on one of the running > application ids I am taken to a page showing "Executor Summary". This page > has a link to the 'application detail UI' the url is > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/ > Notice it thinks the application UI is running on the cluster master. > It is actually running on the same machine as the driver on port 4041. I was > able to reverse engineer the url by noticing the private ip address is part of > the worker id . 
For example worker-20160322041632-172.31.23.201-34909 > next I went on the aws ec2 console to find the public DNS name for this > machine > http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/ > Kind regards > Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359215#comment-15359215 ] Andrew Davidson commented on SPARK-15829: - Hi Sean I am not sure how to check the value of 'SPARK_MASTER_HOST'. I looked at the documentation pages http://spark.apache.org/docs/latest/configuration.html and http://spark.apache.org/docs/latest/ec2-scripts.html. They do not mention SPARK_MASTER_HOST. When I submit my jobs I use MASTER_URL=spark://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:6066 I use the standalone cluster manager. I think the problem may be that the web UI assumes the driver is always running on the master machine. I assume the cluster manager decides which worker the driver will run on. Is there a way for the web UI to discover where the driver is running? On my master:
{code}
[ec2-user@ip-172-31-22-140 conf]$ pwd
/root/spark/conf
[ec2-user@ip-172-31-22-140 conf]$ cat slaves
ec2-54-193-94-207.us-west-1.compute.amazonaws.com
ec2-54-67-13-246.us-west-1.compute.amazonaws.com
ec2-54-67-48-49.us-west-1.compute.amazonaws.com
ec2-54-193-104-169.us-west-1.compute.amazonaws.com
[ec2-user@ip-172-31-22-140 conf]$ grep SPARK_MASTER_HOST *
[ec2-user@ip-172-31-22-140 sbin]$ pwd
/root/spark/sbin
[ec2-user@ip-172-31-22-140 sbin]$ grep SPARK_MASTER_HOST *
[ec2-user@ip-172-31-22-140 bin]$ grep SPARK_MASTER_HOST *
{code}
Thanks for looking into this Andy > spark master webpage links to application UI broke when running in cluster > mode > --- > > Key: SPARK-15829 > URL: https://issues.apache.org/jira/browse/SPARK-15829 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.6.1 > Environment: AWS ec2 cluster >Reporter: Andrew Davidson >Priority: Critical > > Hi > I created a cluster using the 
spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > I use the standalone cluster manager. I have a streaming app running in > cluster mode. I notice the master webpage links to the application UI page > are incorrect > It does not look like jira will let me upload images. I'll try and describe > the web pages and the bug > My master is running on > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/ > It has a section marked "applications". If I click on one of the running > application ids I am taken to a page showing "Executor Summary". This page > has a link to the 'application detail UI' the url is > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/ > Notice it thinks the application UI is running on the cluster master. > It is actually running on the same machine as the driver on port 4041. I was > able to reverse engineer the url by noticing the private ip address is part of > the worker id . For example worker-20160322041632-172.31.23.201-34909 > next I went on the aws ec2 console to find the public DNS name for this > machine > http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/ > Kind regards > Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16265) Add option to SparkSubmit to ship driver JRE to YARN
Andrew Duffy created SPARK-16265: Summary: Add option to SparkSubmit to ship driver JRE to YARN Key: SPARK-16265 URL: https://issues.apache.org/jira/browse/SPARK-16265 Project: Spark Issue Type: Improvement Affects Versions: 1.6.2 Reporter: Andrew Duffy Fix For: 2.1.0 Add an option to {{SparkSubmit}} to allow the driver to package up its version of the JRE to be shipped to a YARN cluster. This allows deploying Spark applications whose required Java version need not match one of the versions already installed on the YARN cluster, useful in situations in which the Spark Application developer does not have administrative access over the YARN cluster (e.g. school or corporate environment) but still wants to use certain language features in their code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
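What the proposed option would automate can already be done by hand on YARN: package a JRE as an archive, ship it with `--archives`, and point `JAVA_HOME` at the unpacked directory. A Python sketch that merely assembles those spark-submit arguments (the archive name is invented; the property names follow the existing `spark.yarn.appMasterEnv.*` / `spark.executorEnv.*` conventions, not a flag that existed at the time of this issue):

```python
# Sketch: build the spark-submit arguments that ship a local JRE archive
# to YARN and point both the AM and the executors at the unpacked copy.

def jre_ship_args(jre_archive="jre-1.8.tar.gz"):
    # YARN unpacks the archive next to the container working directory
    # under the alias given after '#'.
    unpacked = jre_archive.split(".tar")[0]
    return [
        "--archives", "%s#%s" % (jre_archive, unpacked),
        "--conf", "spark.yarn.appMasterEnv.JAVA_HOME=./%s" % unpacked,
        "--conf", "spark.executorEnv.JAVA_HOME=./%s" % unpacked,
    ]

print(" ".join(jre_ship_args()))
```

The option discussed in the issue would fold these three pieces into one switch so SparkLauncher users need not manage them manually.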
[jira] [Created] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches
Andrew Or created SPARK-16196: - Summary: Optimize in-memory scan performance using ColumnarBatches Key: SPARK-16196 URL: https://issues.apache.org/jira/browse/SPARK-16196 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or A simple benchmark such as the following reveals inefficiencies in the existing in-memory scan implementation: {code} spark.range(N) .selectExpr("id", "floor(rand() * 1) as k") .createOrReplaceTempView("test") val ds = spark.sql("select count(k), count(id) from test").cache() ds.collect() ds.collect() {code} There are many reasons why caching is slow. The biggest is that compression takes a long time. The second is that there are a lot of virtual function calls in this hot code path since the rows are processed using iterators. Further, the rows are converted to and from ByteBuffers, which are slow to read in general. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
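The iterator-overhead point above can be illustrated outside Spark: row-at-a-time processing pays a dispatch cost per row, while a columnar batch pays it once per chunk. A rough pure-Python sketch of the same aggregation done both ways (the batch size of 256 is arbitrary; this is the shape of the trade-off, not Spark code):

```python
# Row-at-a-time: one loop step and tuple access per row, mimicking
# iterator-based scans. Batch: one pass per column chunk, mimicking
# a ColumnarBatch scan over the "k" column.

def count_via_rows(rows):
    n = 0
    for row in rows:              # per-row dispatch, like next() on an iterator
        if row[1] is not None:
            n += 1
    return n

def count_via_batches(batches):
    # Each batch is a contiguous list of "k" values; one tight loop per chunk.
    return sum(sum(1 for v in batch if v is not None) for batch in batches)

rows = [(i, i % 10) for i in range(1000)]
batches = [[r[1] for r in rows[i:i + 256]] for i in range(0, len(rows), 256)]
print(count_via_rows(rows), count_via_batches(batches))
```

Both paths compute the same count; the columnar layout is what lets real engines skip per-row virtual calls and ByteBuffer decoding.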
[jira] [Commented] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property
[ https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342376#comment-15342376 ] Andrew Or commented on SPARK-15917: --- (1) Yes, right now `spark.executor.instances` doesn't do anything in standalone mode even if it's set so we should support it. (2) There are several options here. Cores and number of executors are inherently conflicting things so ideally we should disallow the setting of both of them. We could throw an exception with a good error message, but that would fail existing apps that do have both of them set. We could just log a warning, but there's a high chance that people just won't see the warning. Both options are fine but I'm slightly in favor of throwing an exception when conflicting configs are set. You might have to dig into the internal scheduling code in Master.scala to support num instances. > Define the number of executors in standalone mode with an easy-to-use property > -- > > Key: SPARK-15917 > URL: https://issues.apache.org/jira/browse/SPARK-15917 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > After stumbling across a few StackOverflow posts around the issue of using a > fixed number of executors in standalone mode (non-YARN), I was wondering if > we could not add an easier way to set this parameter than having to resort to > some calculations based on the number of cores and the memory you have > available on your worker. > For example, let's say I have 8 cores and 30GB of memory available : > - If no option is passed, one executor will be spawned with 8 cores and 1GB > of memory allocated. > - However, if I want to have only *2* executors, and to use 2 cores and 10GB > of memory per executor, I will end up with *3* executors (as the available > memory will limit the number of executors) instead of the 2 I was hoping for. 
> Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I > want, but would it not be easier to add a {{--num-executors}}-like option to > standalone mode to be able to really fine-tune the configuration ? This > option is already available in YARN mode. > From my understanding, I don't see any other option lying around that can > help achieve this. > This seems to be slightly disturbing for newcomers, and standalone mode is > probably the first thing anyone will use to just try out Spark or test some > configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
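The "throw an exception on conflicting configs" option favored in the comment above could look like the following sketch (plain Python, not Spark's actual validation; the reconciliation rule — derive a count from cores.max / executor.cores and compare it to the explicit instance count — is an assumption for illustration):

```python
# Sketch: reject a config where the explicit executor count disagrees with
# the count implied by spark.cores.max / spark.executor.cores.

def check_executor_conf(conf):
    instances = conf.get("spark.executor.instances")
    cores_max = conf.get("spark.cores.max")
    exec_cores = conf.get("spark.executor.cores")
    if instances is not None and cores_max is not None and exec_cores is not None:
        derived = cores_max // exec_cores
        if derived != instances:
            raise ValueError(
                "Conflicting settings: spark.executor.instances=%d but "
                "spark.cores.max/spark.executor.cores implies %d executors"
                % (instances, derived))
    if instances is not None:
        return instances
    if cores_max is not None and exec_cores is not None:
        return cores_max // exec_cores
    return None

# spark.cores.max=8 with 4 cores per executor implies 2 executors, not 4:
try:
    check_executor_conf({"spark.executor.instances": 4,
                         "spark.executor.cores": 4,
                         "spark.cores.max": 8})
except ValueError as e:
    print(e)
```

Failing fast like this surfaces the conflict at submit time instead of silently spawning a different number of executors than the user asked for.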
[jira] [Commented] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property
[ https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342310#comment-15342310 ] Andrew Or commented on SPARK-15917: --- Yeah, I agree. We need to deal with conflicting options, however, e.g. spark.executor.cores=4, spark.executor.instances=4, spark.cores.max=8. [~JonathanTaws] would you like to work on this? > Define the number of executors in standalone mode with an easy-to-use property > -- > > Key: SPARK-15917 > URL: https://issues.apache.org/jira/browse/SPARK-15917 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > After stumbling across a few StackOverflow posts around the issue of using a > fixed number of executors in standalone mode (non-YARN), I was wondering if > we could not add an easier way to set this parameter than having to resort to > some calculations based on the number of cores and the memory you have > available on your worker. > For example, let's say I have 8 cores and 30GB of memory available : > - If no option is passed, one executor will be spawned with 8 cores and 1GB > of memory allocated. > - However, if I want to have only *2* executors, and to use 2 cores and 10GB > of memory per executor, I will end up with *3* executors (as the available > memory will limit the number of executors) instead of the 2 I was hoping for. > Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I > want, but would it not be easier to add a {{--num-executors}}-like option to > standalone mode to be able to really fine-tune the configuration ? This > option is already available in YARN mode. > From my understanding, I don't see any other option lying around that can > help achieve this. > This seems to be slightly disturbing for newcomers, and standalone mode is > probably the first thing anyone will use to just try out Spark or test some > configuration. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16023) Move InMemoryRelation to its own file
Andrew Or created SPARK-16023: - Summary: Move InMemoryRelation to its own file Key: SPARK-16023 URL: https://issues.apache.org/jira/browse/SPARK-16023 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor Just to make InMemoryTableScanExec a little smaller and more readable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15749) Make the error message more meaningful
[ https://issues.apache.org/jira/browse/SPARK-15749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15749. --- Resolution: Fixed Assignee: Huaxin Gao Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Make the error message more meaningful > -- > > Key: SPARK-15749 > URL: https://issues.apache.org/jira/browse/SPARK-15749 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Trivial > Fix For: 2.0.0 > > > For table test1 (C1 varchar (10), C2 varchar (10)), when I insert a row using > sqlContext.sql("insert into test1 values ('abc', 'def', 1)") > I got error message > Exception in thread "main" java.lang.RuntimeException: Relation[C1#0,C2#1] > JDBCRelation(test1) > requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE > statement generates the same number of columns as its schema. > The error message is a little confusing. In my simple insert statement, it > doesn't have a SELECT clause. > I will change the error message to a more general one > Exception in thread "main" java.lang.RuntimeException: Relation[C1#0,C2#1] > JDBCRelation(test1) > requires that the data to be inserted have the same number of columns as the > target table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15868) Executors table in Executors tab should sort Executor IDs in numerical order (not alphabetical order)
[ https://issues.apache.org/jira/browse/SPARK-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15868. --- Resolution: Fixed Assignee: Alex Bozarth Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Executors table in Executors tab should sort Executor IDs in numerical order > (not alphabetical order) > - > > Key: SPARK-15868 > URL: https://issues.apache.org/jira/browse/SPARK-15868 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Alex Bozarth >Priority: Minor > Fix For: 2.0.0 > > Attachments: spark-webui-executors-sorting-2.png, > spark-webui-executors-sorting.png > > > It _appears_ that Executors table in Executors tab sorts Executor IDs in > alphabetical order while it should in numerical. It does sorting in a more > "friendly" way yet driver executor appears between 0 and 1? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
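The underlying fix is a sort key that compares executor IDs numerically and pins the driver entry first, instead of sorting the strings alphabetically (where "10" < "2" and "driver" lands between digits). A minimal Python sketch of such a key (the actual change was in the web UI's table sorting, not this code):

```python
# Sort executor IDs numerically, with the "driver" pseudo-ID always first.
def executor_sort_key(executor_id):
    if executor_id == "driver":
        return (0, 0)           # sorts before every numeric ID
    return (1, int(executor_id))

ids = ["10", "2", "driver", "0", "1"]
print(sorted(ids, key=executor_sort_key))  # driver first, then 0, 1, 2, 10
```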
[jira] [Resolved] (SPARK-15998) Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING
[ https://issues.apache.org/jira/browse/SPARK-15998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15998. --- Resolution: Fixed Fix Version/s: 2.0.0 > Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING > > > Key: SPARK-15998 > URL: https://issues.apache.org/jira/browse/SPARK-15998 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.0.0 > > > HIVE_METASTORE_PARTITION_PRUNING is a public SQLConf. When true, some > predicates will be pushed down into the Hive metastore so that unmatching > partitions can be eliminated earlier. The current default value is false. > So far, the code base does not have such a test case to verify whether this > SQLConf properly works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15998) Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING
[ https://issues.apache.org/jira/browse/SPARK-15998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15998: -- Assignee: Xiao Li > Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING > > > Key: SPARK-15998 > URL: https://issues.apache.org/jira/browse/SPARK-15998 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > HIVE_METASTORE_PARTITION_PRUNING is a public SQLConf. When true, some > predicates will be pushed down into the Hive metastore so that unmatching > partitions can be eliminated earlier. The current default value is false. > So far, the code base does not have such a test case to verify whether this > SQLConf properly works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15975) Improper Popen.wait() return code handling in dev/run-tests
[ https://issues.apache.org/jira/browse/SPARK-15975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15975. --- Resolution: Fixed Fix Version/s: 2.0.0 1.6.2 1.5.3 > Improper Popen.wait() return code handling in dev/run-tests > --- > > Key: SPARK-15975 > URL: https://issues.apache.org/jira/browse/SPARK-15975 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 1.6.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.5.3, 1.6.2, 2.0.0 > > > In dev/run-tests.py there's a line where we effectively do > {code} > retcode = some_popen_instance.wait() > if retcode > 0: > err > # else do nothing > {code} > but this code is subtlety wrong because Popen's return code will be negative > if the child process was terminated by a signal: > https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode > We should change this to {{retcode != 0}} so that we properly error out and > exit due to termination by signal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
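The subtlety above is easy to reproduce: on POSIX, {{Popen.wait()}} returns the negated signal number when the child is killed by a signal, so a {{retcode > 0}} check silently treats that death as success. A minimal demonstration (POSIX only):

```python
import signal
import subprocess

# Start a long-running child and kill it with SIGTERM.
proc = subprocess.Popen(["sleep", "30"])
proc.terminate()                 # sends SIGTERM
retcode = proc.wait()

print(retcode)                   # negative on POSIX: the negated signal number
assert retcode != 0              # the correct check from the fix
assert retcode < 0               # a `retcode > 0` check would miss this failure
```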
[jira] [Resolved] (SPARK-15978) Some improvement of "Show Tables"
[ https://issues.apache.org/jira/browse/SPARK-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15978. --- Resolution: Fixed Assignee: Bo Meng (was: Apache Spark) Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Some improvement of "Show Tables" > - > > Key: SPARK-15978 > URL: https://issues.apache.org/jira/browse/SPARK-15978 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Bo Meng >Assignee: Bo Meng >Priority: Minor > Fix For: 2.0.0 > > > I've found some minor issues in "show tables" command: > 1. In the SessionCatalog.scala, listTables(db: String) method will call > listTables(formatDatabaseName(db), "*") to list all the tables for certain > db, but in the method listTables(db: String, pattern: String), this db name > is formatted once more. So I think we should remove formatDatabaseName() in > the caller. > 2. I suggest to add sort to listTables(db: String) in InMemoryCatalog.scala, > just like listDatabases(). > I will make a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15576) Add back hive tests blacklisted by SPARK-15539
[ https://issues.apache.org/jira/browse/SPARK-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15576: -- Target Version/s: 2.1.0 (was: 2.0.0) > Add back hive tests blacklisted by SPARK-15539 > -- > > Key: SPARK-15576 > URL: https://issues.apache.org/jira/browse/SPARK-15576 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or > > These were removed from HiveCompatibilitySuite. They should be added back to > HiveQuerySuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325001#comment-15325001 ] Andrew Davidson commented on SPARK-15829: - Hi Xin, I ran netstat on my master. I do not think the ports are in use. To submit in cluster mode I use port 6066. If you are using port 7077 you are in client mode. In client mode the application UI will run on the Spark master. In cluster mode the application UI runs on whichever slave the driver is running on. If you notice in my original description, the URL is incorrect: the IP is wrong, the port is correct. Kind regards, Andy
bash-4.2# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:8652 0.0.0.0:* LISTEN 3832/gmetad
tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN 2584/rserver
tcp 0 0 0.0.0.0:36757 0.0.0.0:* LISTEN 2905/java
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 2905/java
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 2144/sshd
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN 2095/cupsd
tcp 0 0 127.0.0.1:7000 0.0.0.0:* LISTEN 6512/python3.4
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 2183/sendmail
tcp 0 0 0.0.0.0:43813 0.0.0.0:* LISTEN 3093/java
tcp 0 0 172.31.22.140:9000 0.0.0.0:* LISTEN 2905/java
tcp 0 0 0.0.0.0:8649 0.0.0.0:* LISTEN 3810/gmond
tcp 0 0 0.0.0.0:50090 0.0.0.0:* LISTEN 3093/java
tcp 0 0 0.0.0.0:8651 0.0.0.0:* LISTEN 3832/gmetad
tcp 0 0 :::8080 :::* LISTEN 23719/java
tcp 0 0 :::8081 :::* LISTEN 5588/java
tcp 0 0 :::172.31.22.140:6066 :::* LISTEN 23719/java
tcp 0 0 :::172.31.22.140:6067 :::* LISTEN 5588/java
tcp 0 0 :::22 :::* LISTEN 2144/sshd
tcp 0 0 ::1:631 :::* LISTEN 2095/cupsd
tcp 0 0 :::19998 :::* LISTEN 3709/java
tcp 0 0 :::1 :::* LISTEN 3709/java
tcp 0 0 :::172.31.22.140:7077 :::* LISTEN 23719/java
tcp 0 0 :::172.31.22.140:7078 :::* LISTEN 5588/java
udp 0 0 0.0.0.0:8649 0.0.0.0:* 3810/gmond
udp 0 0 0.0.0.0:631 0.0.0.0:* 2095/cupsd
udp 0 0 0.0.0.0:38546 0.0.0.0:* 2905/java
udp 0 0 0.0.0.0:68 0.0.0.0:* 1142/dhclient
udp 0 0 172.31.22.140:123 0.0.0.0:* 2168/ntpd
udp 0 0 127.0.0.1:123 0.0.0.0:* 2168/ntpd
udp 0 0 0.0.0.0:123 0.0.0.0:* 2168/ntpd
bash-4.2# > spark master webpage links to application UI broke when running in cluster > mode > --- > > Key: SPARK-15829 > URL: https://issues.apache.org/jira/browse/SPARK-15829 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.6.1 > Environment: AWS ec2 cluster >Reporter: Andrew Davidson >Priority: Critical > > Hi > I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > I use the standalone cluster manager. I have a streaming app running in > cluster mode. I notice the master webpage links to the application UI page > are incorrect > It does not look like JIRA will let me upload images. I'll try and
[jira] [Commented] (SPARK-15867) TABLESAMPLE BUCKET semantics don't match Hive's
[ https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324960#comment-15324960 ] Andrew Or commented on SPARK-15867: --- I think we should fix it, though it looks like it's been an issue for a while. I don't think it was ever documented so it's OK to change the behavior. > TABLESAMPLE BUCKET semantics don't match Hive's > --- > > Key: SPARK-15867 > URL: https://issues.apache.org/jira/browse/SPARK-15867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Andrew Or > > {code} > SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16) > {code} > In Hive, this would select the 3rd bucket out of every 16 buckets there are > in the table. E.g. if the table was clustered by 32 buckets then this would > sample the 3rd and the 19th bucket. (See > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling) > In Spark, however, we simply sample 3/16 of the number of input rows. > Either we don't support it in Spark or do it in a way that's consistent with > Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15867) TABLESAMPLE BUCKET semantics don't match Hive's
[ https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15867: -- Affects Version/s: 1.6.0 > TABLESAMPLE BUCKET semantics don't match Hive's > --- > > Key: SPARK-15867 > URL: https://issues.apache.org/jira/browse/SPARK-15867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Andrew Or > > {code} > SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16) > {code} > In Hive, this would select the 3rd bucket out of every 16 buckets there are > in the table. E.g. if the table was clustered by 32 buckets then this would > sample the 3rd and the 19th bucket. (See > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling) > In Spark, however, we simply sample 3/16 of the number of input rows. > Either we don't support it in Spark or do it in a way that's consistent with > Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15867) TABLESAMPLE BUCKET semantics don't match Hive's
Andrew Or created SPARK-15867: - Summary: TABLESAMPLE BUCKET semantics don't match Hive's Key: SPARK-15867 URL: https://issues.apache.org/jira/browse/SPARK-15867 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or {code} SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16) {code} In Hive, this would select the 3rd bucket out of every 16 buckets there are in the table. E.g. if the table was clustered by 32 buckets then this would sample the 3rd and the 19th bucket. (See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling) In Spark, however, we simply sample 3/16 of the number of input rows. Either we don't support it in Spark or do it in a way that's consistent with Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
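The Hive semantics described in SPARK-15867 fit in a few lines of Python (illustrative only, based on the Hive sampling manual linked above): `TABLESAMPLE (BUCKET x OUT OF y)` on a table clustered into n buckets selects every bucket congruent to x modulo y, not an x/y fraction of rows.

```python
def hive_sampled_buckets(x, y, n):
    """Buckets (1-indexed) selected by TABLESAMPLE (BUCKET x OUT OF y)
    on a table clustered into n buckets, per Hive's documented semantics."""
    return [b for b in range(1, n + 1) if (b - x) % y == 0]

# BUCKET 3 OUT OF 16 on a table clustered by 32 buckets samples
# the 3rd and the 19th bucket, matching the example in the issue.
print(hive_sampled_buckets(3, 16, 32))  # [3, 19]
```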
[jira] [Created] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
Andrew Davidson created SPARK-15829: --- Summary: spark master webpage links to application UI broke when running in cluster mode Key: SPARK-15829 URL: https://issues.apache.org/jira/browse/SPARK-15829 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.6.1 Environment: AWS ec2 cluster Reporter: Andrew Davidson Priority: Critical Hi I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 I use the standalone cluster manager. I have a streaming app running in cluster mode. I notice the master webpage links to the application UI page are incorrect. It does not look like JIRA will let me upload images. I'll try and describe the web pages and the bug. My master is running on http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/ It has a section marked "applications". If I click on one of the running application ids I am taken to a page showing "Executor Summary". This page has a link to the 'application detail UI'; the URL is http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/ Notice it thinks the application UI is running on the cluster master. It is actually running on the same machine as the driver on port 4041. I was able to reverse engineer the URL by noticing the private IP address is part of the worker id. For example, worker-20160322041632-172.31.23.201-34909. Next I went to the AWS EC2 console to find the public DNS name for this machine: http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/ Kind regards Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-15722) Wrong data when CTAS specifies schema
[ https://issues.apache.org/jira/browse/SPARK-15722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15722: -- Comment: was deleted (was: User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/13457) > Wrong data when CTAS specifies schema > - > > Key: SPARK-15722 > URL: https://issues.apache.org/jira/browse/SPARK-15722 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > {code} > scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV") > scala> (1 to 3).map { i => (i, i * 2, i * 3) }.toDF("height", "length", > "width").write.insertInto("boxes") > scala> spark.table("boxes").show() > +-+--+--+ > |width|length|height| > +-+--+--+ > |1| 2| 3| > |2| 4| 6| > |3| 6| 9| > +-+--+--+ > scala> sql("CREATE TABLE blocks (name STRING, age INT) AS SELECT * FROM > boxes") > scala> spark.table("blocks").show() > ++---+ > |name|age| > ++---+ > | 1| 2| > | 2| 4| > | 3| 6| > ++---+ > {code} > The columns don't even match in types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15736) Gracefully handle loss of DiskStore files
[ https://issues.apache.org/jira/browse/SPARK-15736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15736. --- Resolution: Fixed Fix Version/s: 2.0.0 1.6.2 > Gracefully handle loss of DiskStore files > - > > Key: SPARK-15736 > URL: https://issues.apache.org/jira/browse/SPARK-15736 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.2, 2.0.0 > > > If an RDD partition is cached on disk and the DiskStore file is lost, then > reads of that cached partition will fail and the missing partition is > supposed to be recomputed by a new task attempt. In the current BlockManager > implementation, however, the missing file does not trigger any metadata > updates / does not invalidate the cache, so subsequent task attempts will be > scheduled on the same executor and the doomed read will be repeatedly > retried, leading to repeated task failures and eventually a total job failure. > In order to fix this problem, the executor with the missing file needs to > properly mark the corresponding block as missing so that it stops advertising > itself as a cache location for that block. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
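The behavior SPARK-15736 asks for can be sketched with a toy Python registry (this is not the BlockManager API; all names here are illustrative): a failed disk read must deregister the block so the executor stops advertising itself as a cache location and the partition is recomputed elsewhere, instead of the doomed read being retried forever.

```python
class ToyBlockStore:
    """Toy model of the fix: a missing file invalidates cache metadata."""

    def __init__(self):
        self._files = {}          # block_id -> bytes, simulating DiskStore
        self._advertised = set()  # blocks this executor claims to cache

    def put(self, block_id, data):
        self._files[block_id] = data
        self._advertised.add(block_id)

    def lose_file(self, block_id):
        # Simulates the on-disk file disappearing out from under us.
        self._files.pop(block_id, None)

    def get(self, block_id):
        if block_id not in self._files:
            # Key point: mark the block as missing rather than failing
            # repeatedly, so the scheduler recomputes it on another node.
            self._advertised.discard(block_id)
            return None
        return self._files[block_id]

store = ToyBlockStore()
store.put("rdd_0_0", b"partition bytes")
store.lose_file("rdd_0_0")
store.get("rdd_0_0")                    # the read fails once...
print("rdd_0_0" in store._advertised)   # ...and the cache location is gone
```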
[jira] [Resolved] (SPARK-15718) better error message for writing bucketing data
[ https://issues.apache.org/jira/browse/SPARK-15718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15718. --- Resolution: Fixed Fix Version/s: 2.0.0 > better error message for writing bucketing data > --- > > Key: SPARK-15718 > URL: https://issues.apache.org/jira/browse/SPARK-15718 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-15722) Wrong data when CTAS specifies schema
[ https://issues.apache.org/jira/browse/SPARK-15722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15722: -- Comment: was deleted (was: User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/13457) > Wrong data when CTAS specifies schema > - > > Key: SPARK-15722 > URL: https://issues.apache.org/jira/browse/SPARK-15722 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > {code} > scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV") > scala> (1 to 3).map { i => (i, i * 2, i * 3) }.toDF("height", "length", > "width").write.insertInto("boxes") > scala> spark.table("boxes").show() > +-+--+--+ > |width|length|height| > +-+--+--+ > |1| 2| 3| > |2| 4| 6| > |3| 6| 9| > +-+--+--+ > scala> sql("CREATE TABLE blocks (name STRING, age INT) AS SELECT * FROM > boxes") > scala> spark.table("blocks").show() > ++---+ > |name|age| > ++---+ > | 1| 2| > | 2| 4| > | 3| 6| > ++---+ > {code} > The columns don't even match in types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15711) Ban CREATE TEMP TABLE USING AS SELECT for now
[ https://issues.apache.org/jira/browse/SPARK-15711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15711. --- Resolution: Fixed Fix Version/s: 2.0.0 > Ban CREATE TEMP TABLE USING AS SELECT for now > - > > Key: SPARK-15711 > URL: https://issues.apache.org/jira/browse/SPARK-15711 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Sean Zhong >Priority: Critical > Fix For: 2.0.0 > > > CREATE TEMP TABLE USING AS SELECT is ill-defined. It requires that user to > specify the location and the temp data is not cleaned up when the session > exits. Before we fix it, I'd propose that we ban this command. I will create > a jira with description on proper temp table support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15646) When spark.sql.hive.convertCTAS is true, we may still convert the table to a parquet table when TEXTFILE or SEQUENCEFILE is specified.
[ https://issues.apache.org/jira/browse/SPARK-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15646: -- Assignee: Yin Huai > When spark.sql.hive.convertCTAS is true, we may still convert the table to a > parquet table when TEXTFILE or SEQUENCEFILE is specified. > -- > > Key: SPARK-15646 > URL: https://issues.apache.org/jira/browse/SPARK-15646 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > When {{spark.sql.hive.convertCTAS}} is true, we try to convert the table to a > parquet table if the user does not specify any storage format. However, we > only check serde, which causes us to still convert the table when > TEXTFILE/SEQUENCEFILE is specified and a serde is not provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15646) When spark.sql.hive.convertCTAS is true, we may still convert the table to a parquet table when TEXTFILE or SEQUENCEFILE is specified.
[ https://issues.apache.org/jira/browse/SPARK-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15646. --- Resolution: Fixed Fix Version/s: 2.0.0 > When spark.sql.hive.convertCTAS is true, we may still convert the table to a > parquet table when TEXTFILE or SEQUENCEFILE is specified. > -- > > Key: SPARK-15646 > URL: https://issues.apache.org/jira/browse/SPARK-15646 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > Fix For: 2.0.0 > > > When {{spark.sql.hive.convertCTAS}} is true, we try to convert the table to a > parquet table if the user does not specify any storage format. However, we > only check serde, which causes us to still convert the table when > TEXTFILE/SEQUENCEFILE is specified and a serde is not provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15722) Wrong data when CTAS specifies schema
Andrew Or created SPARK-15722: - Summary: Wrong data when CTAS specifies schema Key: SPARK-15722 URL: https://issues.apache.org/jira/browse/SPARK-15722 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or
{code}
scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV")
scala> (1 to 3).map { i => (i, i * 2, i * 3) }.toDF("height", "length", "width").write.insertInto("boxes")
scala> spark.table("boxes").show()
+-----+------+------+
|width|length|height|
+-----+------+------+
|    1|     2|     3|
|    2|     4|     6|
|    3|     6|     9|
+-----+------+------+
scala> sql("CREATE TABLE blocks (name STRING, age INT) AS SELECT * FROM boxes")
scala> spark.table("blocks").show()
+----+---+
|name|age|
+----+---+
|   1|  2|
|   2|  4|
|   3|  6|
+----+---+
{code}
The columns don't even match in types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15715) Altering partition storage information doesn't work in Hive
Andrew Or created SPARK-15715: - Summary: Altering partition storage information doesn't work in Hive Key: SPARK-15715 URL: https://issues.apache.org/jira/browse/SPARK-15715 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or In HiveClientImpl {code} private def toHivePartition( p: CatalogTablePartition, ht: HiveTable): HivePartition = { new HivePartition(ht, p.spec.asJava, p.storage.locationUri.map { l => new Path(l) }.orNull) } {code} Other than the location, we don't even store any of the storage information in the metastore: output format, input format, serde, serde props. The result is that doing something like the following doesn't actually do anything: {code} ALTER TABLE boxes PARTITION (width=3) SET SERDE 'com.sparkbricks.serde.ColumnarSerDe' WITH SERDEPROPERTIES ('compress'='true') {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15711) Ban CREATE TEMP TABLE USING AS SELECT for now
[ https://issues.apache.org/jira/browse/SPARK-15711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15711: -- Assignee: Sean Zhong > Ban CREATE TEMP TABLE USING AS SELECT for now > - > > Key: SPARK-15711 > URL: https://issues.apache.org/jira/browse/SPARK-15711 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Sean Zhong >Priority: Critical > > CREATE TEMP TABLE USING AS SELECT is ill-defined. It requires that user to > specify the location and the temp data is not cleaned up when the session > exits. Before we fix it, I'd propose that we ban this command. I will create > a jira with description on proper temp table support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15236. --- Resolution: Fixed Fix Version/s: 2.0.0 > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Xin Wu > Fix For: 2.0.0 > > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15618. --- Resolution: Fixed Fix Version/s: 2.0.0 > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15236: -- Assignee: Xin Wu > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Xin Wu > Fix For: 2.0.0 > > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15670) Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[ https://issues.apache.org/jira/browse/SPARK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15670: -- Assignee: Weichen Xu > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class > -- > > Key: SPARK-15670 > URL: https://issues.apache.org/jira/browse/SPARK-15670 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15670) Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[ https://issues.apache.org/jira/browse/SPARK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15670. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class > -- > > Key: SPARK-15670 > URL: https://issues.apache.org/jira/browse/SPARK-15670 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Reporter: Weichen Xu >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15662) Add since annotation for classes in sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-15662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15662. --- Resolution: Fixed Fix Version/s: 2.0.0 > Add since annotation for classes in sql.catalog > --- > > Key: SPARK-15662 > URL: https://issues.apache.org/jira/browse/SPARK-15662 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15635) ALTER TABLE RENAME doesn't work for datasource tables
Andrew Or created SPARK-15635: - Summary: ALTER TABLE RENAME doesn't work for datasource tables Key: SPARK-15635 URL: https://issues.apache.org/jira/browse/SPARK-15635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or {code} scala> sql("CREATE TABLE students (age INT, name STRING) USING parquet") scala> sql("ALTER TABLE students RENAME TO teachers") scala> spark.table("teachers").show() com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/andrew/Documents/dev/spark/andrew-spark/spark-warehouse/students; at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:170) at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:67) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:583) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:579) ... 48 elided Caused by: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/andrew/Documents/dev/spark/andrew-spark/spark-warehouse/students; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:351) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:340) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15450) Clean up SparkSession builder for python
[ https://issues.apache.org/jira/browse/SPARK-15450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15450. --- Resolution: Fixed Fix Version/s: 2.0.0 > Clean up SparkSession builder for python > > > Key: SPARK-15450 > URL: https://issues.apache.org/jira/browse/SPARK-15450 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Eric Liang > Fix For: 2.0.0 > > > This is the sister JIRA for SPARK-15075. Today we use > `SQLContext.getOrCreate` in our builder. Instead we should just have a real > `SparkSession.getOrCreate` and use that in our builder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15534) TRUNCATE TABLE should throw exceptions, not logError
[ https://issues.apache.org/jira/browse/SPARK-15534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15534. --- Resolution: Fixed Fix Version/s: 2.0.0 > TRUNCATE TABLE should throw exceptions, not logError > > > Key: SPARK-15534 > URL: https://issues.apache.org/jira/browse/SPARK-15534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > If the table to truncate doesn't exist, throw an exception! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15535) Remove code for TRUNCATE TABLE ... COLUMN
[ https://issues.apache.org/jira/browse/SPARK-15535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15535. --- Resolution: Fixed Fix Version/s: 2.0.0 > Remove code for TRUNCATE TABLE ... COLUMN > - > > Key: SPARK-15535 > URL: https://issues.apache.org/jira/browse/SPARK-15535 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > This was never supported in the first place. Also Hive doesn't support it: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15450) Clean up SparkSession builder for python
[ https://issues.apache.org/jira/browse/SPARK-15450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15450: -- Assignee: Eric Liang (was: Andrew Or) > Clean up SparkSession builder for python > > > Key: SPARK-15450 > URL: https://issues.apache.org/jira/browse/SPARK-15450 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Eric Liang > Fix For: 2.0.0 > > > This is the sister JIRA for SPARK-15075. Today we use > `SQLContext.getOrCreate` in our builder. Instead we should just have a real > `SparkSession.getOrCreate` and use that in our builder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304506#comment-15304506 ] Andrew Or commented on SPARK-15618: --- it needs to be internal. At least it should be private[spark] > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15569) Executors spending significant time in DiskObjectWriter.updateBytesWritten function
[ https://issues.apache.org/jira/browse/SPARK-15569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15569. --- Resolution: Fixed Assignee: Sital Kedia Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Executors spending significant time in DiskObjectWriter.updateBytesWritten > function > --- > > Key: SPARK-15569 > URL: https://issues.apache.org/jira/browse/SPARK-15569 > Project: Spark > Issue Type: Bug > Components: Shuffle >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 2.0.0 > > > While profiling a Spark job that spills a large amount of intermediate data, we > found that a significant portion of time is spent in the > DiskObjectWriter.updateBytesWritten function. Looking at the code > (https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala#L206), > we see that the function is called too frequently to update the number > of bytes written to disk. We should reduce the call frequency to avoid this overhead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
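The fix direction described in SPARK-15569 (reducing how often the bytes-written metric is updated) can be sketched in plain Scala. This is an illustrative analogue, not the actual Spark patch; the class and field names below are invented for the example:

```scala
// Batch metric updates instead of invoking the expensive callback per record.
// `ThrottledWriteMetrics` and its members are illustrative names, not Spark's.
class ThrottledWriteMetrics(updateEvery: Int = 16384) {
  private var pendingBytes: Long = 0L
  private var recordsSinceUpdate: Int = 0
  var reportedBytes: Long = 0L // what the (hypothetical) metrics system sees
  var updateCalls: Int = 0     // how often the expensive update actually ran

  def recordWritten(bytes: Long): Unit = {
    pendingBytes += bytes
    recordsSinceUpdate += 1
    if (recordsSinceUpdate >= updateEvery) flush()
  }

  // The analogue of updateBytesWritten: runs once per batch, not per record.
  def flush(): Unit = {
    if (pendingBytes > 0) {
      reportedBytes += pendingBytes
      updateCalls += 1
      pendingBytes = 0
      recordsSinceUpdate = 0
    }
  }
}

val m = new ThrottledWriteMetrics(updateEvery = 100)
(1 to 1000).foreach(_ => m.recordWritten(8))
m.flush() // final flush so no bytes are left unreported
```

With 1000 records and a batch size of 100, the update runs 10 times instead of 1000 while reporting the same total. The trade-off is that the metric lags by up to one batch between flushes.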
[jira] [Updated] (SPARK-15599) Document createDataset functions in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15599: -- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 Component/s: Documentation > Document createDataset functions in SparkSession > > > Key: SPARK-15599 > URL: https://issues.apache.org/jira/browse/SPARK-15599 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15599) Document createDataset functions in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15599. --- Resolution: Fixed Fix Version/s: 2.0.0 > Document createDataset functions in SparkSession > > > Key: SPARK-15599 > URL: https://issues.apache.org/jira/browse/SPARK-15599 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15599) Document createDataset functions in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15599: -- Assignee: Sameer Agarwal > Document createDataset functions in SparkSession > > > Key: SPARK-15599 > URL: https://issues.apache.org/jira/browse/SPARK-15599 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15584. --- Resolution: Fixed Fix Version/s: 2.0.0 > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
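One possible shape for the SPARK-15584 fix is to centralize the "spark.sql.sources." keys as constants, so a typo becomes a compile error instead of a silent failure. The object and key names below are illustrative, not the actual identifiers merged into Spark:

```scala
// Centralized property keys; reference the constant, never retype the string.
// `DataSourceProperties` and its members are illustrative names.
object DataSourceProperties {
  val Prefix = "spark.sql.sources."
  val Provider = Prefix + "provider"
  val NumParts = Prefix + "numParts"
}

// Callers use the constants on both the write and read sides, so the two
// sides cannot drift apart via a typo:
val props = Map(DataSourceProperties.Provider -> "parquet")
val provider = props.get(DataSourceProperties.Provider)
```

A misspelled constant name fails to compile, whereas a misspelled string literal would just make `props.get` return `None` at runtime.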
[jira] [Updated] (SPARK-15603) Replace SQLContext with SparkSession in ML/MLLib
[ https://issues.apache.org/jira/browse/SPARK-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15603: -- Fix Version/s: 2.0.0 > Replace SQLContext with SparkSession in ML/MLLib > > > Key: SPARK-15603 > URL: https://issues.apache.org/jira/browse/SPARK-15603 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > This issue replaces all deprecated `SQLContext` occurrences with > `SparkSession` in the `ML/MLLib` module, except the following two classes, which > take `SQLContext` as a function argument. > - ReadWrite.scala > - TreeModels.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15603) Replace SQLContext with SparkSession in ML/MLLib
[ https://issues.apache.org/jira/browse/SPARK-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15603: -- Assignee: Dongjoon Hyun > Replace SQLContext with SparkSession in ML/MLLib > > > Key: SPARK-15603 > URL: https://issues.apache.org/jira/browse/SPARK-15603 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > This issue replaces all deprecated `SQLContext` occurrences with > `SparkSession` in the `ML/MLLib` module, except the following two classes, which > take `SQLContext` as a function argument. > - ReadWrite.scala > - TreeModels.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15618: -- Priority: Minor (was: Major) > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15603) Replace SQLContext with SparkSession in ML/MLLib
[ https://issues.apache.org/jira/browse/SPARK-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15603: -- Affects Version/s: 2.0.0 > Replace SQLContext with SparkSession in ML/MLLib > > > Key: SPARK-15603 > URL: https://issues.apache.org/jira/browse/SPARK-15603 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > This issue replaces all deprecated `SQLContext` occurrences with > `SparkSession` in the `ML/MLLib` module, except the following two classes, which > take `SQLContext` as a function argument. > - ReadWrite.scala > - TreeModels.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
Andrew Or created SPARK-15618: - Summary: Use SparkSession.builder.sparkContext(...) in tests where possible Key: SPARK-15618 URL: https://issues.apache.org/jira/browse/SPARK-15618 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Dongjoon Hyun There are many places where we could be more explicit about the particular underlying SparkContext we want, but we just do `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
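The point of SPARK-15618 (an explicit `sparkContext(...)` call is clearer than a bare `getOrCreate()`) can be illustrated with a minimal plain-Scala builder analogue. These are stand-in classes, not the real `SparkSession.Builder` API:

```scala
// Minimal analogue of the builder pattern in question. `FakeContext` and
// `SessionBuilder` are illustrative stand-ins for SparkContext and
// SparkSession.Builder; they are not Spark classes.
class FakeContext(val name: String)

class SessionBuilder {
  private var ctx: Option[FakeContext] = None

  // Explicitly pin the underlying context, as the JIRA recommends for tests.
  def sparkContext(c: FakeContext): SessionBuilder = { ctx = Some(c); this }

  // Without an explicit context, fall back to creating one implicitly --
  // this is the ambiguity the JIRA wants tests to avoid.
  def getOrCreate(): FakeContext =
    ctx.getOrElse(new FakeContext("implicitly-created"))
}

val existing = new FakeContext("test-context")
val pinned = new SessionBuilder().sparkContext(existing).getOrCreate()
val implicitOne = new SessionBuilder().getOrCreate()
```

With the explicit call, a test knows exactly which context the session wraps; with the bare `getOrCreate()`, the result depends on ambient state.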
[jira] [Resolved] (SPARK-15536) Disallow TRUNCATE TABLE with external tables and views
[ https://issues.apache.org/jira/browse/SPARK-15536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15536. --- Resolution: Fixed Fix Version/s: 2.0.0 > Disallow TRUNCATE TABLE with external tables and views > -- > > Key: SPARK-15536 > URL: https://issues.apache.org/jira/browse/SPARK-15536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > Otherwise we might accidentally delete existing data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15538) Truncate table does not work on data source table
[ https://issues.apache.org/jira/browse/SPARK-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15538. --- Resolution: Fixed Fix Version/s: 2.0.0 > Truncate table does not work on data source table > - > > Key: SPARK-15538 > URL: https://issues.apache.org/jira/browse/SPARK-15538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Suresh Thalamati >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > Truncate table does not seem to work on data source tables. > Repro: > {code} > val df = Seq((1, "john", "CA"), (2, "Mike", "NY"), (3, "Robert", > "CA")).toDF("id", "name", "state") > df.write.format("parquet").partitionBy("state").saveAsTable("emp") > scala> sql("truncate table emp") > res8: org.apache.spark.sql.DataFrame = [] > scala> sql("select * from emp").show() // FileNotFoundException > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15596) ALTER TABLE RENAME needs to uncache query
Andrew Or created SPARK-15596: - Summary: ALTER TABLE RENAME needs to uncache query Key: SPARK-15596 URL: https://issues.apache.org/jira/browse/SPARK-15596 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15584: -- Assignee: Dongjoon Hyun > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15594) ALTER TABLE ... SERDEPROPERTIES does not respect partition spec
Andrew Or created SPARK-15594: - Summary: ALTER TABLE ... SERDEPROPERTIES does not respect partition spec Key: SPARK-15594 URL: https://issues.apache.org/jira/browse/SPARK-15594 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or {code} case class AlterTableSerDePropertiesCommand( tableName: TableIdentifier, serdeClassName: Option[String], serdeProperties: Option[Map[String, String]], partition: Option[Map[String, String]]) extends RunnableCommand { {code} The `partition` flag is not read anywhere! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
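What honoring the unread `partition` argument in SPARK-15594 could look like can be sketched in plain Scala: when a partition spec is given, update only the matching partition's storage; otherwise fall back to the table-level storage. The case classes and helper below are simplified stand-ins, not Spark's actual catalog types:

```scala
// Simplified stand-ins for the catalog types; not Spark's real classes.
case class Storage(serde: Option[String], properties: Map[String, String])
case class Partition(spec: Map[String, String], storage: Storage)
case class Table(storage: Storage, partitions: Seq[Partition])

// Illustrative helper: the key point is that `partitionSpec` is actually read.
def alterSerdeProps(
    table: Table,
    newProps: Map[String, String],
    partitionSpec: Option[Map[String, String]]): Table = {
  partitionSpec match {
    case Some(spec) =>
      // Update only the partition whose spec matches; leave the rest alone.
      val updated = table.partitions.map { p =>
        if (p.spec == spec)
          p.copy(storage =
            p.storage.copy(properties = p.storage.properties ++ newProps))
        else p
      }
      table.copy(partitions = updated)
    case None =>
      // No spec: the properties apply to the table itself.
      table.copy(storage =
        table.storage.copy(properties = table.storage.properties ++ newProps))
  }
}

val t = Table(
  Storage(None, Map.empty),
  Seq(Partition(Map("state" -> "CA"), Storage(None, Map.empty)),
      Partition(Map("state" -> "NY"), Storage(None, Map.empty))))
val altered =
  alterSerdeProps(t, Map("field.delim" -> ","), Some(Map("state" -> "CA")))
```

In the buggy version the `Some(spec)` branch effectively did not exist, so a partition-scoped ALTER silently modified table-level storage instead.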
[jira] [Created] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
Andrew Or created SPARK-15584: - Summary: Abstract duplicate code: "spark.sql.sources." properties Key: SPARK-15584 URL: https://issues.apache.org/jira/browse/SPARK-15584 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" etc. everywhere. If we mistype something then things will silently fail. This is pretty brittle. It would be better if we had static variables that we can reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15584: -- Issue Type: Improvement (was: Bug) > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303054#comment-15303054 ] Andrew Or commented on SPARK-15584: --- [~dongjoon] would you like to work on this? > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15584: -- Assignee: (was: Andrew Or) > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15584: -- Priority: Minor (was: Major) > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Priority: Minor > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15583) Relax ALTER TABLE properties restriction for data source tables
Andrew Or created SPARK-15583: - Summary: Relax ALTER TABLE properties restriction for data source tables Key: SPARK-15583 URL: https://issues.apache.org/jira/browse/SPARK-15583 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or Right now we simply reject ALTER TABLE SET TBLPROPERTIES on data source tables for all properties. This is overly restrictive; as long as the user doesn't touch anything in the special namespace (spark.sql.sources.*), we're OK. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
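The relaxed check proposed in SPARK-15583 amounts to rejecting only keys in the reserved namespace rather than all of them. A plain-Scala sketch of that rule (the helper name is illustrative, not Spark's actual code):

```scala
// Reserved namespace for data source table metadata.
val ReservedPrefix = "spark.sql.sources."

// Illustrative helper: return the keys a user must not set.
// An empty result means the ALTER TABLE SET TBLPROPERTIES can be allowed.
def findForbiddenKeys(props: Map[String, String]): Seq[String] =
  props.keys.filter(_.startsWith(ReservedPrefix)).toSeq.sorted

// Ordinary user properties pass; reserved keys are flagged.
val ok = findForbiddenKeys(Map("owner" -> "alice", "comment" -> "test"))
val bad = findForbiddenKeys(Map("spark.sql.sources.provider" -> "json"))
```

Under this rule, `SET TBLPROPERTIES ('owner'='alice')` succeeds on a data source table, while an attempt to overwrite `spark.sql.sources.provider` is still rejected.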