[jira] [Resolved] (SPARK-17672) Spark 2.0 history server web UI takes too long for a single application
[ https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-17672.
---
Resolution: Fixed
Fix Version/s: 2.0.1
Target Version/s: 2.0.1

> Spark 2.0 history server web UI takes too long for a single application
> -----------------------------------------------------------------------
>
> Key: SPARK-17672
> URL: https://issues.apache.org/jira/browse/SPARK-17672
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Reporter: Gang Wu
> Fix For: 2.0.1
>
> When there are 10K application histories in the history server back end, it can
> take a very long time to load even a single application history page. After
> some investigation, I found the root cause was the following piece of code:
> {code:title=OneApplicationResource.scala|borderStyle=solid}
> @Produces(Array(MediaType.APPLICATION_JSON))
> private[v1] class OneApplicationResource(uiRoot: UIRoot) {
>
>   @GET
>   def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
>     val apps = uiRoot.getApplicationInfoList.find { _.id == appId }
>     apps.getOrElse(throw new NotFoundException("unknown app: " + appId))
>   }
>
> }
> {code}
> Although all application history infos are stored in a LinkedHashMap, the
> code here transforms the map into an iterator and then uses the find() API,
> which is O(n), instead of the O(1) map.get() operation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
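The O(n)-vs-O(1) lookup difference described in the report can be sketched outside Spark. This is illustrative Python, not the Spark code itself; the names mimic the Scala snippet above:

```python
# Sketch of the lookup-cost difference: scanning all entries to find one id
# is O(n) per request; a direct dict/map lookup is O(1). With 10K histories
# this is the difference the report describes.

apps = {f"app-{i}": {"id": f"app-{i}", "name": f"job {i}"} for i in range(10_000)}

def find_linear(app_id):
    # Mimics uiRoot.getApplicationInfoList.find { _.id == appId }:
    # walks every entry until the id matches.
    return next((info for info in apps.values() if info["id"] == app_id), None)

def find_direct(app_id):
    # Mimics map.get(appId) on the LinkedHashMap: one hash lookup.
    return apps.get(app_id)

# Same result, very different cost per page load.
assert find_linear("app-9999") == find_direct("app-9999")
```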
[jira] [Resolved] (SPARK-17648) TaskSchedulerImpl.resourceOffers should take an IndexedSeq, not a Seq
[ https://issues.apache.org/jira/browse/SPARK-17648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-17648.
---
Resolution: Fixed
Fix Version/s: 2.1.0
Target Version/s: 2.1.0

> TaskSchedulerImpl.resourceOffers should take an IndexedSeq, not a Seq
> ---------------------------------------------------------------------
>
> Key: SPARK-17648
> URL: https://issues.apache.org/jira/browse/SPARK-17648
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Affects Versions: 2.0.0
> Reporter: Imran Rashid
> Assignee: Imran Rashid
> Priority: Minor
> Fix For: 2.1.0
>
> {{TaskSchedulerImpl.resourceOffers}} takes in a {{Seq[WorkerOffer]}}.
> However, later on it indexes into this by position. If you don't pass in an
> {{IndexedSeq}}, this turns an O(n) operation into an O(n^2) operation.
> In practice this isn't an issue, since, just by chance, in the important
> places this is called the data structures happen to already be
> {{IndexedSeq}}s. But we ought to tighten up the types to make this clearer.
> I ran into this while doing some performance tests on the scheduler:
> performance was terrible when I passed in a {{Seq}}, and even a few hundred
> offers were scheduled very slowly.
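The cost difference behind the ticket can be illustrated with Python stand-ins (this is not Spark code): positional access into a linked list costs O(i) per lookup, so indexing every element in a loop becomes O(n^2), while an array-backed sequence gives O(1) access:

```python
# Illustrative analog of Seq vs IndexedSeq indexing. A cons/linked list
# (like Scala's default List) must walk i nodes to reach index i; an
# array-backed sequence (like IndexedSeq) jumps there directly.

class Node:
    def __init__(self, value, nxt=None):
        self.value, self.next = value, nxt

def nth(head, i):
    # O(i) walk: the hidden cost of indexing a non-indexed Seq by position.
    while i > 0:
        head, i = head.next, i - 1
    return head.value

offers = list(range(1000))        # IndexedSeq analog: O(1) positional access
head = None
for v in reversed(offers):        # build the linked-list (Seq) analog
    head = Node(v, head)

# Same answer either way; doing this for every position is where O(n)
# silently becomes O(n^2).
assert nth(head, 500) == offers[500]
```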
[jira] [Resolved] (SPARK-17623) Failed tasks end reason is always a TaskFailedReason, types should reflect this
[ https://issues.apache.org/jira/browse/SPARK-17623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-17623.
---
Resolution: Fixed
Fix Version/s: 2.1.0

> Failed tasks end reason is always a TaskFailedReason, types should reflect
> this
> --------------------------------------------------------------------------
>
> Key: SPARK-17623
> URL: https://issues.apache.org/jira/browse/SPARK-17623
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Affects Versions: 2.0.0
> Reporter: Imran Rashid
> Assignee: Imran Rashid
> Priority: Minor
> Fix For: 2.1.0
>
> Minor code cleanup. In TaskResultGetter, enqueueFailedTask currently
> deserializes the result as a TaskEndReason, but the type is actually more
> specific: it's a TaskFailedReason. This just leads to more blind casting
> later on. It would be clearer if the message were cast to the right type
> immediately, so method parameter types could be tightened.
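The cleanup idea generalizes: narrow the type once at the boundary so downstream signatures can demand the specific type and drop the blind casts. A hedged Python sketch (illustrative class names mirroring the ticket, not the actual Spark code):

```python
# Narrow once at the entry point; everything after works with the specific
# type, so parameter types can be tightened and blind casts disappear.

class TaskEndReason: ...
class TaskFailedReason(TaskEndReason):
    def __init__(self, message):
        self.message = message

def handle_failure(reason: TaskFailedReason) -> str:
    # Tightened parameter type: no cast needed here.
    return reason.message

def enqueue_failed_task(result: TaskEndReason) -> str:
    # The one place we check/narrow; a failed task's reason is always
    # the more specific TaskFailedReason.
    if not isinstance(result, TaskFailedReason):
        raise TypeError("failed tasks must carry a TaskFailedReason")
    return handle_failure(result)
```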
[jira] [Resolved] (SPARK-17438) Master UI should show the correct core limit when `ApplicationInfo.executorLimit` is set
[ https://issues.apache.org/jira/browse/SPARK-17438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-17438.
---
Resolution: Fixed
Fix Version/s: 2.1.0, 2.0.1

> Master UI should show the correct core limit when
> `ApplicationInfo.executorLimit` is set
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-17438
> URL: https://issues.apache.org/jira/browse/SPARK-17438
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Fix For: 2.0.1, 2.1.0
>
> The core info of an application in the Master UI doesn't take
> `ApplicationInfo.executorLimit` into account. It's pretty confusing that the
> UI says "Unlimited" when `executorLimit` is set.
[jira] [Commented] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494591#comment-15494591 ] Andrew Ray commented on SPARK-17458:
[~hvanhovell]: My JIRA username is a1ray.

> Alias specified for aggregates in a pivot are not honored
> ---------------------------------------------------------
>
> Key: SPARK-17458
> URL: https://issues.apache.org/jira/browse/SPARK-17458
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Ravi Somepalli
> Assignee: Herman van Hovell
> Fix For: 2.1.0
>
> When using pivot with multiple aggregations we need an alias to avoid special
> characters, but the alias does not help, because
> df.groupBy("C").pivot("A").agg(avg("D").as("COLD"), max("B").as("COLB")).show
> produces:
> ||C || bar_avg(`D`) AS `COLD` || bar_max(`B`) AS `COLB` || foo_avg(`D`) AS `COLD` || foo_max(`B`) AS `COLB` ||
> |small| 5.5| two| 2.3335| two|
> |large| 5.5| two| 2.0| one|
> Expected output:
> ||C || bar_COLD || bar_COLB || foo_COLD || foo_COLB ||
> |small| 5.5| two| 2.3335| two|
> |large| 5.5| two| 2.0| one|
> One approach to fix this issue is to change
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
> and change the outputName method in
> {code}
> object ResolvePivot extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
> {code}
> to:
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   val suffix = aggregate match {
>     case n: NamedExpression => n.name
>     case _ => aggregate.sql
>   }
>   if (singleAgg) value.toString else value + "_" + suffix
> }
> {code}
> Version 2.0.0 currently has:
> {code}
> def outputName(value: Literal, aggregate: Expression): String = {
>   if (singleAgg) value.toString else value + "_" + aggregate.sql
> }
> {code}
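The naming change proposed in the report can be sketched in plain Python (illustrative only; `output_name` and its parameters mirror the Scala `outputName` snippet above and are not a real API): prefer the aggregate's alias, if it has one, over its raw SQL text when building pivot column names.

```python
# Hedged sketch of the proposed outputName fix: use the alias of a named
# aggregate as the suffix; fall back to the aggregate's SQL text otherwise.

def output_name(value, agg_name=None, agg_sql=None, single_agg=False):
    # agg_name: the alias when the aggregate is a NamedExpression, else None
    # agg_sql:  the raw SQL text of the aggregate (the 2.0.0 behavior)
    suffix = agg_name if agg_name is not None else agg_sql
    return str(value) if single_agg else f"{value}_{suffix}"

# With the fix, columns read bar_COLD instead of bar_avg(`D`) AS `COLD`.
assert output_name("bar", agg_name="COLD", agg_sql="avg(`D`) AS `COLD`") == "bar_COLD"
# Unaliased aggregates keep the old SQL-text naming.
assert output_name("bar", agg_sql="avg(`D`)") == "bar_avg(`D`)"
```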
[jira] [Issue Comment Deleted] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ray updated SPARK-17458:
---
Comment: was deleted (was: [~hvanhovell] It's a1ray)

> Alias specified for aggregates in a pivot are not honored
> ---------------------------------------------------------
[jira] [Comment Edited] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494361#comment-15494361 ] Andrew Ray edited comment on SPARK-17458 at 9/15/16 8:09 PM:
[~hvanhovell] It's a1ray

was (Author: a1ray): It's a1ray

> Alias specified for aggregates in a pivot are not honored
> ---------------------------------------------------------
[jira] [Commented] (SPARK-17458) Alias specified for aggregates in a pivot are not honored
[ https://issues.apache.org/jira/browse/SPARK-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494361#comment-15494361 ] Andrew Ray commented on SPARK-17458:
It's a1ray

> Alias specified for aggregates in a pivot are not honored
> ---------------------------------------------------------
[jira] [Commented] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property
[ https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15490663#comment-15490663 ] Andrew Or commented on SPARK-15917:
---
By the way, is there a pull request?

> Define the number of executors in standalone mode with an easy-to-use property
> ------------------------------------------------------------------------------
>
> Key: SPARK-15917
> URL: https://issues.apache.org/jira/browse/SPARK-15917
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, Spark Shell, Spark Submit
> Affects Versions: 1.6.1
> Reporter: Jonathan Taws
> Priority: Minor
>
> After stumbling across a few StackOverflow posts around the issue of using a
> fixed number of executors in standalone mode (non-YARN), I was wondering if
> we could not add an easier way to set this parameter than having to resort to
> calculations based on the number of cores and the memory available on your
> worker.
> For example, let's say I have 8 cores and 30GB of memory available:
> - If no option is passed, one executor will be spawned with 8 cores and 1GB
> of memory allocated.
> - However, if I want to have only *2* executors, each using 2 cores and 10GB
> of memory, I will end up with *3* executors (as the available memory will
> limit the number of executors) instead of the 2 I was hoping for.
> Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I
> want, but would it not be easier to add a {{--num-executors}}-like option to
> standalone mode to really fine-tune the configuration? This option is
> already available in YARN mode.
> From my understanding, I don't see any other option lying around that can
> help achieve this.
> This seems slightly disturbing for newcomers, and standalone mode is
> probably the first thing anyone will use to just try out Spark or test some
> configuration.
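The surprise in the reporter's example comes down to simple arithmetic. A back-of-the-envelope sketch (assumption: standalone mode caps executors per worker by both cores and memory, with the lower cap winning; this is not the actual scheduler code):

```python
# Why asking for 2-core/10GB executors on an 8-core/30GB worker yields 3
# executors: each resource imposes its own cap, and the lower one wins.

def executors_spawned(worker_cores, worker_mem_gb, exec_cores, exec_mem_gb):
    by_cores = worker_cores // exec_cores      # how many fit by CPU
    by_memory = worker_mem_gb // exec_mem_gb   # how many fit by memory
    return min(by_cores, by_memory)

# Cores would allow 4 executors, memory allows 3 -> 3, not the desired 2.
assert executors_spawned(8, 30, 2, 10) == 3
```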
[jira] [Commented] (SPARK-17310) Disable Parquet's record-by-record filter in normal parquet reader and do it in Spark-side
[ https://issues.apache.org/jira/browse/SPARK-17310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15448672#comment-15448672 ] Andrew Duffy commented on SPARK-17310:
---
+1 to this; see the comments on https://github.com/apache/spark/pull/14671, particularly rdblue's comment. We need to wait for the next release of Parquet to be able to set the {{parquet.filter.record-level.enabled}} config.

> Disable Parquet's record-by-record filter in normal parquet reader and do it
> in Spark-side
> ------------------------------------------------------------------------------------------
>
> Key: SPARK-17310
> URL: https://issues.apache.org/jira/browse/SPARK-17310
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Hyukjin Kwon
>
> Currently, we push filters down to the normal Parquet reader, which also
> filters record-by-record.
> It seems Spark-side codegen row-by-row filtering might be faster than
> Parquet's in general, due to the type-boxing and virtual function calls that
> Spark's version tries to avoid.
> Maybe we should perform a benchmark and disable this. This ticket came from
> https://github.com/apache/spark/pull/14671
> Please refer to the discussion in the PR.
[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv
[ https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435539#comment-15435539 ] Andrew Ash commented on SPARK-17227:
---
Rob and I work together, and we've seen datasets in mostly-CSV format that have
non-standard record delimiters (the '\0' character, for instance).
For some broader context: we've created our own CSV text parser and use it in
all our various internal products that use Spark, but we would like to
contribute this additional flexibility back to the Spark community at large
and, in the process, eliminate the need for our internal CSV datasource.
Here are the tickets Rob just opened that we would require in order to
eliminate our internal CSV datasource: SPARK-17222, SPARK-17224, SPARK-17225,
SPARK-17226, SPARK-17227.
The basic question, then, is: would the Spark community accept patches that
extend Spark's CSV parser to cover these features? We're willing to write the
code and get the patches through code review, but we would rather know up front
if these changes would never be accepted into mainline Spark due to
philosophical disagreements about what Spark's CSV datasource should be.

> Allow configuring record delimiter in csv
> -----------------------------------------
>
> Key: SPARK-17227
> URL: https://issues.apache.org/jira/browse/SPARK-17227
> Project: Spark
> Issue Type: Improvement
> Reporter: Robert Kruszewski
> Priority: Minor
>
> Instead of the hard-coded "\n".
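The requested flexibility amounts to parameterizing record splitting on a delimiter instead of hard-coding "\n". A minimal sketch in plain Python (an assumption-laden stand-in, not the Spark CSV datasource API):

```python
# Split a byte stream into records on a configurable delimiter, e.g. the
# '\0' character mentioned in the comment instead of the hard-coded "\n".

def split_records(data: bytes, delimiter: bytes = b"\n"):
    records = data.split(delimiter)
    # A trailing delimiter produces an empty final record; drop empties.
    return [r for r in records if r]

blob = b"1,alice\x002,bob\x00"
assert split_records(blob, b"\x00") == [b"1,alice", b"2,bob"]
```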
[jira] [Updated] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken
[ https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Duffy updated SPARK-17213:
---
Description:

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -----------------------------------------------------
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Andrew Duffy
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays,
> which compare bytes as unsigned integers. Currently, however, Parquet does
> not respect this ordering. This is in the process of being fixed in Parquet
> (JIRA and PR links below), but currently all filters over strings are
> broken, with an actual correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from an in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows as {{[a, é]}}, but Parquet's
> string comparison is based on signed byte array comparison, so it will
> actually create 1 row group with statistics {{min=é,max=a}}, and the row
> group will be dropped by the query.
> Based on the way Parquet pushes down Eq, that case will not affect
> correctness, but it will force you to read row groups you should be able to
> skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362
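The signed-vs-unsigned mismatch at the heart of the bug is easy to demonstrate outside Parquet. An illustrative Python sketch (not Parquet or Spark code): UTF-8 for "é" is the bytes 0xC3 0xA9, and 0xC3 is 195 unsigned but -61 as a signed byte, which flips the ordering relative to "a" (0x61 = 97):

```python
# Spark orders strings by comparing UTF-8 bytes as unsigned integers;
# Parquet's broken statistics compared them as signed bytes. For "a" vs "é"
# the two orderings disagree, which is how a row group ends up with
# min=é, max=a and gets wrongly skipped by "name > 'a'".

def unsigned_cmp(a: bytes, b: bytes) -> int:
    # Python bytes already compare lexicographically as unsigned ints.
    return (a > b) - (a < b)

def signed_cmp(a: bytes, b: bytes) -> int:
    # Reinterpret each byte as a signed 8-bit value, then compare.
    sa = [x - 256 if x > 127 else x for x in a]
    sb = [x - 256 if x > 127 else x for x in b]
    return (sa > sb) - (sa < sb)

a, e = "a".encode("utf-8"), "é".encode("utf-8")
assert unsigned_cmp(a, e) == -1   # Spark's view: "a" sorts before "é"
assert signed_cmp(a, e) == 1      # signed view: "é" sorts before "a"
```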
[jira] [Updated] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken
[ https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Duffy updated SPARK-17213:
---
Description:

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -----------------------------------------------------
>
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362
[jira] [Created] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken
Andrew Duffy created SPARK-17213:
------------------------------------
Summary: Parquet String Pushdown for Non-Eq Comparisons Broken
Key: SPARK-17213
URL: https://issues.apache.org/jira/browse/SPARK-17213
Project: Spark
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Andrew Duffy

Spark defines ordering over strings based on comparison of UTF8 byte arrays,
which compare bytes as unsigned integers. Currently, however, Parquet does not
respect this ordering. This is in the process of being fixed in Parquet (JIRA
and PR link below), but currently all filters over strings are broken, with an
actual correctness issue for {{>}} and {{<}}.

*Repro:*

Querying directly from an in-memory DataFrame:
{code}
> Seq("a", "é").toDF("name").where("name > 'a'").count
1
{code}

Querying from a parquet dataset:
{code}
> Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> spark.read.parquet("/tmp/bad").where("name > 'a'").count
0
{code}

This happens because Spark sorts the rows as {{[a, é]}}, but Parquet's string
comparison is based on signed byte array comparison, so it will actually
create 1 row group with statistics {{min=é,max=a}}, and the row group will be
dropped by the query.
[jira] [Commented] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431409#comment-15431409 ] Andrew Davidson commented on SPARK-17172:
---
Hi Sean
I forgot about that older JIRA issue. I never resolved it.
I am using Jupyter. I believe each notebook gets its own Spark context. I
googled around and found some old issues that seem to suggest that a Hive
context and a SQL context were being created. I have not figured out how to
either use a different database for the Hive context or prevent the original
Spark context from being created.

> pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred
> while calling None.org.apache.spark.sql.hive.HiveContext.
> ------------------------------------------------------------------------
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22)
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
> Reporter: Andrew Davidson
> Attachments: hiveUDFBug.html, hiveUDFBug.ipynb
>
> {code}
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
>
> # Define udf
> from pyspark.sql.functions import udf
>
> def scoreToCategory(score):
>     if score >= 80: return 'A'
>     elif score >= 60: return 'B'
>     elif score >= 35: return 'C'
>     else: return 'D'
>
> udfScoreToCategory = udf(scoreToCategory, StringType())
> {code}
> throws exception
> {code}
> Py4JJavaError: An error occurred while calling
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> {code}
[jira] [Commented] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431371#comment-15431371 ] Andrew Davidson commented on SPARK-17172:
---
Hi Sean
The data center was created using spark-ec2 from spark-1.6.1-bin-hadoop2.6.
[ec2-user@ip-172-31-22-140 root]$ cat /root/spark/RELEASE
Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0
Build flags: -Psparkr -Phadoop-1 -Phive -Phive-thriftserver -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DzincPort=3032
[ec2-user@ip-172-31-22-140 root]$

> pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred
> while calling None.org.apache.spark.sql.hive.HiveContext.
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431018#comment-15431018 ] Andrew Davidson commented on SPARK-17172: - Hi Sean, It should be very easy to use the attached notebook to reproduce the Hive bug. I got the code example from a blog; the original code worked in Spark 1.5.x. I also attached an HTML version of the notebook so you can see the entire stack trace without having to start Jupyter. thanks Andy > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430004#comment-15430004 ] Andrew Davidson commented on SPARK-17172: - Hi Sean, I do not think it is the same error. In the related bug, I could not create a UDF using sqlContext; the workaround was to change the permissions on hdfs:///tmp. The error message actually mentioned a problem with /tmp (I had thought the message referred to file:///tmp). I am not sure how the permissions got messed up; maybe someone deleted the directory by accident and Spark does not recreate it if it is missing? So I am now able to create a UDF using sqlContext, but hiveContext does not work. Given that I fixed the hdfs:/// permission problem, I think it is probably something else. Hopefully the attached notebook makes it easy to reproduce. thanks Andy > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. 
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
[jira] [Updated] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17172: Attachment: hiveUDFBug.ipynb hiveUDFBug.html > pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429465#comment-15429465 ] Andrew Davidson commented on SPARK-17172: - Attached a notebook that demonstrates the bug. Also attached an HTML version of the notebook. > pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
[jira] [Commented] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429463#comment-15429463 ] Andrew Davidson commented on SPARK-17172: - related bug report : https://issues.apache.org/jira/browse/SPARK-17143 > pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17172) pyspark hiveContext cannot create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
Andrew Davidson created SPARK-17172: --- Summary: pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. Key: SPARK-17172 URL: https://issues.apache.org/jira/browse/SPARK-17172 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.2 Environment: spark version: 1.6.2 python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] Reporter: Andrew Davidson from pyspark.sql import HiveContext sqlContext = HiveContext(sc) # Define udf from pyspark.sql.functions import udf def scoreToCategory(score): if score >= 80: return 'A' elif score >= 60: return 'B' elif score >= 35: return 'C' else: return 'D' udfScoreToCategory=udf(scoreToCategory, StringType()) throws exception Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
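Incidentally, the snippet in this report would fail before Hive is even involved: it uses `StringType` without importing it. The grading logic itself is plain Python and can be checked without Spark; a minimal sketch (the commented-out lines show the extra imports a working UDF registration would need):

```python
def score_to_category(score):
    """Map a numeric score to a letter grade (logic from the bug report)."""
    if score >= 80:
        return 'A'
    elif score >= 60:
        return 'B'
    elif score >= 35:
        return 'C'
    return 'D'

# Registering it as a Spark UDF additionally requires (missing from the report):
#   from pyspark.sql.types import StringType
#   from pyspark.sql.functions import udf
#   score_udf = udf(score_to_category, StringType())

print([score_to_category(s) for s in (95, 70, 40, 10)])  # ['A', 'B', 'C', 'D']
```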
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427394#comment-15427394 ] Andrew Davidson commented on SPARK-17143: - See the email from the user's group below. I was able to find a workaround. I am not sure how hdfs:///tmp/ got created or how the permissions got messed up. ## NICE CATCH!!! Many thanks. I spent all day on this bug. The error message reports /tmp; I did not think to look on HDFS. [ec2-user@ip-172-31-22-140 notebooks]$ hadoop fs -ls hdfs:///tmp/ Found 1 items -rw-r--r-- 3 ec2-user supergroup 418 2016-04-13 22:49 hdfs:///tmp [ec2-user@ip-172-31-22-140 notebooks]$ I have no idea how hdfs:///tmp got created. I deleted it; this caused a bunch of exceptions, but those exceptions had useful messages. I was able to fix the problem as follows: $ hadoop fs -rmr hdfs:///tmp Now when I run the notebook it creates hdfs:///tmp/hive, but the permissions are wrong: $ hadoop fs -chmod 777 hdfs:///tmp/hive From: Felix Cheung Date: Thursday, August 18, 2016 at 3:37 PM To: Andrew Davidson , "user @spark" Subject: Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp Do you have a file called tmp at / on HDFS? > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. 
> : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427278#comment-15427278 ] Andrew Davidson commented on SPARK-17143: - given the exception metioned an issue with /tmp I decide to track how /tmp changed when run my cell # no spark jobs are running [ec2-user@ip-172-31-22-140 notebooks]$ !ls ls /tmp/ hsperfdata_ec2-user hsperfdata_root pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # start notebook server $ nohup startIPythonNotebook.sh > startIPythonNotebook.sh.out & [ec2-user@ip-172-31-22-140 notebooks]$ !ls ls /tmp/ hsperfdata_ec2-user hsperfdata_root pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # start the udfBug notebook [ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/ hsperfdata_ec2-user hsperfdata_root libnetty-transport-native-epoll818283657820702.so pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # execute cell that define UDF [ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/ hsperfdata_ec2-user hsperfdata_root libnetty-transport-native-epoll818283657820702.so pip_build_ec2-user spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9 [ec2-user@ip-172-31-22-140 notebooks]$ [ec2-user@ip-172-31-22-140 notebooks]$ find /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/ /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/ /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/db.lck /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log.ctrl /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/README_DO_NOT_TOUCH_FILES.txt /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/logmirror.ctrl /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/service.properties /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/README_DO_NOT_TOUCH_FILES.txt 
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0 /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c230.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c4b0.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c241.dat [... several dozen more Derby metastore seg0/*.dat files; listing truncated in the original message ...]
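The comment above tracks /tmp by running `ls` before and after each notebook step; the same diff can be made explicit with a few lines of plain Python (illustrative only, using a throwaway temporary directory rather than the real /tmp):

```python
import os
import tempfile

def snapshot(path):
    """Return the set of entries currently present in a directory."""
    return set(os.listdir(path))

workdir = tempfile.mkdtemp()  # stand-in for /tmp in the transcript above
before = snapshot(workdir)
# simulate a step (e.g. defining the UDF) dropping files into the directory
open(os.path.join(workdir, "spark-metastore.lock"), "w").close()
after = snapshot(workdir)
print(sorted(after - before))  # ['spark-metastore.lock']
```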
[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17143: Attachment: udfBug.html This html version of the notebook shows the output when run in my data center > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. 
hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062 answer = self._gateway_client.send_command(command) >1063 return_value = get_return_value( > ->
[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17143: Attachment: udfBug.ipynb The attached notebook demonstrated the reported bug. Note it includes the output when run on my mac book pro. The bug report contains the stack trace when the same code is run in my data center > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user “ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. 
hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062
[jira] [Created] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
Andrew Davidson created SPARK-17143: --- Summary: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp Key: SPARK-17143 URL: https://issues.apache.org/jira/browse/SPARK-17143 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.1 Environment: spark version: 1.6.1 python version: 3.4.3 (default, Apr 1 2015, 18:10:40) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] Reporter: Andrew Davidson For unknown reason I can not create UDF when I run the attached notebook on my cluster. I get the following error Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp The notebook runs fine on my Mac In general I am able to run non UDF spark code with out any trouble I start the notebook server as the user “ec2-user" and uses master URL spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 I found the following message in the notebook server log file. I have log level set to warn 16/08/18 21:38:45 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException The cluster was originally created using spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 #from pyspark.sql import SQLContext, HiveContext #sqlContext = SQLContext(sc) #from pyspark.sql import DataFrame #from pyspark.sql import functions from pyspark.sql.types import StringType from pyspark.sql.functions import udf print("spark version: {}".format(sc.version)) import sys print("python version: {}".format(sys.version)) spark version: 1.6.1 python version: 3.4.3 (default, Apr 1 2015, 18:10:40) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] # functions.lower() raises # py4j.Py4JException: Method lower([class java.lang.String]) does not exist # work around define a UDF toLowerUDFRetType = StringType() #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) toLowerUDF = udf(lambda s : s.lower(), StringType()) You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt assembly

{code}
Py4JJavaErrorTraceback (most recent call last)
<ipython-input> in <module>()
      4 toLowerUDFRetType = StringType()
      5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
----> 6 toLowerUDF = udf(lambda s : s.lower(), StringType())

/root/spark/python/pyspark/sql/functions.py in udf(f, returnType)
   1595     [Row(slen=5), Row(slen=3)]
   1596     """
-> 1597     return UserDefinedFunction(f, returnType)
   1598
   1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']

/root/spark/python/pyspark/sql/functions.py in __init__(self, func, returnType, name)
   1556         self.returnType = returnType
   1557         self._broadcast = None
-> 1558         self._judf = self._create_judf(name)
   1559
   1560     def _create_judf(self, name):

/root/spark/python/pyspark/sql/functions.py in _create_judf(self, name)
   1567         pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command, self)
   1568         ctx = SQLContext.getOrCreate(sc)
-> 1569         jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
   1570         if name is None:
   1571             name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__

/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
    681         try:
    682             if not hasattr(self, '_scala_HiveContext'):
--> 683                 self._scala_HiveContext = self._get_hive_ctx()
    684             return self._scala_HiveContext
    685         except Py4JError as e:

/root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
    690
    691     def _get_hive_ctx(self):
--> 692         return self._jvm.HiveContext(self._jsc.sc())
    693
    694     def refreshTable(self, tableName):

/root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1062         answer = self._gateway_client.send_command(command)
   1063         return_value = get_return_value(
-> 1064             answer, self._gateway_client, None, self._fqn)
   1065
   1066         for temp_arg in temp_args:

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/root/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306             raise Py4JJavaError(
    307                 "An error occurred
{code}
[jira] [Created] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq
Andrew Duffy created SPARK-17091: Summary: ParquetFilters rewrite IN to OR of Eq Key: SPARK-17091 URL: https://issues.apache.org/jira/browse/SPARK-17091 Project: Spark Issue Type: Bug Reporter: Andrew Duffy Past attempts at pushing down the InSet operation for Parquet relied on user-defined predicates. It would be simpler to rewrite an IN clause into the corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
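The rewrite proposed above is mechanical: an IN predicate over a literal set becomes a disjunction of equality predicates, which Parquet's filter API already understands. A small Python sketch over a toy predicate AST (the `Eq`/`Or`/`In` node classes here are illustrative stand-ins, not Spark's actual classes):

```python
# Toy predicate AST: rewrite In(col, [v1, v2, ...]) into
# Eq(col, v1) OR Eq(col, v2) OR ... as a left-deep OR chain.
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)
class Eq:
    column: str
    value: object

@dataclass(frozen=True)
class Or:
    left: object
    right: object

@dataclass(frozen=True)
class In:
    column: str
    values: tuple

def rewrite_in(pred):
    """Rewrite an In predicate; leave every other predicate untouched."""
    if isinstance(pred, In):
        eqs = [Eq(pred.column, v) for v in pred.values]
        return reduce(Or, eqs)  # fold the equality tests into an OR chain
    return pred

print(rewrite_in(In("k", (1, 2, 3))))
```

The same fold is what a pushdown layer would emit for Parquet, where row groups can then be skipped per equality condition.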
[jira] [Created] (SPARK-17059) Allow FileFormat to specify partition pruning strategy
Andrew Duffy created SPARK-17059: Summary: Allow FileFormat to specify partition pruning strategy Key: SPARK-17059 URL: https://issues.apache.org/jira/browse/SPARK-17059 Project: Spark Issue Type: Bug Reporter: Andrew Duffy Allow Spark to have pluggable pruning of input files for FileSourceScanExec by allowing FileFormat implementations to specify a format-specific filterPartitions method. This is especially useful for Parquet as Spark does not currently make use of the summary metadata, instead reading the footer of all part files for a Parquet data source. This can lead to massive speedups when reading a filtered chunk of a dataset, especially when using remote storage (S3). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
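The payoff described above comes from skipping part files whose summary statistics rule them out before any footers are opened. A toy Python sketch of such a filterPartitions-style hook (the file names and the per-file min/max stats format are invented for illustration; real Parquet stats live in the footer or summary metadata):

```python
# Sketch: prune part files whose [min, max] value range cannot satisfy
# a range filter, so only the surviving files are ever opened.

def filter_partitions(files, lower, upper):
    """Keep files whose value range overlaps the filter range [lower, upper]."""
    return [name for name, (fmin, fmax) in files.items()
            if fmax >= lower and fmin <= upper]

files = {
    "part-00000": (0, 99),
    "part-00001": (100, 199),
    "part-00002": (200, 299),
}

# A filter like `value BETWEEN 150 AND 180` only needs one of the three files.
print(filter_partitions(files, 150, 180))
```

On remote storage this matters because every footer read is a round trip; pruning by summary stats turns O(files) reads into O(matching files).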
[jira] [Commented] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself
[ https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418149#comment-15418149 ] Andrew Ash commented on SPARK-17029: Note RDD form usage from https://issues.apache.org/jira/browse/SPARK-10705 > Dataset toJSON goes through RDD form instead of transforming dataset itself > --- > > Key: SPARK-17029 > URL: https://issues.apache.org/jira/browse/SPARK-17029 > Project: Spark > Issue Type: Bug >Reporter: Robert Kruszewski > > No longer necessary and can be optimized with datasets -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central
[ https://issues.apache.org/jira/browse/SPARK-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew B closed SPARK-16780. > spark-streaming-kafka_2.10 version 2.0.0 not on maven central > - > > Key: SPARK-16780 > URL: https://issues.apache.org/jira/browse/SPARK-16780 > Project: Spark > Issue Type: Bug >Reporter: Andrew B > > I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven > central. Has this been released? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central
[ https://issues.apache.org/jira/browse/SPARK-16780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15398105#comment-15398105 ] Andrew B commented on SPARK-16780: -- How are the new artifacts used with the example below? https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java#L3 The example contains a reference to the KafkaUtils class, which contains a createStream() method. However, org.apache.spark.streaming.kafka010.KafkaUtils, which is in spark-streaming-kafka-0-10_2.10, has switched over to the Direct Stream API, so it does not have a createStream method. > spark-streaming-kafka_2.10 version 2.0.0 not on maven central > - > > Key: SPARK-16780 > URL: https://issues.apache.org/jira/browse/SPARK-16780 > Project: Spark > Issue Type: Bug >Reporter: Andrew B > > I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven > central. Has this been released? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16780) spark-streaming-kafka_2.10 version 2.0.0 not on maven central
Andrew B created SPARK-16780: Summary: spark-streaming-kafka_2.10 version 2.0.0 not on maven central Key: SPARK-16780 URL: https://issues.apache.org/jira/browse/SPARK-16780 Project: Spark Issue Type: Bug Reporter: Andrew B I cannot seem to find spark-streaming-kafka_2.10 version 2.0.0 on maven central. Has this been released? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16665) python import pyspark fails in context.py
[ https://issues.apache.org/jira/browse/SPARK-16665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389135#comment-15389135 ] Andrew Jefferson commented on SPARK-16665: -- This was the result of a previous failed import in python > python import pyspark fails in context.py > -- > > Key: SPARK-16665 > URL: https://issues.apache.org/jira/browse/SPARK-16665 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Andrew Jefferson >Priority: Critical > > Using 2.0.0 Release Candidate 5 > python > import pyspark > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "pyspark/__init__.py", line 44, in <module> > from pyspark.context import SparkContext > File "pyspark/context.py", line 28, in <module> > from pyspark import accumulators > ImportError: cannot import name accumulators -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16665) python import pyspark fails in context.py
[ https://issues.apache.org/jira/browse/SPARK-16665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jefferson resolved SPARK-16665. -- Resolution: Cannot Reproduce > python import pyspark fails in context.py > -- > > Key: SPARK-16665 > URL: https://issues.apache.org/jira/browse/SPARK-16665 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Andrew Jefferson >Priority: Critical > > Using 2.0.0 Release Candidate 5 > python > import pyspark > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "pyspark/__init__.py", line 44, in <module> > from pyspark.context import SparkContext > File "pyspark/context.py", line 28, in <module> > from pyspark import accumulators > ImportError: cannot import name accumulators -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16665) python import pyspark fails in context.py
[ https://issues.apache.org/jira/browse/SPARK-16665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387908#comment-15387908 ] Andrew Jefferson commented on SPARK-16665: -- Pull Request here: https://github.com/apache/spark/pull/14303 > python import pyspark fails in context.py > -- > > Key: SPARK-16665 > URL: https://issues.apache.org/jira/browse/SPARK-16665 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Andrew Jefferson >Priority: Critical > > Using 2.0.0 Release Candidate 5 > python > import pyspark > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "pyspark/__init__.py", line 44, in <module> > from pyspark.context import SparkContext > File "pyspark/context.py", line 28, in <module> > from pyspark import accumulators > ImportError: cannot import name accumulators -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16665) python import pyspark fails in context.py
Andrew Jefferson created SPARK-16665: Summary: python import pyspark fails in context.py Key: SPARK-16665 URL: https://issues.apache.org/jira/browse/SPARK-16665 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Andrew Jefferson Priority: Critical Using 2.0.0 Release Candidate 5 python import pyspark Traceback (most recent call last): File "<stdin>", line 1, in <module> File "pyspark/__init__.py", line 44, in <module> from pyspark.context import SparkContext File "pyspark/context.py", line 28, in <module> from pyspark import accumulators ImportError: cannot import name accumulators -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16265) Add option to SparkSubmit to ship driver JRE to YARN
[ https://issues.apache.org/jira/browse/SPARK-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365998#comment-15365998 ] Andrew Duffy commented on SPARK-16265: -- Hi Sean, yeah I can see where you're coming from, but I feel like this change is simple and targeted enough (meant to be used with the {{SparkLauncher}} API) that it can actually be useful without adding much (if any) maintenance load. If anything I would argue it at least deserves consideration as an experimental feature, as users who write programs that use SparkLauncher are going to have to split Java versions for the code that launches and interacts with the Spark app and the Spark app itself if the application is e.g. written for one environment and then deployed in another uncontrolled customer environment where the cluster does not have Java 8 installed. > Add option to SparkSubmit to ship driver JRE to YARN > > > Key: SPARK-16265 > URL: https://issues.apache.org/jira/browse/SPARK-16265 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.2 >Reporter: Andrew Duffy > > Add an option to {{SparkSubmit}} to allow the driver to package up its > version of the JRE to be shipped to a YARN cluster. This allows deploying > Spark applications whose required Java version need not match one of the > versions already installed on the YARN cluster, useful in situations in > which the Spark Application developer does not have administrative access > over the YARN cluster (e.g. school or corporate environment) but still > wants to use certain language features in their code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359277#comment-15359277 ] Andrew Davidson commented on SPARK-15829: - Hi Sean you mention the ec2 script is not supported anymore? What was the last release it was supported in? It's still part of the 1.6.x documentation Is there a replacement or alternative? thanks Andy > spark master webpage links to application UI broke when running in cluster > mode > --- > > Key: SPARK-15829 > URL: https://issues.apache.org/jira/browse/SPARK-15829 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.6.1 > Environment: AWS ec2 cluster >Reporter: Andrew Davidson >Priority: Critical > > Hi > I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > I use the standalone cluster manager. I have a streaming app running in > cluster mode. I notice the master webpage links to the application UI page > are incorrect > It does not look like jira will let me upload images. I'll try and describe > the web pages and the bug > My master is running on > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/ > It has a section marked "applications". If I click on one of the running > application ids I am taken to a page showing "Executor Summary". This page > has a link to the 'application detail UI' the url is > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/ > Notice it thinks the application UI is running on the cluster master. > It is actually running on the same machine as the driver on port 4041. I was > able to reverse engineer the url by noticing the private ip address is part of > the worker id . 
For example worker-20160322041632-172.31.23.201-34909 > next I went on the aws ec2 console to find the public DNS name for this > machine > http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/ > Kind regards > Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359215#comment-15359215 ] Andrew Davidson commented on SPARK-15829: - Hi Sean I am not sure how to check the value of 'SPARK_MASTER_HOST'. I looked at the documentation pages http://spark.apache.org/docs/latest/configuration.html and http://spark.apache.org/docs/latest/ec2-scripts.html. They do not mention SPARK_MASTER_HOST. When I submit my jobs I use MASTER_URL=spark://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:6066 I use the standalone cluster manager. I think the problem may be that the web UI assumes the driver is always running on the master machine. I assume the cluster manager decides which worker the driver will run on. Is there a way for the web UI to discover where the driver is running? On my master:
{code}
[ec2-user@ip-172-31-22-140 conf]$ pwd
/root/spark/conf
[ec2-user@ip-172-31-22-140 conf]$ cat slaves
ec2-54-193-94-207.us-west-1.compute.amazonaws.com
ec2-54-67-13-246.us-west-1.compute.amazonaws.com
ec2-54-67-48-49.us-west-1.compute.amazonaws.com
ec2-54-193-104-169.us-west-1.compute.amazonaws.com
[ec2-user@ip-172-31-22-140 conf]$ grep SPARK_MASTER_HOST *
[ec2-user@ip-172-31-22-140 sbin]$ pwd
/root/spark/sbin
[ec2-user@ip-172-31-22-140 sbin]$ grep SPARK_MASTER_HOST *
[ec2-user@ip-172-31-22-140 bin]$ grep SPARK_MASTER_HOST *
{code}
Thanks for looking into this Andy > spark master webpage links to application UI broke when running in cluster > mode > --- > > Key: SPARK-15829 > URL: https://issues.apache.org/jira/browse/SPARK-15829 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.6.1 > Environment: AWS ec2 cluster >Reporter: Andrew Davidson >Priority: Critical > > Hi > I created a cluster using the 
spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > I use the standalone cluster manager. I have a streaming app running in > cluster mode. I notice the master webpage links to the application UI page > are incorrect > It does not look like jira will let me upload images. I'll try and describe > the web pages and the bug > My master is running on > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/ > It has a section marked "applications". If I click on one of the running > application ids I am taken to a page showing "Executor Summary". This page > has a link to the 'application detail UI' the url is > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/ > Notice it thinks the application UI is running on the cluster master. > It is actually running on the same machine as the driver on port 4041. I was > able to reverse engineer the url by noticing the private ip address is part of > the worker id . For example worker-20160322041632-172.31.23.201-34909 > next I went on the aws ec2 console to find the public DNS name for this > machine > http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/ > Kind regards > Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16265) Add option to SparkSubmit to ship driver JRE to YARN
Andrew Duffy created SPARK-16265: Summary: Add option to SparkSubmit to ship driver JRE to YARN Key: SPARK-16265 URL: https://issues.apache.org/jira/browse/SPARK-16265 Project: Spark Issue Type: Improvement Affects Versions: 1.6.2 Reporter: Andrew Duffy Fix For: 2.1.0 Add an option to {{SparkSubmit}} to allow the driver to package up its version of the JRE to be shipped to a YARN cluster. This allows deploying Spark applications whose required Java version need not match one of the versions already installed on the YARN cluster, useful in situations in which the Spark Application developer does not have administrative access over the YARN cluster (e.g. school or corporate environment) but still wants to use certain language features in their code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
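What the proposed option would automate can already be done by hand on YARN: package a JRE as an archive, ship it with `--archives`, and point `JAVA_HOME` at the unpacked directory. A Python sketch that merely assembles those spark-submit arguments (the archive name is invented; the property names follow the existing `spark.yarn.appMasterEnv.*` / `spark.executorEnv.*` conventions, not a flag that existed at the time of this issue):

```python
# Sketch: build the spark-submit arguments that ship a local JRE archive
# to YARN and point both the AM and the executors at the unpacked copy.

def jre_ship_args(jre_archive="jre-1.8.tar.gz"):
    # YARN unpacks the archive next to the container working directory
    # under the alias given after '#'.
    unpacked = jre_archive.split(".tar")[0]
    return [
        "--archives", "%s#%s" % (jre_archive, unpacked),
        "--conf", "spark.yarn.appMasterEnv.JAVA_HOME=./%s" % unpacked,
        "--conf", "spark.executorEnv.JAVA_HOME=./%s" % unpacked,
    ]

print(" ".join(jre_ship_args()))
```

The option discussed in the issue would fold these three pieces into one switch so SparkLauncher users need not manage them manually.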
[jira] [Created] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches
Andrew Or created SPARK-16196: - Summary: Optimize in-memory scan performance using ColumnarBatches Key: SPARK-16196 URL: https://issues.apache.org/jira/browse/SPARK-16196 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or A simple benchmark such as the following reveals inefficiencies in the existing in-memory scan implementation: {code} spark.range(N) .selectExpr("id", "floor(rand() * 1) as k") .createOrReplaceTempView("test") val ds = spark.sql("select count(k), count(id) from test").cache() ds.collect() ds.collect() {code} There are many reasons why caching is slow. The biggest is that compression takes a long time. The second is that there are a lot of virtual function calls in this hot code path since the rows are processed using iterators. Further, the rows are converted to and from ByteBuffers, which are slow to read in general. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
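The iterator-overhead point above can be illustrated outside Spark: row-at-a-time processing pays a dispatch cost per row, while a columnar batch pays it once per chunk. A rough pure-Python sketch of the same aggregation done both ways (the batch size of 256 is arbitrary; this is the shape of the trade-off, not Spark code):

```python
# Row-at-a-time: one loop step and tuple access per row, mimicking
# iterator-based scans. Batch: one pass per column chunk, mimicking
# a ColumnarBatch scan over the "k" column.

def count_via_rows(rows):
    n = 0
    for row in rows:              # per-row dispatch, like next() on an iterator
        if row[1] is not None:
            n += 1
    return n

def count_via_batches(batches):
    # Each batch is a contiguous list of "k" values; one tight loop per chunk.
    return sum(sum(1 for v in batch if v is not None) for batch in batches)

rows = [(i, i % 10) for i in range(1000)]
batches = [[r[1] for r in rows[i:i + 256]] for i in range(0, len(rows), 256)]
print(count_via_rows(rows), count_via_batches(batches))
```

Both paths compute the same count; the columnar layout is what lets real engines skip per-row virtual calls and ByteBuffer decoding.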
[jira] [Commented] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property
[ https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342376#comment-15342376 ] Andrew Or commented on SPARK-15917: --- (1) Yes, right now `spark.executor.instances` doesn't do anything in standalone mode even if it's set so we should support it. (2) There are several options here. Cores and number of executors are inherently conflicting things so ideally we should disallow the setting of both of them. We could throw an exception with a good error message, but that would fail existing apps that do have both of them set. We could just log a warning, but there's a high chance that people just won't see the warning. Both options are fine but I'm slightly in favor of throwing an exception when conflicting configs are set. You might have to dig into the internal scheduling code in Master.scala to support num instances. > Define the number of executors in standalone mode with an easy-to-use property > -- > > Key: SPARK-15917 > URL: https://issues.apache.org/jira/browse/SPARK-15917 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > After stumbling across a few StackOverflow posts around the issue of using a > fixed number of executors in standalone mode (non-YARN), I was wondering if > we could not add an easier way to set this parameter than having to resort to > some calculations based on the number of cores and the memory you have > available on your worker. > For example, let's say I have 8 cores and 30GB of memory available : > - If no option is passed, one executor will be spawned with 8 cores and 1GB > of memory allocated. > - However, if I want to have only *2* executors, and to use 2 cores and 10GB > of memory per executor, I will end up with *3* executors (as the available > memory will limit the number of executors) instead of the 2 I was hoping for. 
> Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I > want, but would it not be easier to add a {{--num-executors}}-like option to > standalone mode to be able to really fine-tune the configuration ? This > option is already available in YARN mode. > From my understanding, I don't see any other option lying around that can > help achieve this. > This seems to be slightly disturbing for newcomers, and standalone mode is > probably the first thing anyone will use to just try out Spark or test some > configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
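The "throw an exception on conflicting configs" option favored in the comment above could look like the following sketch (plain Python, not Spark's actual validation; the reconciliation rule — derive a count from cores.max / executor.cores and compare it to the explicit instance count — is an assumption for illustration):

```python
# Sketch: reject a config where the explicit executor count disagrees with
# the count implied by spark.cores.max / spark.executor.cores.

def check_executor_conf(conf):
    instances = conf.get("spark.executor.instances")
    cores_max = conf.get("spark.cores.max")
    exec_cores = conf.get("spark.executor.cores")
    if instances is not None and cores_max is not None and exec_cores is not None:
        derived = cores_max // exec_cores
        if derived != instances:
            raise ValueError(
                "Conflicting settings: spark.executor.instances=%d but "
                "spark.cores.max/spark.executor.cores implies %d executors"
                % (instances, derived))
    if instances is not None:
        return instances
    if cores_max is not None and exec_cores is not None:
        return cores_max // exec_cores
    return None

# spark.cores.max=8 with 4 cores per executor implies 2 executors, not 4:
try:
    check_executor_conf({"spark.executor.instances": 4,
                         "spark.executor.cores": 4,
                         "spark.cores.max": 8})
except ValueError as e:
    print(e)
```

Failing fast like this surfaces the conflict at submit time instead of silently spawning a different number of executors than the user asked for.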
[jira] [Commented] (SPARK-15917) Define the number of executors in standalone mode with an easy-to-use property
[ https://issues.apache.org/jira/browse/SPARK-15917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342310#comment-15342310 ] Andrew Or commented on SPARK-15917: --- Yeah, I agree. We need to deal with conflicting options, however, e.g. spark.executor.cores=4, spark.executor.instances=4, spark.cores.max=8. [~JonathanTaws] would you like to work on this? > Define the number of executors in standalone mode with an easy-to-use property > -- > > Key: SPARK-15917 > URL: https://issues.apache.org/jira/browse/SPARK-15917 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > After stumbling across a few StackOverflow posts around the issue of using a > fixed number of executors in standalone mode (non-YARN), I was wondering if > we could not add an easier way to set this parameter than having to resort to > some calculations based on the number of cores and the memory you have > available on your worker. > For example, let's say I have 8 cores and 30GB of memory available : > - If no option is passed, one executor will be spawned with 8 cores and 1GB > of memory allocated. > - However, if I want to have only *2* executors, and to use 2 cores and 10GB > of memory per executor, I will end up with *3* executors (as the available > memory will limit the number of executors) instead of the 2 I was hoping for. > Sure, I can set {{spark.cores.max}} as a workaround to get exactly what I > want, but would it not be easier to add a {{--num-executors}}-like option to > standalone mode to be able to really fine-tune the configuration ? This > option is already available in YARN mode. > From my understanding, I don't see any other option lying around that can > help achieve this. > This seems to be slightly disturbing for newcomers, and standalone mode is > probably the first thing anyone will use to just try out Spark or test some > configuration. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16023) Move InMemoryRelation to its own file
Andrew Or created SPARK-16023: - Summary: Move InMemoryRelation to its own file Key: SPARK-16023 URL: https://issues.apache.org/jira/browse/SPARK-16023 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor Just to make InMemoryTableScanExec a little smaller and more readable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15749) Make the error message more meaningful
[ https://issues.apache.org/jira/browse/SPARK-15749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15749. --- Resolution: Fixed Assignee: Huaxin Gao Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Make the error message more meaningful > -- > > Key: SPARK-15749 > URL: https://issues.apache.org/jira/browse/SPARK-15749 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Trivial > Fix For: 2.0.0 > > > For table test1 (C1 varchar (10), C2 varchar (10)), when I insert a row using > sqlContext.sql("insert into test1 values ('abc', 'def', 1)") > I got error message > Exception in thread "main" java.lang.RuntimeException: Relation[C1#0,C2#1] > JDBCRelation(test1) > requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE > statement generates the same number of columns as its schema. > The error message is a little confusing. In my simple insert statement, it > doesn't have a SELECT clause. > I will change the error message to a more general one > Exception in thread "main" java.lang.RuntimeException: Relation[C1#0,C2#1] > JDBCRelation(test1) > requires that the data to be inserted have the same number of columns as the > target table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15868) Executors table in Executors tab should sort Executor IDs in numerical order (not alphabetical order)
[ https://issues.apache.org/jira/browse/SPARK-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15868. --- Resolution: Fixed Assignee: Alex Bozarth Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Executors table in Executors tab should sort Executor IDs in numerical order > (not alphabetical order) > - > > Key: SPARK-15868 > URL: https://issues.apache.org/jira/browse/SPARK-15868 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Alex Bozarth >Priority: Minor > Fix For: 2.0.0 > > Attachments: spark-webui-executors-sorting-2.png, > spark-webui-executors-sorting.png > > > It _appears_ that Executors table in Executors tab sorts Executor IDs in > alphabetical order while it should in numerical. It does sorting in a more > "friendly" way yet driver executor appears between 0 and 1? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
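The underlying fix is a sort key that compares executor IDs numerically and pins the driver entry first, instead of sorting the strings alphabetically (where "10" < "2" and "driver" lands between digits). A minimal Python sketch of such a key (the actual change was in the web UI's table sorting, not this code):

```python
# Sort executor IDs numerically, with the "driver" pseudo-ID always first.
def executor_sort_key(executor_id):
    if executor_id == "driver":
        return (0, 0)           # sorts before every numeric ID
    return (1, int(executor_id))

ids = ["10", "2", "driver", "0", "1"]
print(sorted(ids, key=executor_sort_key))  # driver first, then 0, 1, 2, 10
```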
[jira] [Resolved] (SPARK-15998) Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING
[ https://issues.apache.org/jira/browse/SPARK-15998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15998. --- Resolution: Fixed Fix Version/s: 2.0.0 > Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING > > > Key: SPARK-15998 > URL: https://issues.apache.org/jira/browse/SPARK-15998 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.0.0 > > > HIVE_METASTORE_PARTITION_PRUNING is a public SQLConf. When true, some > predicates will be pushed down into the Hive metastore so that unmatching > partitions can be eliminated earlier. The current default value is false. > So far, the code base does not have such a test case to verify whether this > SQLConf properly works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15998) Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING
[ https://issues.apache.org/jira/browse/SPARK-15998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15998: -- Assignee: Xiao Li > Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING > > > Key: SPARK-15998 > URL: https://issues.apache.org/jira/browse/SPARK-15998 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > HIVE_METASTORE_PARTITION_PRUNING is a public SQLConf. When true, some > predicates will be pushed down into the Hive metastore so that unmatching > partitions can be eliminated earlier. The current default value is false. > So far, the code base does not have such a test case to verify whether this > SQLConf properly works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15975) Improper Popen.wait() return code handling in dev/run-tests
[ https://issues.apache.org/jira/browse/SPARK-15975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15975. --- Resolution: Fixed Fix Version/s: 2.0.0 1.6.2 1.5.3 > Improper Popen.wait() return code handling in dev/run-tests > --- > > Key: SPARK-15975 > URL: https://issues.apache.org/jira/browse/SPARK-15975 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 1.6.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.5.3, 1.6.2, 2.0.0 > > > In dev/run-tests.py there's a line where we effectively do > {code} > retcode = some_popen_instance.wait() > if retcode > 0: > err > # else do nothing > {code} > but this code is subtlety wrong because Popen's return code will be negative > if the child process was terminated by a signal: > https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode > We should change this to {{retcode != 0}} so that we properly error out and > exit due to termination by signal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
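The subtlety above is easy to reproduce: on POSIX, {{Popen.wait()}} returns the negated signal number when the child is killed by a signal, so a {{retcode > 0}} check silently treats that death as success. A minimal demonstration (POSIX only):

```python
import signal
import subprocess

# Start a long-running child and kill it with SIGTERM.
proc = subprocess.Popen(["sleep", "30"])
proc.terminate()                 # sends SIGTERM
retcode = proc.wait()

print(retcode)                   # negative on POSIX: the negated signal number
assert retcode != 0              # the correct check from the fix
assert retcode < 0               # a `retcode > 0` check would miss this failure
```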
[jira] [Resolved] (SPARK-15978) Some improvement of "Show Tables"
[ https://issues.apache.org/jira/browse/SPARK-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15978. --- Resolution: Fixed Assignee: Bo Meng (was: Apache Spark) Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Some improvement of "Show Tables" > - > > Key: SPARK-15978 > URL: https://issues.apache.org/jira/browse/SPARK-15978 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Bo Meng >Assignee: Bo Meng >Priority: Minor > Fix For: 2.0.0 > > > I've found some minor issues in "show tables" command: > 1. In the SessionCatalog.scala, listTables(db: String) method will call > listTables(formatDatabaseName(db), "*") to list all the tables for certain > db, but in the method listTables(db: String, pattern: String), this db name > is formatted once more. So I think we should remove formatDatabaseName() in > the caller. > 2. I suggest to add sort to listTables(db: String) in InMemoryCatalog.scala, > just like listDatabases(). > I will make a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15576) Add back hive tests blacklisted by SPARK-15539
[ https://issues.apache.org/jira/browse/SPARK-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15576: -- Target Version/s: 2.1.0 (was: 2.0.0) > Add back hive tests blacklisted by SPARK-15539 > -- > > Key: SPARK-15576 > URL: https://issues.apache.org/jira/browse/SPARK-15576 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or > > These were removed from HiveCompatibilitySuite. They should be added back to > HiveQuerySuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325001#comment-15325001 ] Andrew Davidson commented on SPARK-15829: - Hi Xin, I ran netstat on my master. I do not think the ports are in use. To submit in cluster mode I use port 6066. If you are using port 7077 you are in client mode. In client mode the application UI will run on the Spark master. In cluster mode the application UI runs on whichever slave the driver is running on. If you notice in my original description, the URL is incorrect: the IP is wrong, the port is correct. Kind regards, Andy
bash-4.2# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:8652 0.0.0.0:* LISTEN 3832/gmetad
tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN 2584/rserver
tcp 0 0 0.0.0.0:36757 0.0.0.0:* LISTEN 2905/java
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 2905/java
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 2144/sshd
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN 2095/cupsd
tcp 0 0 127.0.0.1:7000 0.0.0.0:* LISTEN 6512/python3.4
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 2183/sendmail
tcp 0 0 0.0.0.0:43813 0.0.0.0:* LISTEN 3093/java
tcp 0 0 172.31.22.140:9000 0.0.0.0:* LISTEN 2905/java
tcp 0 0 0.0.0.0:8649 0.0.0.0:* LISTEN 3810/gmond
tcp 0 0 0.0.0.0:50090 0.0.0.0:* LISTEN 3093/java
tcp 0 0 0.0.0.0:8651 0.0.0.0:* LISTEN 3832/gmetad
tcp 0 0 :::8080 :::* LISTEN 23719/java
tcp 0 0 :::8081 :::* LISTEN 5588/java
tcp 0 0 :::172.31.22.140:6066 :::* LISTEN 23719/java
tcp 0 0 :::172.31.22.140:6067 :::* LISTEN 5588/java
tcp 0 0 :::22 :::* LISTEN 2144/sshd
tcp 0 0 ::1:631 :::* LISTEN 2095/cupsd
tcp 0 0 :::19998 :::* LISTEN 3709/java
tcp 0 0 :::1 :::* LISTEN 3709/java
tcp 0 0 :::172.31.22.140:7077 :::* LISTEN 23719/java
tcp 0 0 :::172.31.22.140:7078 :::* LISTEN 5588/java
udp 0 0 0.0.0.0:8649 0.0.0.0:* 3810/gmond
udp 0 0 0.0.0.0:631 0.0.0.0:* 2095/cupsd
udp 0 0 0.0.0.0:38546 0.0.0.0:* 2905/java
udp 0 0 0.0.0.0:68 0.0.0.0:* 1142/dhclient
udp 0 0 172.31.22.140:123 0.0.0.0:* 2168/ntpd
udp 0 0 127.0.0.1:123 0.0.0.0:* 2168/ntpd
udp 0 0 0.0.0.0:123 0.0.0.0:* 2168/ntpd
bash-4.2# > spark master webpage links to application UI broke when running in cluster > mode > --- > > Key: SPARK-15829 > URL: https://issues.apache.org/jira/browse/SPARK-15829 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.6.1 > Environment: AWS ec2 cluster >Reporter: Andrew Davidson >Priority: Critical > > Hi > I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > I use the standalone cluster manager. I have a streaming app running in > cluster mode. I notice the master webpage links to the application UI page > are incorrect > It does not look like JIRA will let me upload images. I'll try and
[jira] [Commented] (SPARK-15867) TABLESAMPLE BUCKET semantics don't match Hive's
[ https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324960#comment-15324960 ] Andrew Or commented on SPARK-15867: --- I think we should fix it, though it looks like it's been an issue for a while. I don't think it was ever documented so it's OK to change the behavior. > TABLESAMPLE BUCKET semantics don't match Hive's > --- > > Key: SPARK-15867 > URL: https://issues.apache.org/jira/browse/SPARK-15867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Andrew Or > > {code} > SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16) > {code} > In Hive, this would select the 3rd bucket out of every 16 buckets there are > in the table. E.g. if the table was clustered by 32 buckets then this would > sample the 3rd and the 19th bucket. (See > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling) > In Spark, however, we simply sample 3/16 of the number of input rows. > Either we don't support it in Spark or do it in a way that's consistent with > Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15867) TABLESAMPLE BUCKET semantics don't match Hive's
[ https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15867: -- Affects Version/s: 1.6.0 > TABLESAMPLE BUCKET semantics don't match Hive's > --- > > Key: SPARK-15867 > URL: https://issues.apache.org/jira/browse/SPARK-15867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Andrew Or > > {code} > SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16) > {code} > In Hive, this would select the 3rd bucket out of every 16 buckets there are > in the table. E.g. if the table was clustered by 32 buckets then this would > sample the 3rd and the 19th bucket. (See > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling) > In Spark, however, we simply sample 3/16 of the number of input rows. > Either we don't support it in Spark or do it in a way that's consistent with > Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15867) TABLESAMPLE BUCKET semantics don't match Hive's
Andrew Or created SPARK-15867: - Summary: TABLESAMPLE BUCKET semantics don't match Hive's Key: SPARK-15867 URL: https://issues.apache.org/jira/browse/SPARK-15867 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or {code} SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16) {code} In Hive, this would select the 3rd bucket out of every 16 buckets there are in the table. E.g. if the table was clustered by 32 buckets then this would sample the 3rd and the 19th bucket. (See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling) In Spark, however, we simply sample 3/16 of the number of input rows. Either we don't support it in Spark or do it in a way that's consistent with Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
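The Hive semantics described in SPARK-15867 fit in a few lines of Python (illustrative only, based on the Hive sampling manual linked above): `TABLESAMPLE (BUCKET x OUT OF y)` on a table clustered into n buckets selects every bucket congruent to x modulo y, not an x/y fraction of rows.

```python
def hive_sampled_buckets(x, y, n):
    """Buckets (1-indexed) selected by TABLESAMPLE (BUCKET x OUT OF y)
    on a table clustered into n buckets, per Hive's documented semantics."""
    return [b for b in range(1, n + 1) if (b - x) % y == 0]

# BUCKET 3 OUT OF 16 on a table clustered by 32 buckets samples
# the 3rd and the 19th bucket, matching the example in the issue.
print(hive_sampled_buckets(3, 16, 32))  # [3, 19]
```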
[jira] [Created] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
Andrew Davidson created SPARK-15829: --- Summary: spark master webpage links to application UI broke when running in cluster mode Key: SPARK-15829 URL: https://issues.apache.org/jira/browse/SPARK-15829 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.6.1 Environment: AWS ec2 cluster Reporter: Andrew Davidson Priority: Critical Hi I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 I use the standalone cluster manager. I have a streaming app running in cluster mode. I notice the master webpage links to the application UI page are incorrect. It does not look like JIRA will let me upload images. I'll try and describe the web pages and the bug. My master is running on http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/ It has a section marked "applications". If I click on one of the running application ids I am taken to a page showing "Executor Summary". This page has a link to the 'application detail UI'; the URL is http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/ Notice it thinks the application UI is running on the cluster master. It is actually running on the same machine as the driver on port 4041. I was able to reverse engineer the URL by noticing the private IP address is part of the worker id. For example, worker-20160322041632-172.31.23.201-34909. Next I went to the AWS EC2 console to find the public DNS name for this machine: http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/ Kind regards Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-15722) Wrong data when CTAS specifies schema
[ https://issues.apache.org/jira/browse/SPARK-15722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15722: -- Comment: was deleted (was: User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/13457) > Wrong data when CTAS specifies schema > - > > Key: SPARK-15722 > URL: https://issues.apache.org/jira/browse/SPARK-15722 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > {code} > scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV") > scala> (1 to 3).map { i => (i, i * 2, i * 3) }.toDF("height", "length", > "width").write.insertInto("boxes") > scala> spark.table("boxes").show() > +-+--+--+ > |width|length|height| > +-+--+--+ > |1| 2| 3| > |2| 4| 6| > |3| 6| 9| > +-+--+--+ > scala> sql("CREATE TABLE blocks (name STRING, age INT) AS SELECT * FROM > boxes") > scala> spark.table("blocks").show() > ++---+ > |name|age| > ++---+ > | 1| 2| > | 2| 4| > | 3| 6| > ++---+ > {code} > The columns don't even match in types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15736) Gracefully handle loss of DiskStore files
[ https://issues.apache.org/jira/browse/SPARK-15736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15736. --- Resolution: Fixed Fix Version/s: 2.0.0 1.6.2 > Gracefully handle loss of DiskStore files > - > > Key: SPARK-15736 > URL: https://issues.apache.org/jira/browse/SPARK-15736 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.2, 2.0.0 > > > If an RDD partition is cached on disk and the DiskStore file is lost, then > reads of that cached partition will fail and the missing partition is > supposed to be recomputed by a new task attempt. In the current BlockManager > implementation, however, the missing file does not trigger any metadata > updates / does not invalidate the cache, so subsequent task attempts will be > scheduled on the same executor and the doomed read will be repeatedly > retried, leading to repeated task failures and eventually a total job failure. > In order to fix this problem, the executor with the missing file needs to > properly mark the corresponding block as missing so that it stops advertising > itself as a cache location for that block. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
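The behavior SPARK-15736 asks for can be sketched with a toy Python registry (this is not the BlockManager API; all names here are illustrative): a failed disk read must deregister the block so the executor stops advertising itself as a cache location and the partition is recomputed elsewhere, instead of the doomed read being retried forever.

```python
class ToyBlockStore:
    """Toy model of the fix: a missing file invalidates cache metadata."""

    def __init__(self):
        self._files = {}          # block_id -> bytes, simulating DiskStore
        self._advertised = set()  # blocks this executor claims to cache

    def put(self, block_id, data):
        self._files[block_id] = data
        self._advertised.add(block_id)

    def lose_file(self, block_id):
        # Simulates the on-disk file disappearing out from under us.
        self._files.pop(block_id, None)

    def get(self, block_id):
        if block_id not in self._files:
            # Key point: mark the block as missing rather than failing
            # repeatedly, so the scheduler recomputes it on another node.
            self._advertised.discard(block_id)
            return None
        return self._files[block_id]

store = ToyBlockStore()
store.put("rdd_0_0", b"partition bytes")
store.lose_file("rdd_0_0")
store.get("rdd_0_0")                    # the read fails once...
print("rdd_0_0" in store._advertised)   # ...and the cache location is gone
```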
[jira] [Resolved] (SPARK-15718) better error message for writing bucketing data
[ https://issues.apache.org/jira/browse/SPARK-15718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15718. --- Resolution: Fixed Fix Version/s: 2.0.0 > better error message for writing bucketing data > --- > > Key: SPARK-15718 > URL: https://issues.apache.org/jira/browse/SPARK-15718 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-15722) Wrong data when CTAS specifies schema
[ https://issues.apache.org/jira/browse/SPARK-15722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15722: -- Comment: was deleted (was: User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/13457) > Wrong data when CTAS specifies schema > - > > Key: SPARK-15722 > URL: https://issues.apache.org/jira/browse/SPARK-15722 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > {code} > scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV") > scala> (1 to 3).map { i => (i, i * 2, i * 3) }.toDF("height", "length", > "width").write.insertInto("boxes") > scala> spark.table("boxes").show() > +-+--+--+ > |width|length|height| > +-+--+--+ > |1| 2| 3| > |2| 4| 6| > |3| 6| 9| > +-+--+--+ > scala> sql("CREATE TABLE blocks (name STRING, age INT) AS SELECT * FROM > boxes") > scala> spark.table("blocks").show() > ++---+ > |name|age| > ++---+ > | 1| 2| > | 2| 4| > | 3| 6| > ++---+ > {code} > The columns don't even match in types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15711) Ban CREATE TEMP TABLE USING AS SELECT for now
[ https://issues.apache.org/jira/browse/SPARK-15711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15711. --- Resolution: Fixed Fix Version/s: 2.0.0 > Ban CREATE TEMP TABLE USING AS SELECT for now > - > > Key: SPARK-15711 > URL: https://issues.apache.org/jira/browse/SPARK-15711 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Sean Zhong >Priority: Critical > Fix For: 2.0.0 > > > CREATE TEMP TABLE USING AS SELECT is ill-defined. It requires that user to > specify the location and the temp data is not cleaned up when the session > exits. Before we fix it, I'd propose that we ban this command. I will create > a jira with description on proper temp table support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15646) When spark.sql.hive.convertCTAS is true, we may still convert the table to a parquet table when TEXTFILE or SEQUENCEFILE is specified.
[ https://issues.apache.org/jira/browse/SPARK-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15646: -- Assignee: Yin Huai > When spark.sql.hive.convertCTAS is true, we may still convert the table to a > parquet table when TEXTFILE or SEQUENCEFILE is specified. > -- > > Key: SPARK-15646 > URL: https://issues.apache.org/jira/browse/SPARK-15646 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > When {{spark.sql.hive.convertCTAS}} is true, we try to convert the table to a > parquet table if the user does not specify any storage format. However, we > only check serde, which causes us to still convert the table when > TEXTFILE/SEQUENCEFILE is specified and a serde is not provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15646) When spark.sql.hive.convertCTAS is true, we may still convert the table to a parquet table when TEXTFILE or SEQUENCEFILE is specified.
[ https://issues.apache.org/jira/browse/SPARK-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15646. --- Resolution: Fixed Fix Version/s: 2.0.0 > When spark.sql.hive.convertCTAS is true, we may still convert the table to a > parquet table when TEXTFILE or SEQUENCEFILE is specified. > -- > > Key: SPARK-15646 > URL: https://issues.apache.org/jira/browse/SPARK-15646 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > Fix For: 2.0.0 > > > When {{spark.sql.hive.convertCTAS}} is true, we try to convert the table to a > parquet table if the user does not specify any storage format. However, we > only check serde, which causes us to still convert the table when > TEXTFILE/SEQUENCEFILE is specified and a serde is not provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15722) Wrong data when CTAS specifies schema
Andrew Or created SPARK-15722: - Summary: Wrong data when CTAS specifies schema Key: SPARK-15722 URL: https://issues.apache.org/jira/browse/SPARK-15722 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or
{code}
scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV")
scala> (1 to 3).map { i => (i, i * 2, i * 3) }.toDF("height", "length", "width").write.insertInto("boxes")
scala> spark.table("boxes").show()
+-----+------+------+
|width|length|height|
+-----+------+------+
|    1|     2|     3|
|    2|     4|     6|
|    3|     6|     9|
+-----+------+------+
scala> sql("CREATE TABLE blocks (name STRING, age INT) AS SELECT * FROM boxes")
scala> spark.table("blocks").show()
+----+---+
|name|age|
+----+---+
|   1|  2|
|   2|  4|
|   3|  6|
+----+---+
{code}
The columns don't even match in types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15715) Altering partition storage information doesn't work in Hive
Andrew Or created SPARK-15715: - Summary: Altering partition storage information doesn't work in Hive Key: SPARK-15715 URL: https://issues.apache.org/jira/browse/SPARK-15715 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or In HiveClientImpl {code} private def toHivePartition( p: CatalogTablePartition, ht: HiveTable): HivePartition = { new HivePartition(ht, p.spec.asJava, p.storage.locationUri.map { l => new Path(l) }.orNull) } {code} Other than the location, we don't even store any of the storage information in the metastore: output format, input format, serde, serde props. The result is that doing something like the following doesn't actually do anything: {code} ALTER TABLE boxes PARTITION (width=3) SET SERDE 'com.sparkbricks.serde.ColumnarSerDe' WITH SERDEPROPERTIES ('compress'='true') {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15711) Ban CREATE TEMP TABLE USING AS SELECT for now
[ https://issues.apache.org/jira/browse/SPARK-15711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15711: -- Assignee: Sean Zhong > Ban CREATE TEMP TABLE USING AS SELECT for now > - > > Key: SPARK-15711 > URL: https://issues.apache.org/jira/browse/SPARK-15711 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Sean Zhong >Priority: Critical > > CREATE TEMP TABLE USING AS SELECT is ill-defined. It requires that user to > specify the location and the temp data is not cleaned up when the session > exits. Before we fix it, I'd propose that we ban this command. I will create > a jira with description on proper temp table support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15236. --- Resolution: Fixed Fix Version/s: 2.0.0 > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Xin Wu > Fix For: 2.0.0 > > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15618. --- Resolution: Fixed Fix Version/s: 2.0.0 > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15236: -- Assignee: Xin Wu > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Xin Wu > Fix For: 2.0.0 > > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15670) Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[ https://issues.apache.org/jira/browse/SPARK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15670: -- Assignee: Weichen Xu > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class > -- > > Key: SPARK-15670 > URL: https://issues.apache.org/jira/browse/SPARK-15670 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15670) Add deprecated annotation for accumulator V1 interface in JavaSparkContext class
[ https://issues.apache.org/jira/browse/SPARK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15670. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class > -- > > Key: SPARK-15670 > URL: https://issues.apache.org/jira/browse/SPARK-15670 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Reporter: Weichen Xu >Priority: Minor > Fix For: 2.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Add deprecated annotation for accumulator V1 interface in JavaSparkContext class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15662) Add since annotation for classes in sql.catalog
[ https://issues.apache.org/jira/browse/SPARK-15662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15662. --- Resolution: Fixed Fix Version/s: 2.0.0 > Add since annotation for classes in sql.catalog > --- > > Key: SPARK-15662 > URL: https://issues.apache.org/jira/browse/SPARK-15662 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15635) ALTER TABLE RENAME doesn't work for datasource tables
Andrew Or created SPARK-15635: - Summary: ALTER TABLE RENAME doesn't work for datasource tables Key: SPARK-15635 URL: https://issues.apache.org/jira/browse/SPARK-15635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or {code} scala> sql("CREATE TABLE students (age INT, name STRING) USING parquet") scala> sql("ALTER TABLE students RENAME TO teachers") scala> spark.table("teachers").show() com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/andrew/Documents/dev/spark/andrew-spark/spark-warehouse/students; at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:170) at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:67) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:583) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:579) ... 48 elided Caused by: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/andrew/Documents/dev/spark/andrew-spark/spark-warehouse/students; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:351) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:340) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15450) Clean up SparkSession builder for python
[ https://issues.apache.org/jira/browse/SPARK-15450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15450. --- Resolution: Fixed Fix Version/s: 2.0.0 > Clean up SparkSession builder for python > > > Key: SPARK-15450 > URL: https://issues.apache.org/jira/browse/SPARK-15450 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Eric Liang > Fix For: 2.0.0 > > > This is the sister JIRA for SPARK-15075. Today we use > `SQLContext.getOrCreate` in our builder. Instead we should just have a real > `SparkSession.getOrCreate` and use that in our builder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15534) TRUNCATE TABLE should throw exceptions, not logError
[ https://issues.apache.org/jira/browse/SPARK-15534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15534. --- Resolution: Fixed Fix Version/s: 2.0.0 > TRUNCATE TABLE should throw exceptions, not logError > > > Key: SPARK-15534 > URL: https://issues.apache.org/jira/browse/SPARK-15534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > If the table to truncate doesn't exist, throw an exception! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15535) Remove code for TRUNCATE TABLE ... COLUMN
[ https://issues.apache.org/jira/browse/SPARK-15535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15535. --- Resolution: Fixed Fix Version/s: 2.0.0 > Remove code for TRUNCATE TABLE ... COLUMN > - > > Key: SPARK-15535 > URL: https://issues.apache.org/jira/browse/SPARK-15535 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > This was never supported in the first place. Also Hive doesn't support it: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15450) Clean up SparkSession builder for python
[ https://issues.apache.org/jira/browse/SPARK-15450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15450: -- Assignee: Eric Liang (was: Andrew Or) > Clean up SparkSession builder for python > > > Key: SPARK-15450 > URL: https://issues.apache.org/jira/browse/SPARK-15450 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Eric Liang > Fix For: 2.0.0 > > > This is the sister JIRA for SPARK-15075. Today we use > `SQLContext.getOrCreate` in our builder. Instead we should just have a real > `SparkSession.getOrCreate` and use that in our builder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304506#comment-15304506 ] Andrew Or commented on SPARK-15618: --- it needs to be internal. At least it should be private[spark] > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15569) Executors spending significant time in DiskObjectWriter.updateBytesWritten function
[ https://issues.apache.org/jira/browse/SPARK-15569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15569. --- Resolution: Fixed Assignee: Sital Kedia Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Executors spending significant time in DiskObjectWriter.updateBytesWritten > function > --- > > Key: SPARK-15569 > URL: https://issues.apache.org/jira/browse/SPARK-15569 > Project: Spark > Issue Type: Bug > Components: Shuffle >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 2.0.0 > > > While profiling a Spark job that spills a large amount of intermediate data, we > found that a significant portion of time is spent in the > DiskObjectWriter.updateBytesWritten function. Looking at the code > (https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala#L206), > we see that the function is called too frequently to update the number > of bytes written to disk. We should reduce the call frequency to avoid this overhead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
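The fix direction described in SPARK-15569 (reducing how often the bytes-written metric is updated) can be sketched in plain Scala. This is an illustrative analogue, not the actual Spark patch; the class and field names below are invented for the example:

```scala
// Batch metric updates instead of invoking the expensive callback per record.
// `ThrottledWriteMetrics` and its members are illustrative names, not Spark's.
class ThrottledWriteMetrics(updateEvery: Int = 16384) {
  private var pendingBytes: Long = 0L
  private var recordsSinceUpdate: Int = 0
  var reportedBytes: Long = 0L // what the (hypothetical) metrics system sees
  var updateCalls: Int = 0     // how often the expensive update actually ran

  def recordWritten(bytes: Long): Unit = {
    pendingBytes += bytes
    recordsSinceUpdate += 1
    if (recordsSinceUpdate >= updateEvery) flush()
  }

  // The analogue of updateBytesWritten: runs once per batch, not per record.
  def flush(): Unit = {
    if (pendingBytes > 0) {
      reportedBytes += pendingBytes
      updateCalls += 1
      pendingBytes = 0
      recordsSinceUpdate = 0
    }
  }
}

val m = new ThrottledWriteMetrics(updateEvery = 100)
(1 to 1000).foreach(_ => m.recordWritten(8))
m.flush() // final flush so no bytes are left unreported
```

With 1000 records and a batch size of 100, the update runs 10 times instead of 1000 while reporting the same total. The trade-off is that the metric lags by up to one batch between flushes.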
[jira] [Updated] (SPARK-15599) Document createDataset functions in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15599: -- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 Component/s: Documentation > Document createDataset functions in SparkSession > > > Key: SPARK-15599 > URL: https://issues.apache.org/jira/browse/SPARK-15599 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15599) Document createDataset functions in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15599. --- Resolution: Fixed Fix Version/s: 2.0.0 > Document createDataset functions in SparkSession > > > Key: SPARK-15599 > URL: https://issues.apache.org/jira/browse/SPARK-15599 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15599) Document createDataset functions in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15599: -- Assignee: Sameer Agarwal > Document createDataset functions in SparkSession > > > Key: SPARK-15599 > URL: https://issues.apache.org/jira/browse/SPARK-15599 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15584. --- Resolution: Fixed Fix Version/s: 2.0.0 > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
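One possible shape for the SPARK-15584 fix is to centralize the "spark.sql.sources." keys as constants, so a typo becomes a compile error instead of a silent failure. The object and key names below are illustrative, not the actual identifiers merged into Spark:

```scala
// Centralized property keys; reference the constant, never retype the string.
// `DataSourceProperties` and its members are illustrative names.
object DataSourceProperties {
  val Prefix = "spark.sql.sources."
  val Provider = Prefix + "provider"
  val NumParts = Prefix + "numParts"
}

// Callers use the constants on both the write and read sides, so the two
// sides cannot drift apart via a typo:
val props = Map(DataSourceProperties.Provider -> "parquet")
val provider = props.get(DataSourceProperties.Provider)
```

A misspelled constant name fails to compile, whereas a misspelled string literal would just make `props.get` return `None` at runtime.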
[jira] [Updated] (SPARK-15603) Replace SQLContext with SparkSession in ML/MLLib
[ https://issues.apache.org/jira/browse/SPARK-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15603: -- Fix Version/s: 2.0.0 > Replace SQLContext with SparkSession in ML/MLLib > > > Key: SPARK-15603 > URL: https://issues.apache.org/jira/browse/SPARK-15603 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > This issue replaces all deprecated `SQLContext` occurrences with > `SparkSession` in the `ML/MLLib` module, except the following two classes, which > take `SQLContext` as a function argument. > - ReadWrite.scala > - TreeModels.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15603) Replace SQLContext with SparkSession in ML/MLLib
[ https://issues.apache.org/jira/browse/SPARK-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15603: -- Assignee: Dongjoon Hyun > Replace SQLContext with SparkSession in ML/MLLib > > > Key: SPARK-15603 > URL: https://issues.apache.org/jira/browse/SPARK-15603 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > This issue replaces all deprecated `SQLContext` occurrences with > `SparkSession` in the `ML/MLLib` module, except the following two classes, which > take `SQLContext` as a function argument. > - ReadWrite.scala > - TreeModels.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
[ https://issues.apache.org/jira/browse/SPARK-15618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15618: -- Priority: Minor (was: Major) > Use SparkSession.builder.sparkContext(...) in tests where possible > -- > > Key: SPARK-15618 > URL: https://issues.apache.org/jira/browse/SPARK-15618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > > There are many places where we could be more explicit about the particular > underlying SparkContext we want, but we just do > `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15603) Replace SQLContext with SparkSession in ML/MLLib
[ https://issues.apache.org/jira/browse/SPARK-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15603: -- Affects Version/s: 2.0.0 > Replace SQLContext with SparkSession in ML/MLLib > > > Key: SPARK-15603 > URL: https://issues.apache.org/jira/browse/SPARK-15603 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > This issue replaces all deprecated `SQLContext` occurrences with > `SparkSession` in the `ML/MLLib` module, except the following two classes, which > take `SQLContext` as a function argument. > - ReadWrite.scala > - TreeModels.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15618) Use SparkSession.builder.sparkContext(...) in tests where possible
Andrew Or created SPARK-15618: - Summary: Use SparkSession.builder.sparkContext(...) in tests where possible Key: SPARK-15618 URL: https://issues.apache.org/jira/browse/SPARK-15618 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Dongjoon Hyun There are many places where we could be more explicit about the particular underlying SparkContext we want, but we just do `SparkSession.builder.getOrCreate()` anyway. It's better to be clearer in the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
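The point of SPARK-15618 (an explicit `sparkContext(...)` call is clearer than a bare `getOrCreate()`) can be illustrated with a minimal plain-Scala builder analogue. These are stand-in classes, not the real `SparkSession.Builder` API:

```scala
// Minimal analogue of the builder pattern in question. `FakeContext` and
// `SessionBuilder` are illustrative stand-ins for SparkContext and
// SparkSession.Builder; they are not Spark classes.
class FakeContext(val name: String)

class SessionBuilder {
  private var ctx: Option[FakeContext] = None

  // Explicitly pin the underlying context, as the JIRA recommends for tests.
  def sparkContext(c: FakeContext): SessionBuilder = { ctx = Some(c); this }

  // Without an explicit context, fall back to creating one implicitly --
  // this is the ambiguity the JIRA wants tests to avoid.
  def getOrCreate(): FakeContext =
    ctx.getOrElse(new FakeContext("implicitly-created"))
}

val existing = new FakeContext("test-context")
val pinned = new SessionBuilder().sparkContext(existing).getOrCreate()
val implicitOne = new SessionBuilder().getOrCreate()
```

With the explicit call, a test knows exactly which context the session wraps; with the bare `getOrCreate()`, the result depends on ambient state.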
[jira] [Resolved] (SPARK-15536) Disallow TRUNCATE TABLE with external tables and views
[ https://issues.apache.org/jira/browse/SPARK-15536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15536. --- Resolution: Fixed Fix Version/s: 2.0.0 > Disallow TRUNCATE TABLE with external tables and views > -- > > Key: SPARK-15536 > URL: https://issues.apache.org/jira/browse/SPARK-15536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > Otherwise we might accidentally delete existing data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15538) Truncate table does not work on data source table
[ https://issues.apache.org/jira/browse/SPARK-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15538. --- Resolution: Fixed Fix Version/s: 2.0.0 > Truncate table does not work on data source table > - > > Key: SPARK-15538 > URL: https://issues.apache.org/jira/browse/SPARK-15538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Suresh Thalamati >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > > > Truncate table does not seem to work on data source tables. > Repro: > {code} > val df = Seq((1, "john", "CA"), (2, "Mike", "NY"), (3, "Robert", > "CA")).toDF("id", "name", "state") > df.write.format("parquet").partitionBy("state").saveAsTable("emp") > scala> sql("truncate table emp") > res8: org.apache.spark.sql.DataFrame = [] > scala> sql("select * from emp").show() // FileNotFoundException > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15596) ALTER TABLE RENAME needs to uncache query
Andrew Or created SPARK-15596: - Summary: ALTER TABLE RENAME needs to uncache query Key: SPARK-15596 URL: https://issues.apache.org/jira/browse/SPARK-15596 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15584: -- Assignee: Dongjoon Hyun > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dongjoon Hyun >Priority: Minor > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15594) ALTER TABLE ... SERDEPROPERTIES does not respect partition spec
Andrew Or created SPARK-15594: - Summary: ALTER TABLE ... SERDEPROPERTIES does not respect partition spec Key: SPARK-15594 URL: https://issues.apache.org/jira/browse/SPARK-15594 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or {code} case class AlterTableSerDePropertiesCommand( tableName: TableIdentifier, serdeClassName: Option[String], serdeProperties: Option[Map[String, String]], partition: Option[Map[String, String]]) extends RunnableCommand { {code} The `partition` flag is not read anywhere! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
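What honoring the unread `partition` argument in SPARK-15594 could look like can be sketched in plain Scala: when a partition spec is given, update only the matching partition's storage; otherwise fall back to the table-level storage. The case classes and helper below are simplified stand-ins, not Spark's actual catalog types:

```scala
// Simplified stand-ins for the catalog types; not Spark's real classes.
case class Storage(serde: Option[String], properties: Map[String, String])
case class Partition(spec: Map[String, String], storage: Storage)
case class Table(storage: Storage, partitions: Seq[Partition])

// Illustrative helper: the key point is that `partitionSpec` is actually read.
def alterSerdeProps(
    table: Table,
    newProps: Map[String, String],
    partitionSpec: Option[Map[String, String]]): Table = {
  partitionSpec match {
    case Some(spec) =>
      // Update only the partition whose spec matches; leave the rest alone.
      val updated = table.partitions.map { p =>
        if (p.spec == spec)
          p.copy(storage =
            p.storage.copy(properties = p.storage.properties ++ newProps))
        else p
      }
      table.copy(partitions = updated)
    case None =>
      // No spec: the properties apply to the table itself.
      table.copy(storage =
        table.storage.copy(properties = table.storage.properties ++ newProps))
  }
}

val t = Table(
  Storage(None, Map.empty),
  Seq(Partition(Map("state" -> "CA"), Storage(None, Map.empty)),
      Partition(Map("state" -> "NY"), Storage(None, Map.empty))))
val altered =
  alterSerdeProps(t, Map("field.delim" -> ","), Some(Map("state" -> "CA")))
```

In the buggy version the `Some(spec)` branch effectively did not exist, so a partition-scoped ALTER silently modified table-level storage instead.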
[jira] [Created] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
Andrew Or created SPARK-15584: - Summary: Abstract duplicate code: "spark.sql.sources." properties Key: SPARK-15584 URL: https://issues.apache.org/jira/browse/SPARK-15584 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" etc. everywhere. If we mistype something then things will silently fail. This is pretty brittle. It would be better if we had static variables that we can reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15584: -- Issue Type: Improvement (was: Bug) > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303054#comment-15303054 ] Andrew Or commented on SPARK-15584: --- [~dongjoon] would you like to work on this? > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15584: -- Assignee: (was: Andrew Or) > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15584) Abstract duplicate code: "spark.sql.sources." properties
[ https://issues.apache.org/jira/browse/SPARK-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15584: -- Priority: Minor (was: Major) > Abstract duplicate code: "spark.sql.sources." properties > > > Key: SPARK-15584 > URL: https://issues.apache.org/jira/browse/SPARK-15584 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Priority: Minor > > Right now we have "spark.sql.sources.provider", "spark.sql.sources.numParts" > etc. everywhere. If we mistype something then things will silently fail. This > is pretty brittle. It would be better if we had static variables that we can > reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15583) Relax ALTER TABLE properties restriction for data source tables
Andrew Or created SPARK-15583: - Summary: Relax ALTER TABLE properties restriction for data source tables Key: SPARK-15583 URL: https://issues.apache.org/jira/browse/SPARK-15583 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or Right now we simply reject ALTER TABLE SET TBLPROPERTIES on data source tables for all properties. This is overly restrictive; as long as the user doesn't touch anything in the special namespace (spark.sql.sources.*), we're OK. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
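The relaxed check proposed in SPARK-15583 amounts to rejecting only keys in the reserved namespace rather than all of them. A plain-Scala sketch of that rule (the helper name is illustrative, not Spark's actual code):

```scala
// Reserved namespace for data source table metadata.
val ReservedPrefix = "spark.sql.sources."

// Illustrative helper: return the keys a user must not set.
// An empty result means the ALTER TABLE SET TBLPROPERTIES can be allowed.
def findForbiddenKeys(props: Map[String, String]): Seq[String] =
  props.keys.filter(_.startsWith(ReservedPrefix)).toSeq.sorted

// Ordinary user properties pass; reserved keys are flagged.
val ok = findForbiddenKeys(Map("owner" -> "alice", "comment" -> "test"))
val bad = findForbiddenKeys(Map("spark.sql.sources.provider" -> "json"))
```

Under this rule, `SET TBLPROPERTIES ('owner'='alice')` succeeds on a data source table, while an attempt to overwrite `spark.sql.sources.provider` is still rejected.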