[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388399#comment-16388399 ] Jose Torres commented on SPARK-23325: - How hard would it be to just declare that InternalRow is stable? The file has been touched about once per year for the past 3 years, and I doubt we'd be able to change it to any significant degree without risking serious regressions. From my perspective, and I think (but correct me if I'm wrong) the perspective of the SPIP, a stable interface which can match the performance of Spark's internal data sources is one of the core goals of DataSourceV2. If high-performance sources must implement InternalRow reads and writes, then DataSourceV2 isn't stable until InternalRow is stable anyway.

> DataSourceV2 readers should always produce InternalRow.
> -------------------------------------------------------
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Ryan Blue
> Priority: Major
>
> DataSourceV2 row-oriented implementations are limited to producing either {{Row}} instances or {{UnsafeRow}} instances by implementing {{SupportsScanUnsafeRow}}. Instead, I think that implementations should always produce {{InternalRow}}.
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither one is appropriate for implementers.
> File formats don't produce {{Row}} instances or the data values used by {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation that uses {{Row}} instances must produce data that is immediately translated from the representation that was just produced by Spark. In my experience, it made little sense to translate a timestamp in microseconds to a (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass that instance to Spark for immediate translation back.
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is already held in memory. Even the Parquet support built into Spark deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce unsafe rows. When I went to build an implementation that deserializes Parquet or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be done without first deserializing into memory, because the size of an array must be known before any values are written.
> I ended up deciding to deserialize to {{InternalRow}} and use {{UnsafeProjection}} to convert to unsafe. There are two problems with this: first, this is Scala and was difficult to call from Java (it required reflection), and second, this causes double projection in the physical plan (a copy from unsafe to unsafe) if there is a projection that wasn't fully pushed to the data source.
> I think the solution is to have a single interface for readers that expects {{InternalRow}}. Then, a projection should be added in the Spark plan to convert to unsafe, avoiding projection both in the plan and in the data source. If the data source already produces unsafe rows by deserializing directly, this still minimizes the number of copies because the unsafe projection will check whether the incoming data is already {{UnsafeRow}}.
> Using {{InternalRow}} would also match the interface on the write side.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
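The timestamp round trip the description complains about can be sketched in plain Java. This is an illustrative sketch only (the helper names are invented; Spark's real conversions live in its internal utilities): a Row-based reader holding a timestamp as epoch microseconds must split it into a (milliseconds, nanoseconds) pair to build a java.sql.Timestamp, which Spark then immediately converts back to microseconds.

```java
import java.sql.Timestamp;

public class TimestampRoundTrip {
    // Epoch microseconds -> java.sql.Timestamp, as a Row-based reader must do.
    static Timestamp fromMicros(long micros) {
        Timestamp ts = new Timestamp(micros / 1000L);        // whole milliseconds
        ts.setNanos((int) ((micros % 1_000_000L) * 1000L));  // full sub-second part in nanos
        return ts;
    }

    // Timestamp -> epoch microseconds, the representation Spark uses internally.
    static long toMicros(Timestamp ts) {
        return ts.getTime() / 1000L * 1_000_000L + ts.getNanos() / 1000L;
    }

    public static void main(String[] args) {
        long micros = 1_520_400_000_123_456L;
        // The whole exercise is a no-op round trip, which is the point of the complaint.
        System.out.println(toMicros(fromMicros(micros)) == micros);
    }
}
```

The value is unchanged after the round trip; the only effect is object allocation and arithmetic on every row.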
[jira] [Commented] (SPARK-23613) Different Analyzed logical plan data types for the same table in different queries
[ https://issues.apache.org/jira/browse/SPARK-23613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388494#comment-16388494 ] Ramandeep Singh commented on SPARK-23613: - To add to it, the query works fine with subquery factoring:

with b1 as (select b.* from b) select * from jq (select a.col1, b.col2 from a, b1 where a.col3 = b1.col3)

> Different Analyzed logical plan data types for the same table in different queries
> ----------------------------------------------------------------------------------
>
> Key: SPARK-23613
> URL: https://issues.apache.org/jira/browse/SPARK-23613
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Environment: Spark 2.2.0, Hive 2
> Reporter: Ramandeep Singh
> Priority: Blocker
> Labels: SparkSQL
>
> Hi,
> The column datatypes are correctly analyzed for a simple select query. Note that the problematic column is not selected anywhere in the complicated scenario.
> Let's say: select * from a;
> Now let's say there is a query involving a temporary view on another table and its join with this table. Let's call that table b (a temporary view on a dataframe):
> select * from jq (select a.col1, b.col2 from a, b where a.col3 = b.col3)
> This fails with an exception on a column that is not part of the projection in the join query:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `a`.col5 from decimal(8,0) to col5#1234: decimal(6,2) as it may truncate.
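The "may truncate" error quoted above comes from Spark's safe up-cast check on decimals. As a rough illustration (this is a simplification with invented names, not Spark's actual implementation): a decimal(p, s) can hold at most p − s integer digits, so decimal(8,0) values can carry 8 integer digits while decimal(6,2) has room for only 4.

```java
public class UpCastCheck {
    // Simplified "safe up-cast" test for decimal types: the target must have
    // room for every integer digit and every fractional digit the source can
    // produce; otherwise the cast may truncate and is rejected.
    static boolean canUpCast(int srcPrecision, int srcScale,
                             int dstPrecision, int dstScale) {
        int srcIntegerDigits = srcPrecision - srcScale;
        int dstIntegerDigits = dstPrecision - dstScale;
        return srcIntegerDigits <= dstIntegerDigits && srcScale <= dstScale;
    }

    public static void main(String[] args) {
        // decimal(8,0) -> decimal(6,2): 8 integer digits into 4 -> may truncate.
        System.out.println(canUpCast(8, 0, 6, 2));
        // decimal(4,0) -> decimal(6,2): 4 integer digits into 4 -> safe.
        System.out.println(canUpCast(4, 0, 6, 2));
    }
}
```

This is why the analyzer refuses the cast even though col5 is never selected: the analyzed schema for the view disagrees with the table's schema, triggering the check.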
[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388230#comment-16388230 ] Marcelo Vanzin commented on SPARK-23607: I think this is a nice trick to speed things up, even though it only works for HDFS. I have some ideas for a more generic speed-up in this code; I just haven't had the time to sit down and try them out, but this could help in the meantime.

> Use HDFS extended attributes to store application summary to improve the Spark History Server performance
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, Web UI
> Affects Versions: 2.3.0
> Reporter: Ye Zhou
> Priority: Major
> Fix For: 2.4.0
>
> Currently in the Spark History Server, the checkForLogs thread creates replay tasks for log files whose size has changed. The replay task filters out most of the log file content and keeps the application summary: applicationId, user, attempt ACLs, start time, and end time. This summary is written into listing.ldb and serves the application list on the SHS home page. For a long-running application, the log file whose name ends with "inprogress" is replayed multiple times to extract this summary. This wastes compute and data-reading resources in the SHS and delays applications showing up on the home page. Internally we have a patch which uses HDFS extended attributes to improve the performance of obtaining the application summary in the SHS. With this patch, the driver writes the application summary into extended attributes as key/value pairs. The SHS first tries to read from the extended attributes; if that fails, it falls back to reading the log file content as usual. The feature can be enabled/disabled through configuration.
> This patch has been running fine internally for 4 months, and the last-updated timestamp on the SHS stays within 1 minute, as we configure the interval to 1 minute. Originally we saw delays of up to 30 minutes at our scale, with a large number of Spark applications running per day.
> We want to see whether this kind of approach is also acceptable to the community. Please comment. If so, I will post a pull request for the changes. Thanks.
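The read path proposed above is a "fast path with fallback" pattern. A minimal sketch with invented names (the actual patch would read HDFS extended attributes via the Hadoop FileSystem API and fall back to replaying the event log):

```java
import java.util.Optional;
import java.util.function.Supplier;

public class SummaryReader {
    // Try the cheap source (extended attributes) first; if it is absent or
    // unreadable, fall back to the expensive full event-log replay.
    static String applicationSummary(Supplier<Optional<String>> readXAttrs,
                                     Supplier<String> replayEventLog) {
        return readXAttrs.get().orElseGet(replayEventLog);
    }

    public static void main(String[] args) {
        // Attribute present: no replay needed.
        String fast = applicationSummary(
            () -> Optional.of("appId=app-1,user=alice"),
            () -> "summary-from-replay");
        // Attribute missing (e.g. the driver predates the feature): replay as usual.
        String slow = applicationSummary(Optional::empty, () -> "summary-from-replay");
        System.out.println(fast + " / " + slow);
    }
}
```

The fallback keeps old logs readable, which is what lets the feature be rolled out (or disabled) without breaking existing history data.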
[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388322#comment-16388322 ] Wenchen Fan commented on SPARK-23325: - The problem is that `Row` is a stable class that Spark promises not to change across versions; `InternalRow` is not. I agree it's hard to output either `Row` or `UnsafeRow`, so we should allow users to produce `InternalRow` directly. I missed this because I was only considering performance at the time. But I think we should keep the interface producing `Row` until we can make `InternalRow` stable.
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388223#comment-16388223 ] Marcelo Vanzin commented on SPARK-18673: We can't close this because Spark is not using the latest version of Hive. So even if Hive is fixed, Spark is still not.

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> ------------------------------------------------------------------
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT
> Reporter: Steve Loughran
> Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own Hive 1.2.x JAR, it will need to be updated to match.
[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388411#comment-16388411 ] Ryan Blue commented on SPARK-23325: --- I agree that we should declare {{InternalRow}} stable. It is effectively stable, as [~joseph.torres] argues. And by _far_ the easiest way to produce {{UnsafeRow}} is to produce {{InternalRow}} first and use Spark to convert to unsafe. If we're already relying on it there, we may as well have Spark handle the unsafe projection!
[jira] [Created] (SPARK-23613) Different Analyzed logical plan data types for the same table in different queries
Ramandeep Singh created SPARK-23613: --- Summary: Different Analyzed logical plan data types for the same table in different queries Key: SPARK-23613 URL: https://issues.apache.org/jira/browse/SPARK-23613 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Environment: Spark 2.2.0, Hive 2 Reporter: Ramandeep Singh

Hi,
The column datatypes are correctly analyzed for a simple select query. Note that the problematic column is not selected anywhere in the complicated scenario.
Let's say: select * from a;
Now let's say there is a query involving a temporary view on another table and its join with this table. Let's call that table b (a temporary view on a dataframe):
select * from jq (select a.col1, b.col2 from a, b where a.col3 = b.col3)
This fails with an exception on a column that is not part of the projection in the join query:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `a`.col5 from decimal(8,0) to col5#1234: decimal(6,2) as it may truncate.
[jira] [Updated] (SPARK-23614) Union produces incorrect results when caching is used
[ https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Morten Hornbech updated SPARK-23614: -- Description: We just upgraded from 2.2 to 2.3 and our test suite caught this error:

{code:java}
case class TestData(x: Int, y: Int, z: Int)

val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache()
val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
val group2 = frame.groupBy("x").agg(min(col("z")) as "value")

group1.union(group2).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    2|
// |  4|    5|
// |  1|    2|
// |  4|    5|
// +---+-----+

group2.union(group1).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    3|
// |  4|    6|
// |  1|    3|
// |  4|    6|
// +---+-----+
{code}

The error disappears if the first data frame is not cached or if the two group-bys use separate copies. I'm not sure exactly what happens on the inside of Spark, but errors that produce incorrect results rather than exceptions always concern me.

was: (the same description without the {{case class TestData}} line)
> Union produces incorrect results when caching is used
> -----------------------------------------------------
>
> Key: SPARK-23614
> URL: https://issues.apache.org/jira/browse/SPARK-23614
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Morten Hornbech
> Priority: Major
[jira] [Commented] (SPARK-23582) Add interpreted execution to StaticInvoke expression
[ https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388742#comment-16388742 ] Herman van Hovell commented on SPARK-23582: --- That is a good start! I am just wondering if method handles won't be more performant.

> Add interpreted execution to StaticInvoke expression
> ----------------------------------------------------
>
> Key: SPARK-23582
> URL: https://issues.apache.org/jira/browse/SPARK-23582
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Herman van Hovell
> Assignee: Kazuaki Ishizaki
> Priority: Major
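For context on the two invocation strategies being compared for an interpreted StaticInvoke, a generic JDK illustration (not Spark code; method names are chosen for the example): a method handle is resolved once up front and can be invoked with less per-call overhead than classic reflection.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;

public class StaticInvokeSketch {
    // Old-school reflection: look the static method up, then invoke it.
    static int parseViaReflection(String s) {
        try {
            Method m = Integer.class.getMethod("parseInt", String.class);
            return (Integer) m.invoke(null, s);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    // Method handles: resolve once; invokeExact requires the call-site
    // signature to match the handle's type exactly, enabling a faster path.
    static int parseViaHandle(String s) {
        try {
            MethodHandle h = MethodHandles.lookup().findStatic(
                Integer.class, "parseInt",
                MethodType.methodType(int.class, String.class));
            return (int) h.invokeExact(s);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseViaReflection("42") + " " + parseViaHandle("42"));
    }
}
```

In a real interpreter the lookup would be hoisted out of the per-row path and only the invocation would run per row; that caching is where method handles tend to pay off.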
[jira] [Updated] (SPARK-23609) Test EnsureRequirements's test cases to eliminate ShuffleExchange while is not expected
[ https://issues.apache.org/jira/browse/SPARK-23609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-23609: -- Summary: Test EnsureRequirements's test cases to eliminate ShuffleExchange while is not expected (was: Test code does not conform to the test title)

> Test EnsureRequirements's test cases to eliminate ShuffleExchange while is not expected
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-23609
> URL: https://issues.apache.org/jira/browse/SPARK-23609
> Project: Spark
> Issue Type: Improvement
> Components: SQL, Tests
> Affects Versions: 2.4.0
> Reporter: caoxuewen
> Priority: Minor
>
> Currently, in the EnsureRequirements test cases for eliminating ShuffleExchange, the test code does not match the stated purpose of the test. These test cases are as follows:
> 1. test("EnsureRequirements eliminates Exchange if child has same partitioning")
> The checking condition asserts that the number of ShuffleExchange nodes in the physical plan == 2, rather than that there is no ShuffleExchange. It's not accurate here.
> 2. test("EnsureRequirements does not eliminate Exchange with different partitioning")
> The purpose of the test is to not eliminate ShuffleExchange, but its test code is the same as test("EnsureRequirements eliminates Exchange if child has same partitioning")
[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388520#comment-16388520 ] Wenchen Fan commented on SPARK-23325: - Making `InternalRow` stable is not only about stabilizing the interfaces, but also the semantics of the data types and their data structures. E.g., timestamp type is microseconds from the Unix epoch in Spark, string type is a UTF-8 encoded string via the `UTF8String` class, map type is a combination of 2 arrays, etc. cc [~rxin] and [~marmbrus] for broader discussion.
[jira] [Commented] (SPARK-23582) Add interpreted execution to StaticInvoke expression
[ https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388729#comment-16388729 ] Kazuaki Ishizaki commented on SPARK-23582: -- I see. Now, I have a prototype using old-school reflection.
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388757#comment-16388757 ] Darek commented on SPARK-18673: --- When running the pyspark tests with Hadoop 3.0.0 I am not getting the java.lang.IllegalArgumentException, but I am getting ClassNotFoundException: org.apache.hadoop.hive.ql.metadata.HiveException. Who can help move this ticket forward? Thanks
[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388761#comment-16388761 ] Saisai Shao commented on SPARK-23534: - I don't think so. Spark uses its own forked Hive version (hive-1.2.1.spark2), which doesn't include HIVE-15016 or HIVE-18550; those two patches landed only in the Hive community's Hive, not Spark's. Unless we shift to using the Hive community's Hive, or patch our own forked Hive, this will remain a blocker.

> Spark run on Hadoop 3.0.0
> -------------------------
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 2.3.0
> Reporter: Saisai Shao
> Priority: Major
>
> Major Hadoop vendors already ship, or will soon ship, Hadoop 3.0, so we should make sure Spark can also run with Hadoop 3.0. This Jira tracks the work to make Spark run on Hadoop 3.0.
> The work includes:
> # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
> # Test to see if there are dependency issues with Hadoop 3.0.
> # Investigate the feasibility of using shaded client jars (HADOOP-11804).
[jira] [Created] (SPARK-23615) Add maxDF Parameter to Python CountVectorizer
Bryan Cutler created SPARK-23615: Summary: Add maxDF Parameter to Python CountVectorizer Key: SPARK-23615 URL: https://issues.apache.org/jira/browse/SPARK-23615 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: Bryan Cutler

The maxDF parameter is for filtering out frequently occurring terms. This param was recently added to the Scala CountVectorizer and needs to be added to Python as well.
[jira] [Created] (SPARK-23616) Streaming self-join using SQL throws resolution exceptions
Tathagata Das created SPARK-23616: - Summary: Streaming self-join using SQL throws resolution exceptions Key: SPARK-23616 URL: https://issues.apache.org/jira/browse/SPARK-23616 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Tathagata Das Assignee: Tathagata Das

Reported on the dev list.

{code}
import org.apache.spark.sql.streaming.Trigger

val jdf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()

jdf.createOrReplaceTempView("table")

val resultdf = spark.sql("select * from table as x inner join table as y on x.offset = y.offset")

resultdf.writeStream
  .outputMode("update")
  .format("console")
  .option("truncate", false)
  .trigger(Trigger.ProcessingTime(1000))
  .start()
{code}

This gives the following error:

{code}
org.apache.spark.sql.AnalysisException: cannot resolve '`x.offset`' given input columns: [x.value, x.offset, x.key, x.timestampType, x.topic, x.timestamp, x.partition]; line 1 pos 50;
'Project [*]
+- 'Join Inner, ('x.offset = 'y.offset)
   :- SubqueryAlias x
   :  +- SubqueryAlias table
   :     +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, offset#32L, timestamp#33, timestampType#34]
   +- SubqueryAlias y
      +- SubqueryAlias table
         +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, offset#32L, timestamp#33, timestampType#34]
{code}
[jira] [Created] (SPARK-23614) Union produces incorrect results when caching is used
Morten Hornbech created SPARK-23614: --- Summary: Union produces incorrect results when caching is used Key: SPARK-23614 URL: https://issues.apache.org/jira/browse/SPARK-23614 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Morten Hornbech We just upgraded from 2.2 to 2.3 and our test suite caught this error: {code:java} val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache() val group1 = frame.groupBy("x").agg(min(col("y")) as "value") val group2 = frame.groupBy("x").agg(min(col("z")) as "value") group1.union(group2).show() // +---+-+ // | x|value| // +---+-+ // | 1| 2| // | 4| 5| // | 1| 2| // | 4| 5| // +---+-+ group2.union(group1).show() // +---+-+ // | x|value| // +---+-+ // | 1| 3| // | 4| 6| // | 1| 3| // | 4| 6| // +---+-+ {code} The error disappears if the first data frame is not cached or if the two group by's use separate copies. I'm not sure exactly what happens on the insides of Spark, but errors that produce incorrect results rather than exceptions always concerns me. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
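The behavior the test suite expects from SPARK-23614's snippet can be sketched in plain Python, with no Spark involved: grouping the same rows by {{x}} and taking {{min(y)}} versus {{min(z)}} must yield different value columns, whereas the report shows the second aggregation's results replaced by the first's when the input is cached. This is only an illustration of the correct semantics, not Spark's execution path.

```python
# Plain-Python sketch of the aggregations in the report.
# Rows are TestData(x, y, z); group by x and take min of one column.
rows = [(1, 2, 3), (4, 5, 6)]  # (x, y, z)

def group_min(rows, col):
    """Group rows by x (index 0) and take the min of column `col` (1=y, 2=z)."""
    out = {}
    for r in rows:
        x, v = r[0], r[col]
        out[x] = min(out.get(x, v), v)
    return sorted(out.items())

group1 = group_min(rows, 1)  # min(y) per x
group2 = group_min(rows, 2)  # min(z) per x

# A correct union contains both distinct result sets; the bug report instead
# shows one aggregation's values duplicated for both when the frame is cached.
print(group1 + group2)  # [(1, 2), (4, 5), (1, 3), (4, 6)]
```

The workaround noted in the report matches this model: aggregating over independent (uncached) copies keeps the two plans from being conflated.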
[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression
[ https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388707#comment-16388707 ] Takeshi Yamamuro commented on SPARK-23595: -- ok, If you need help in other tickets, please let me know, too. Thanks! > Add interpreted execution for ValidateExternalType expression > - > > Key: SPARK-23595 > URL: https://issues.apache.org/jira/browse/SPARK-23595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23615) Add maxDF Parameter to Python CountVectorizer
[ https://issues.apache.org/jira/browse/SPARK-23615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-23615: - Component/s: ML > Add maxDF Parameter to Python CountVectorizer > - > > Key: SPARK-23615 > URL: https://issues.apache.org/jira/browse/SPARK-23615 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Minor > > The maxDF parameter is for filtering out frequently occurring terms. This > param was recently added to the Scala CountVectorizer and needs to be added > to Python also. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
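The filtering that {{maxDF}} performs — dropping terms whose document frequency exceeds a cap — can be sketched in plain Python. The real parameter belongs to the Scala (and, per this ticket, Python) {{CountVectorizer}}; the function below only illustrates the semantics, using an absolute document count for the cap.

```python
from collections import Counter

def vocab_with_max_df(docs, max_df):
    """Build a vocabulary, excluding terms that occur in more than max_df documents."""
    # Document frequency: in how many documents does each term appear at least once?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Keep only terms whose document frequency is within the cap.
    return sorted(t for t, n in df.items() if n <= max_df)

docs = [["a", "b"], ["a", "c"], ["a", "d"]]
# "a" occurs in all 3 documents; with max_df=2 it is filtered out
# as a too-frequent (and therefore uninformative) term.
print(vocab_with_max_df(docs, 2))  # ['b', 'c', 'd']
```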
[jira] [Resolved] (SPARK-23616) Streaming self-join using SQL throws resolution exceptions
[ https://issues.apache.org/jira/browse/SPARK-23616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23616. --- Resolution: Duplicate This is the same underlying issue as SPARK-23406. However the error is different due to the use of pure SQL join instead of Dataset join. > Streaming self-join using SQL throws resolution exceptions > -- > > Key: SPARK-23616 > URL: https://issues.apache.org/jira/browse/SPARK-23616 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > > Reported on the dev list. > {code} > import org.apache.spark.sql.streaming.Trigger > val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", > "localhost:9092").option("subscribe", "join_test").option("startingOffsets", > "earliest").load(); > jdf.createOrReplaceTempView("table") > val resultdf = spark.sql("select * from table as x inner join table as y on > x.offset=y.offset") > resultdf.writeStream.outputMode("update").format("console").option("truncate", > false).trigger(Trigger.ProcessingTime(1000)).start() > {code} > This is giving the following error > {code} > org.apache.spark.sql.AnalysisException: cannot resolve '`x.offset`' given > input columns: [x.value, x.offset, x.key, x.timestampType, x.topic, > x.timestamp, x.partition]; line 1 pos 50; > 'Project [*] > +- 'Join Inner, ('x.offset = 'y.offset) > :- SubqueryAlias x > : +- SubqueryAlias table > : +- StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets > -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> > localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, > offset#32L, timestamp#33, timestampType#34] > +- SubqueryAlias y > +- SubqueryAlias table > +- StreamingRelation > 
DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets > -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> > localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, > offset#32L, timestamp#33, timestampType#34] > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388761#comment-16388761 ] Saisai Shao edited comment on SPARK-23534 at 3/7/18 12:37 AM: -- I don't think so. Spark uses its own fork hive version (hive-1.2.1.spark2), which doesn't include HIVE-15016 and HIVE-18550, these two patches only landed in Hive community's Hive, not Spark's Hive. Unless we shift to use Hive community's Hive, or patch our own forked hive, then this will not be a blocker. was (Author: jerryshao): I don't think so. Spark uses its own fork hive version (hive-1.2.1.spark2), which doesn't include HIVE-15016 and HIVE-18550, these two patches only landed in Hive community's Hive, not Spark's Hive. Unless we shift to use Hive community's Hive, or path our own forked hive, then this will not be a blocker. > Spark run on Hadoop 3.0.0 > - > > Key: SPARK-23534 > URL: https://issues.apache.org/jira/browse/SPARK-23534 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > > Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make > sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark > run on Hadoop 3.0. > The work includes: > # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0. > # Test to see if there's dependency issues with Hadoop 3.0. > # Investigating the feasibility to use shaded client jars (HADOOP-11804). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23582) Add interpreted execution to StaticInvoke expression
[ https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23582: Assignee: Kazuaki Ishizaki (was: Apache Spark) > Add interpreted execution to StaticInvoke expression > > > Key: SPARK-23582 > URL: https://issues.apache.org/jira/browse/SPARK-23582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23582) Add interpreted execution to StaticInvoke expression
[ https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388792#comment-16388792 ] Apache Spark commented on SPARK-23582: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/20753 > Add interpreted execution to StaticInvoke expression > > > Key: SPARK-23582 > URL: https://issues.apache.org/jira/browse/SPARK-23582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23582) Add interpreted execution to StaticInvoke expression
[ https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23582: Assignee: Apache Spark (was: Kazuaki Ishizaki) > Add interpreted execution to StaticInvoke expression > > > Key: SPARK-23582 > URL: https://issues.apache.org/jira/browse/SPARK-23582 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23406) Stream-stream self joins does not work
[ https://issues.apache.org/jira/browse/SPARK-23406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388898#comment-16388898 ] Apache Spark commented on SPARK-23406: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/20755 > Stream-stream self joins does not work > -- > > Key: SPARK-23406 > URL: https://issues.apache.org/jira/browse/SPARK-23406 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 2.4.0 > > > Currently stream-stream self join throws the following error > {code} > val df = spark.readStream.format("rate").option("numRowsPerSecond", > "1").option("numPartitions", "1").load() > display(df.withColumn("key", $"value" / 10).join(df.withColumn("key", > $"value" / 5), "key")) > {code} > error: > {code} > Failure when resolving conflicting references in Join: > 'Join UsingJoin(Inner,List(key)) > :- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(10 > as double)) AS key#855] > : +- StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions > -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L] > +- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(5 > as double)) AS key#860] > +- StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions > -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L] > Conflicting attributes: timestamp#850,value#851L > ;; > 'Join UsingJoin(Inner,List(key)) > :- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(10 > as double)) AS key#855] > : +- StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions > -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, 
value#851L] > +- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(5 > as double)) AS key#860] > +- StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions > -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:101) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:378) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:98) > at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:148) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:98) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:101) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:71) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:73) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3063) > at org.apache.spark.sql.Dataset.join(Dataset.scala:787) > at org.apache.spark.sql.Dataset.join(Dataset.scala:756) > at org.apache.spark.sql.Dataset.join(Dataset.scala:731) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23287) Spark scheduler does not remove initial executor if not one job submitted
[ https://issues.apache.org/jira/browse/SPARK-23287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23287: Assignee: Apache Spark > Spark scheduler does not remove initial executor if not one job submitted > - > > Key: SPARK-23287 > URL: https://issues.apache.org/jira/browse/SPARK-23287 > Project: Spark > Issue Type: Bug > Components: Mesos, Scheduler >Affects Versions: 2.2.1 > Environment: Cluster manager - Mesos 1.4.1 > Spark 2.2.1 > spark app configuration: > spark.dynamicAllocation.minExecutors=0 > spark.dynamicAllocation.executorIdleTimeout=25s > spark.dynamicAllocation.initialExecutors=1 > spark.dynamicAllocation.schedulerBacklogTimeout=4s > spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s >Reporter: Pavel Plotnikov >Assignee: Apache Spark >Priority: Minor > > When spark application submitted it deploy initial number of executors. If > none of job has been submitted to application spark doesn't remove initial > executor. > > Cluster manager - Mesos 1.4.1 > Spark 2.2.1 > spark app configuration: > spark.dynamicAllocation.minExecutors=0 > spark.dynamicAllocation.executorIdleTimeout=25s > spark.dynamicAllocation.initialExecutors=1 > spark.dynamicAllocation.schedulerBacklogTimeout=4s > spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
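The removal rule SPARK-23287 expects — an executor idle longer than {{executorIdleTimeout}} is released, down to {{minExecutors}}, even if it is the initial executor and no job was ever submitted — can be sketched as a small simulation. This is plain Python modeling the configured policy, not Spark's actual {{ExecutorAllocationManager}} code.

```python
def executors_to_remove(idle_seconds, idle_timeout, current, min_executors):
    """Return how many idle executors the policy should release.

    idle_seconds: idle time per executor; an executor becomes removable once
    its idle time exceeds idle_timeout, but the total never drops below
    min_executors.
    """
    removable = sum(1 for s in idle_seconds if s > idle_timeout)
    return max(0, min(removable, current - min_executors))

# One initial executor, no job ever submitted, idle for 30s with a 25s
# timeout and minExecutors=0: the expected behavior is that it is removed.
print(executors_to_remove([30], idle_timeout=25, current=1, min_executors=0))  # 1
```

The reported bug is that the real scheduler never applies this rule to the initial executor when no job has run.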
[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed
[ https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23547: Description: !2018-03-07_121010.png! when the hive session closed, we should also cleanup the .pipeout file. was: !2018-03-07_121010.png! when the hive session closed, we should also cleanup the .pipeout file. > Cleanup the .pipeout file when the Hive Session closed > -- > > Key: SPARK-23547 > URL: https://issues.apache.org/jira/browse/SPARK-23547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-07_121010.png > > > !2018-03-07_121010.png! > > when the hive session closed, we should also cleanup the .pipeout file. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23584) Add interpreted execution to NewInstance expression
[ https://issues.apache.org/jira/browse/SPARK-23584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389140#comment-16389140 ] Takeshi Yamamuro commented on SPARK-23584: -- I'm working on it. > Add interpreted execution to NewInstance expression > --- > > Key: SPARK-23584 > URL: https://issues.apache.org/jira/browse/SPARK-23584 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23617) Register a Function without params with Spark SQL Java API
[ https://issues.apache.org/jira/browse/SPARK-23617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Wu updated SPARK-23617: Description: One can register a function using Scala: {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString)}} Now, if I use Java API: {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString());}} The code does not compile. Define UDF0 for Java API? was: One can register a function using Scala: spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) Now, if I use Java API: spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); The code does not compile. Define UDF0 for Java API? > Register a Function without params with Spark SQL Java API > -- > > Key: SPARK-23617 > URL: https://issues.apache.org/jira/browse/SPARK-23617 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.2.1 >Reporter: Paul Wu >Priority: Major > > One can register a function using Scala: > {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString)}} > Now, if I use Java API: > {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString());}} > The code does not compile. Define UDF0 for Java API? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23618) docker-image-tool.sh Fails While Building Image
[ https://issues.apache.org/jira/browse/SPARK-23618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ninad Ingole updated SPARK-23618: - Description: I am trying to build kubernetes image for version 2.3.0, using {code:java} ./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build {code} giving me an issue for docker build error: {code:java} "docker build" requires exactly 1 argument. See 'docker build --help'. Usage: docker build [OPTIONS] PATH | URL | - [flags] Build an image from a Dockerfile {code} Executing the command within the spark distribution directory. Please let me know what's the issue. was: I am trying to build kubernetes image for version 2.3.0, using {code:java} ./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build {code} giving me an issue for docker build error: {code:java} "docker build" requires exactly 1 argument. See 'docker build --help'. Usage: docker build [OPTIONS] PATH | URL | - [flags] Build an image from a Dockerfile {code} Executing the command within the spark distribution directory. Please let me know what's the issue. > docker-image-tool.sh Fails While Building Image > --- > > Key: SPARK-23618 > URL: https://issues.apache.org/jira/browse/SPARK-23618 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Ninad Ingole >Priority: Major > > I am trying to build kubernetes image for version 2.3.0, using > {code:java} > ./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build > {code} > giving me an issue for docker build > error: > {code:java} > "docker build" requires exactly 1 argument. > See 'docker build --help'. > Usage: docker build [OPTIONS] PATH | URL | - [flags] > Build an image from a Dockerfile > {code} > > Executing the command within the spark distribution directory. Please let me > know what's the issue. 

[jira] [Updated] (SPARK-23617) Register a Function without params with Spark SQL Java API
[ https://issues.apache.org/jira/browse/SPARK-23617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Wu updated SPARK-23617: Description: One can register a function using Scala: spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) Now, if I use Java API: spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); The code does not compile. Define UDF0 for Java API? was: One can register a function using Scala: {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) }} Now, if I use Java API: {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); }} The code does not compile. Define UDF0 for Java API? > Register a Function without params with Spark SQL Java API > -- > > Key: SPARK-23617 > URL: https://issues.apache.org/jira/browse/SPARK-23617 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.2.1 >Reporter: Paul Wu >Priority: Major > > One can register a function using Scala: > spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) > Now, if I use Java API: > spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); > The code does not compile. Define UDF0 for Java API? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23617) Register a Function without params with Spark SQL Java API
[ https://issues.apache.org/jira/browse/SPARK-23617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Wu updated SPARK-23617: Description: One can register a function using Scala: {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) }} Now, if I use Java API: {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); }} The code does not compile. Define UDF0 for Java API? was: One can register a function using Scala: {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) }} Now, if I use Java API: {{ spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); }} The code does not compile. Define UDF0 for Java API? > Register a Function without params with Spark SQL Java API > -- > > Key: SPARK-23617 > URL: https://issues.apache.org/jira/browse/SPARK-23617 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.2.1 >Reporter: Paul Wu >Priority: Major > > One can register a function using Scala: > {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) }} > Now, if I use Java API: > {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); }} > The code does not compile. Define UDF0 for Java API? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389150#comment-16389150 ] Franck Tago commented on SPARK-23519: - Any updates on this ? > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view form the table. [ I did this from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
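The assertion in the stack trace fires in {{ViewHelper.generateViewProperties}}. Its duplicate check can be sketched in plain Python; note that it inspects the query output ({{col1, col1}}), not the view's explicit column list ({{int1, int2}}) — which is arguably why the command fails even though the user supplied unique names.

```python
def check_view_output(output_cols):
    """Mirror the failing assertion: the view output must not repeat a name."""
    seen = set()
    for c in output_cols:
        if c in seen:
            raise AssertionError(
                "The view output (%s) contains duplicate column name."
                % ",".join(output_cols))
        seen.add(c)

check_view_output(["int1", "int2"])        # distinct names pass
try:
    check_view_output(["col1", "col1"])    # raw query output fails
except AssertionError as e:
    print(e)
```

A likely workaround (untested here) is to alias the duplicated column inside the query itself, e.g. `create view aview as select col1 as int1, col1 as int2 from atable`, so the query output already carries unique names.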
[jira] [Commented] (SPARK-23287) Spark scheduler does not remove initial executor if not one job submitted
[ https://issues.apache.org/jira/browse/SPARK-23287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1635#comment-1635 ] Apache Spark commented on SPARK-23287: -- User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/20754 > Spark scheduler does not remove initial executor if not one job submitted > - > > Key: SPARK-23287 > URL: https://issues.apache.org/jira/browse/SPARK-23287 > Project: Spark > Issue Type: Bug > Components: Mesos, Scheduler >Affects Versions: 2.2.1 > Environment: Cluster manager - Mesos 1.4.1 > Spark 2.2.1 > spark app configuration: > spark.dynamicAllocation.minExecutors=0 > spark.dynamicAllocation.executorIdleTimeout=25s > spark.dynamicAllocation.initialExecutors=1 > spark.dynamicAllocation.schedulerBacklogTimeout=4s > spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s >Reporter: Pavel Plotnikov >Priority: Minor > > When spark application submitted it deploy initial number of executors. If > none of job has been submitted to application spark doesn't remove initial > executor. > > Cluster manager - Mesos 1.4.1 > Spark 2.2.1 > spark app configuration: > spark.dynamicAllocation.minExecutors=0 > spark.dynamicAllocation.executorIdleTimeout=25s > spark.dynamicAllocation.initialExecutors=1 > spark.dynamicAllocation.schedulerBacklogTimeout=4s > spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23287) Spark scheduler does not remove initial executor if not one job submitted
[ https://issues.apache.org/jira/browse/SPARK-23287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23287: Assignee: (was: Apache Spark) > Spark scheduler does not remove initial executor if not one job submitted > - > > Key: SPARK-23287 > URL: https://issues.apache.org/jira/browse/SPARK-23287 > Project: Spark > Issue Type: Bug > Components: Mesos, Scheduler >Affects Versions: 2.2.1 > Environment: Cluster manager - Mesos 1.4.1 > Spark 2.2.1 > spark app configuration: > spark.dynamicAllocation.minExecutors=0 > spark.dynamicAllocation.executorIdleTimeout=25s > spark.dynamicAllocation.initialExecutors=1 > spark.dynamicAllocation.schedulerBacklogTimeout=4s > spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s >Reporter: Pavel Plotnikov >Priority: Minor > > When spark application submitted it deploy initial number of executors. If > none of job has been submitted to application spark doesn't remove initial > executor. > > Cluster manager - Mesos 1.4.1 > Spark 2.2.1 > spark app configuration: > spark.dynamicAllocation.minExecutors=0 > spark.dynamicAllocation.executorIdleTimeout=25s > spark.dynamicAllocation.initialExecutors=1 > spark.dynamicAllocation.schedulerBacklogTimeout=4s > spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23495) Creating a json file using a dataframe Generates an issue
[ https://issues.apache.org/jira/browse/SPARK-23495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-23495: -- Target Version/s: (was: 2.1.0) Fix Version/s: (was: 2.1.0) > Creating a json file using a dataframe Generates an issue > - > > Key: SPARK-23495 > URL: https://issues.apache.org/jira/browse/SPARK-23495 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: AIT OUFKIR >Priority: Major > Original Estimate: 4h > Remaining Estimate: 4h > > Issue happen when trying to create json file using a dataframe (see code > below) > from pyspark.sql import SQLContext > a = ["a1","a2"] > b = ["b1","b2","b3"] > c = ["c1","c2","c3", "c4"] > d = \{'d1':1, 'd2':2} > e = \{'e1':1, 'e2':2, 'e3':3} > f = ['f1','f2','f3'] > g = ['g1','g2','g3','g4'] > metadata_dump = dict(asi=a, basi=b, casi = c, dasi=d, fasi=f, > gasi=g{color:#ff}, easi=e{color}) > md = sqlContext.createDataFrame([metadata_dump]).collect() > metadata = sqlContext.createDataFrame(md,['asi', 'basi', > 'casi','dasi','fasi', 'gasi', 'easi']) > metadata_path = "/folder/fileNameErr" > metadata.write.mode('overwrite').json(metadata_path) > {"{color:#14892c}asi":["a1","a2"],"basi":["b1","b2","b3"],"casi":["c1","c2","c3","c4"],"dasi":\{"d1":1,"d2":2{color}},"fasi":\{"e1":1,"e2":2,"e3":3},"gasi":["f1","f2","f3"],"easi":["g1","g2","g3","g4{color}"]} > > when switching the dictionary e > > metadata_dump = dict(asi=a, basi=b, casi = c, dasi=d{color:#ff}*, > easi=e*{color}, fasi=f, gasi=g) > md = sqlContext.createDataFrame([metadata_dump]).collect() > metadata = sqlContext.createDataFrame(md,['asi', 'basi', 'casi','dasi', > {color:#ff}*'easi',*{color}'fasi', 'gasi']) > metadata_path = "/folder/fileNameCorr" > metadata.write.mode('overwrite').json(metadata_path) > 
{color:#14892c}{"asi":["a1","a2"],"basi":["b1","b2","b3"],"casi":["c1","c2","c3","c4"],"dasi":\\{"d1":1,"d2":2},"easi":\{"e1":1,"e2":2,"e3":3},"fasi":["f1","f2","f3"],"gasi":["g1","g2","g3","g4"]}{color} > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-23607: -- Target Version/s: (was: 2.4.0) Priority: Minor (was: Major) Fix Version/s: (was: 2.4.0) > Use HDFS extended attributes to store application summary to improve the > Spark History Server performance > - > > Key: SPARK-23607 > URL: https://issues.apache.org/jira/browse/SPARK-23607 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 2.3.0 >Reporter: Ye Zhou >Priority: Minor > > Currently in Spark History Server, checkForLogs thread will create replaying > tasks for log files which have file size change. The replaying task will > filter out most of the log file content and keep the application summary > including applicationId, user, attemptACL, start time, end time. The > application summary data will get updated into listing.ldb and serve the > application list on SHS home page. For a long running application, its log > file which name ends with "inprogress" will get replayed for multiple times > to get these application summary. This is a waste of computing and data > reading resource to SHS, which results in the delay for application to get > showing up on home page. Internally we have a patch which utilizes HDFS > extended attributes to improve the performance for getting application > summary in SHS. With this patch, Driver will write the application summary > information into extended attributes as key/value. SHS will try to read from > extended attributes. If SHS fails to read from extended attributes, it will > fall back to read from the log file content as usual. This feature can be > enable/disable through configuration. > It has been running fine for 4 months internally with this patch and the last > updated timestamp on SHS keeps within 1 minute as we configure the interval > to 1 minute. 
Originally we had delays as long as 30 > minutes at our scale, where we have a large number of Spark applications > running per day. > We want to see whether this kind of approach is also acceptable to the community. > Please comment. If so, I will post a pull request with the changes. Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
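The read path proposed above (try the cheap extended-attribute copy first, fall back to replaying the event log) can be sketched in plain Python. All names here are hypothetical stand-ins, not the actual SHS or HDFS API; real code would call Hadoop's `FileSystem.getXAttr` on the log file.

```python
# Sketch of the proposed fallback pattern: a dict stands in for the file's
# HDFS extended attributes, and a callback stands in for the expensive
# full replay of the event log. Names are illustrative only.

def read_summary(xattrs, replay_log):
    """Return the application summary, preferring the xattr fast path."""
    summary = xattrs.get("user.spark.appSummary")  # hypothetical xattr key
    if summary is not None:
        return summary       # fast path: no log replay needed
    return replay_log()      # slow path: parse the full event log as today

# Fast path: the driver wrote the summary into extended attributes.
fast = read_summary({"user.spark.appSummary": {"appId": "app-1", "user": "ye"}},
                    replay_log=lambda: {"appId": "app-1", "user": "ye"})

# Slow path: an older log without xattrs falls back to a full replay.
slow = read_summary({}, replay_log=lambda: {"appId": "app-2", "user": "ye"})
print(fast["appId"], slow["appId"])
```

Because the fallback is total, the feature degrades gracefully for logs written by older drivers, which is what makes it safe to gate behind a configuration flag.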
[jira] [Resolved] (SPARK-23495) Creating a json file using a dataframe Generates an issue
[ https://issues.apache.org/jira/browse/SPARK-23495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23495. --- Resolution: Invalid You've just listed some code and output and not described a problem. > Creating a json file using a dataframe Generates an issue > - > > Key: SPARK-23495 > URL: https://issues.apache.org/jira/browse/SPARK-23495 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: AIT OUFKIR >Priority: Major > Original Estimate: 4h > Remaining Estimate: 4h > > The issue happens when trying to create a json file using a dataframe (see code > below): > from pyspark.sql import SQLContext > a = ["a1","a2"] > b = ["b1","b2","b3"] > c = ["c1","c2","c3", "c4"] > d = {'d1':1, 'd2':2} > e = {'e1':1, 'e2':2, 'e3':3} > f = ['f1','f2','f3'] > g = ['g1','g2','g3','g4'] > metadata_dump = dict(asi=a, basi=b, casi = c, dasi=d, fasi=f, gasi=g, easi=e) > md = sqlContext.createDataFrame([metadata_dump]).collect() > metadata = sqlContext.createDataFrame(md,['asi', 'basi', > 'casi','dasi','fasi', 'gasi', 'easi']) > metadata_path = "/folder/fileNameErr" > metadata.write.mode('overwrite').json(metadata_path) > {"asi":["a1","a2"],"basi":["b1","b2","b3"],"casi":["c1","c2","c3","c4"],"dasi":{"d1":1,"d2":2},"fasi":{"e1":1,"e2":2,"e3":3},"gasi":["f1","f2","f3"],"easi":["g1","g2","g3","g4"]} > > When the dictionary e is moved to a different position: > > metadata_dump = dict(asi=a, basi=b, casi = c, dasi=d, easi=e, fasi=f, gasi=g) > md = sqlContext.createDataFrame([metadata_dump]).collect() > metadata = sqlContext.createDataFrame(md,['asi', 'basi', 'casi','dasi', 'easi', 'fasi', 'gasi']) > metadata_path = "/folder/fileNameCorr" > metadata.write.mode('overwrite').json(metadata_path) > 
{"asi":["a1","a2"],"basi":["b1","b2","b3"],"casi":["c1","c2","c3","c4"],"dasi":{"d1":1,"d2":2},"easi":{"e1":1,"e2":2,"e3":3},"fasi":["f1","f2","f3"],"gasi":["g1","g2","g3","g4"]} > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
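The shifted values in the first output are consistent with a key-ordering mismatch: PySpark builds the Row's fields from the dict's keys in sorted order, so relabeling the collected values with column names in a *different* order puts values under the wrong names. A minimal sketch of that mechanism, without Spark:

```python
# Sketch of the suspected mechanism (an assumption about the report above,
# not PySpark's actual code): field values end up in sorted-key order, then
# user-supplied column names are applied positionally.
d = dict(asi=1, basi=2, casi=3, dasi=4, fasi=5, gasi=6, easi=7)
row_order = sorted(d)                  # alphabetical: easi comes before fasi
values = [d[k] for k in row_order]
user_names = ['asi', 'basi', 'casi', 'dasi', 'fasi', 'gasi', 'easi']
relabeled = dict(zip(user_names, values))
print(relabeled['fasi'])  # → 7, i.e. easi's value: the shift seen above
```

Listing the column names in sorted order (as in the second run) makes the positional relabeling a no-op, which is why that run produced correct output.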
[jira] [Updated] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers
[ https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-23499: -- Affects Version/s: (was: 2.4.0) Target Version/s: (was: 2.2.1, 2.2.2, 2.3.0, 2.3.1) Fix Version/s: (was: 2.4.0) > Mesos Cluster Dispatcher should support priority queues to submit drivers > - > > Key: SPARK-23499 > URL: https://issues.apache.org/jira/browse/SPARK-23499 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Pascal GILLET >Priority: Major > Attachments: Screenshot from 2018-02-28 17-22-47.png > > > As for Yarn, Mesos users should be able to specify priority queues to define > a workload management policy for queued drivers in the Mesos Cluster > Dispatcher. > Submitted drivers are *currently* kept in order of their submission: the > first driver added to the queue will be the first one to be executed (FIFO). > Each driver could have a "priority" associated with it. A driver with high > priority is served (Mesos resources) before a driver with low priority. If > two drivers have the same priority, they are served according to their submit > date in the queue. > To set up such priority queues, the following changes are proposed: > * The Mesos Cluster Dispatcher can optionally be configured with the > _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a > float as value. This adds a new queue named _QueueName_ for submitted drivers > with the specified priority. > Higher numbers indicate higher priority. > The user can then specify multiple queues. > * A driver can be submitted to a specific queue with > _spark.mesos.dispatcher.queue_. This property takes the name of a queue > previously declared in the dispatcher as value. > By default, the dispatcher has a single "default" queue with 0.0 priority > (cannot be overridden). 
If none of the properties above are specified, the > behavior is the same as the current one (i.e. simple FIFO). > Additionally, it is possible to implement a consistent and overall workload > management policy throughout the lifecycle of drivers by mapping these > priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in > the dispatcher to the final states in the Mesos cluster), and by specifying a > _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when > submitting an application. > For example, with the URGENT Mesos role: > {code:java} > # Conf on the dispatcher side > spark.mesos.dispatcher.queue.URGENT=1.0 > # Conf on the driver side > spark.mesos.dispatcher.queue=URGENT > spark.mesos.role=URGENT > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
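The ordering the proposal describes (higher-priority drivers first, FIFO among drivers of equal priority) can be sketched with a heap; the names below are illustrative, not the actual dispatcher code.

```python
import heapq
import itertools

# Sketch of the proposed queueing policy: serve drivers by descending
# priority, breaking ties by submission order (FIFO). Illustrative only.
_counter = itertools.count()   # monotonically increasing submission order
queue = []

def submit(driver, priority=0.0):
    # heapq is a min-heap, so negate the priority to pop the highest first;
    # the counter breaks ties in submission (FIFO) order.
    heapq.heappush(queue, (-priority, next(_counter), driver))

def next_driver():
    return heapq.heappop(queue)[2]

submit("d1")                  # default queue, priority 0.0
submit("d2", priority=1.0)    # e.g. the URGENT queue from the example
submit("d3", priority=1.0)    # same priority as d2, submitted later
order = [next_driver() for _ in range(3)]
print(order)  # → ['d2', 'd3', 'd1']
```

With a single default queue at priority 0.0 this degenerates to the current plain-FIFO behavior, matching the backward-compatibility claim above.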
[jira] [Resolved] (SPARK-21795) Broadcast hint ignored when dataframe is cached
[ https://issues.apache.org/jira/browse/SPARK-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21795. --- Resolution: Duplicate > Broadcast hint ignored when dataframe is cached > --- > > Key: SPARK-21795 > URL: https://issues.apache.org/jira/browse/SPARK-21795 > Project: Spark > Issue Type: Question > Components: Documentation, SQL >Affects Versions: 2.2.0 >Reporter: Lior Chaga >Priority: Minor > > Not sure if it's a bug or by design, but if a DF is cached, the broadcast > hint is ignored, and Spark uses SortMergeJoin. > {code} > val largeDf = ... > var smallDf = ... > smallDf = smallDf.cache > largeDf.join(broadcast(smallDf)) > {code} > It makes sense that there's no need to use cache when using a broadcast join; > however, I wonder if it's the correct behavior for Spark to ignore the > broadcast hint just because the DF is cached. Consider a case where a DF > should be cached for several queries, and on different queries it should be > broadcast. > If this is the correct behavior, it's at least worth documenting that a cached > DF cannot be broadcast. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23617) Register a Function without params with Spark SQL Java API
[ https://issues.apache.org/jira/browse/SPARK-23617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388954#comment-16388954 ] Hyukjin Kwon commented on SPARK-23617: -- Is this a duplicate of SPARK-19285? and does this work in Spark 2.3.0? > Register a Function without params with Spark SQL Java API > -- > > Key: SPARK-23617 > URL: https://issues.apache.org/jira/browse/SPARK-23617 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.2.1 >Reporter: Paul Wu >Priority: Major > > One can register a function using Scala: > {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString)}} > Now, if I use Java API: > {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString());}} > The code does not compile. Define UDF0 for Java API? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed
[ https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23547: Description: when the hive session closed, we should also cleanup the .pipeout file. was: !2018-03-01_202415.png! when the hive session closed, we should also cleanup the .pipeout file. > Cleanup the .pipeout file when the Hive Session closed > -- > > Key: SPARK-23547 > URL: https://issues.apache.org/jira/browse/SPARK-23547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-07_121010.png > > > > > when the hive session closed, we should also cleanup the .pipeout file. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed
[ https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23547: Attachment: (was: 2018-03-01_202415.png) > Cleanup the .pipeout file when the Hive Session closed > -- > > Key: SPARK-23547 > URL: https://issues.apache.org/jira/browse/SPARK-23547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-07_121010.png > > > !2018-03-01_202415.png! > > when the hive session closed, we should also cleanup the .pipeout file. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed
[ https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23547: Attachment: 2018-03-07_121010.png > Cleanup the .pipeout file when the Hive Session closed > -- > > Key: SPARK-23547 > URL: https://issues.apache.org/jira/browse/SPARK-23547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-07_121010.png > > > !2018-03-01_202415.png! > > when the hive session closed, we should also cleanup the .pipeout file. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23617) Register a Function without params with Spark SQL Java API
[ https://issues.apache.org/jira/browse/SPARK-23617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Wu resolved SPARK-23617. - Resolution: Duplicate Fix Version/s: 2.3.0 As commented by Hyukjin Kwon, the issue is a duplicate and has been fixed in 2.3.0. > Register a Function without params with Spark SQL Java API > -- > > Key: SPARK-23617 > URL: https://issues.apache.org/jira/browse/SPARK-23617 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.2.1 >Reporter: Paul Wu >Priority: Major > Fix For: 2.3.0 > > > One can register a function using Scala: > {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString)}} > Now, if I use the Java API: > {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString());}} > The code does not compile. Define UDF0 for the Java API? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
[ https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389135#comment-16389135 ] Reynold Xin commented on SPARK-23325: - Yes perhaps we should do that. It is a lot more work than what you guys think though, because as Wenchen said we need to properly define the semantics of all the data, similar to all of Hadoop IO (Text, etc) but more, because we have more data types. I'd probably prefer us defining the columnar format first, since if one is going after high performance, one'd probably prefer using that one... > DataSourceV2 readers should always produce InternalRow. > --- > > Key: SPARK-23325 > URL: https://issues.apache.org/jira/browse/SPARK-23325 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceV2 row-oriented implementations are limited to producing either > {{Row}} instances or {{UnsafeRow}} instances by implementing > {{SupportsScanUnsafeRow}}. Instead, I think that implementations should > always produce {{InternalRow}}. > The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither > one is appropriate for implementers. > File formats don't produce {{Row}} instances or the data values used by > {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation > that uses {{Row}} instances must produce data that is immediately translated > from the representation that was just produced by Spark. In my experience, it > made little sense to translate a timestamp in microseconds to a > (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass > that instance to Spark for immediate translation back. > On the other hand, {{UnsafeRow}} is very difficult to produce unless data is > already held in memory. Even the Parquet support built into Spark > deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce > unsafe rows. 
When I went to build an implementation that deserializes Parquet > or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be > done without first deserializing into memory because the size of an array > must be known before any values are written. > I ended up deciding to deserialize to {{InternalRow}} and use > {{UnsafeProjection}} to convert to unsafe. There are two problems with this: > first, this is Scala and was difficult to call from Java (it required > reflection), and second, this causes double projection in the physical plan > (a copy for unsafe to unsafe) if there is a projection that wasn't fully > pushed to the data source. > I think the solution is to have a single interface for readers that expects > {{InternalRow}}. Then, a projection should be added in the Spark plan to > convert to unsafe and avoid projection in the plan and in the data source. If > the data source already produces unsafe rows by deserializing directly, this > still minimizes the number of copies because the unsafe projection will check > whether the incoming data is already {{UnsafeRow}}. > Using {{InternalRow}} would also match the interface on the write side. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
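The copy-avoidance argument above (the unsafe projection first checks whether the incoming row is already `UnsafeRow` and passes it through) can be sketched abstractly. The class and function names below are stand-ins for illustration, not Spark's actual `InternalRow`/`UnsafeProjection` API.

```python
# Sketch of the pass-through check described above: converting to the
# "unsafe" format costs nothing for rows that are already unsafe, so
# sources that produce unsafe rows directly are not copied twice.

class Row:               # stand-in for InternalRow
    def __init__(self, values):
        self.values = values

class UnsafeRow(Row):    # stand-in for the binary row format
    pass

def to_unsafe(row):
    if isinstance(row, UnsafeRow):
        return row                       # already unsafe: zero copies
    return UnsafeRow(list(row.values))   # exactly one conversion copy

plain = Row([1, "a"])
unsafe = UnsafeRow([2, "b"])
print(to_unsafe(unsafe) is unsafe)   # pass-through, no copy
print(isinstance(to_unsafe(plain), UnsafeRow))
```

This is why a single Spark-side projection after the reader can replace the per-source `SupportsScanUnsafeRow` path without penalizing sources that already emit unsafe rows.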
[jira] [Assigned] (SPARK-23593) Add interpreted execution for InitializeJavaBean expression
[ https://issues.apache.org/jira/browse/SPARK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23593: Assignee: Apache Spark > Add interpreted execution for InitializeJavaBean expression > --- > > Key: SPARK-23593 > URL: https://issues.apache.org/jira/browse/SPARK-23593 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23593) Add interpreted execution for InitializeJavaBean expression
[ https://issues.apache.org/jira/browse/SPARK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388938#comment-16388938 ] Apache Spark commented on SPARK-23593: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/20756 > Add interpreted execution for InitializeJavaBean expression > --- > > Key: SPARK-23593 > URL: https://issues.apache.org/jira/browse/SPARK-23593 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23593) Add interpreted execution for InitializeJavaBean expression
[ https://issues.apache.org/jira/browse/SPARK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23593: Assignee: (was: Apache Spark) > Add interpreted execution for InitializeJavaBean expression > --- > > Key: SPARK-23593 > URL: https://issues.apache.org/jira/browse/SPARK-23593 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23595) Add interpreted execution for ValidateExternalType expression
[ https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23595: Assignee: (was: Apache Spark) > Add interpreted execution for ValidateExternalType expression > - > > Key: SPARK-23595 > URL: https://issues.apache.org/jira/browse/SPARK-23595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression
[ https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389058#comment-16389058 ] Apache Spark commented on SPARK-23595: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/20757 > Add interpreted execution for ValidateExternalType expression > - > > Key: SPARK-23595 > URL: https://issues.apache.org/jira/browse/SPARK-23595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23595) Add interpreted execution for ValidateExternalType expression
[ https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23595: Assignee: Apache Spark > Add interpreted execution for ValidateExternalType expression > - > > Key: SPARK-23595 > URL: https://issues.apache.org/jira/browse/SPARK-23595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23618) docker-image-tool.sh Fails While Building Image
Ninad Ingole created SPARK-23618: Summary: docker-image-tool.sh Fails While Building Image Key: SPARK-23618 URL: https://issues.apache.org/jira/browse/SPARK-23618 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.3.0 Reporter: Ninad Ingole I am trying to build the Kubernetes image for version 2.3.0, using {code:java} ./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build {code} which fails with a docker build error: {code:java} "docker build" requires exactly 1 argument. See 'docker build --help'. Usage: docker build [OPTIONS] PATH | URL | - [flags] Build an image from a Dockerfile {code} I am executing the command from within the Spark distribution directory. Please let me know what the issue is. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23617) Register a Function without params with Spark SQL Java API
Paul Wu created SPARK-23617: --- Summary: Register a Function without params with Spark SQL Java API Key: SPARK-23617 URL: https://issues.apache.org/jira/browse/SPARK-23617 Project: Spark Issue Type: Improvement Components: Java API, SQL Affects Versions: 2.2.1 Reporter: Paul Wu One can register a function using Scala: {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString)}} Now, if I use the Java API: {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString());}} The code does not compile. Define UDF0 for the Java API? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23266) Matrix Inversion on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-23266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389061#comment-16389061 ] Chandan Misra commented on SPARK-23266: --- I have implemented matrix inversion using Spark version 2.2.0, though the implementation can run on Spark 2.0.0 onwards. It would be really helpful if the inversion were added in the next Spark version. As already mentioned, I have the implementation of the inversion and am happy to contribute it. > Matrix Inversion on BlockMatrix > --- > > Key: SPARK-23266 > URL: https://issues.apache.org/jira/browse/SPARK-23266 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Chandan Misra >Priority: Minor > > Matrix inversion is the basic building block for many other algorithms like > regression, classification, and geostatistical analysis using ordinary kriging. > An efficient distributed divide-and-conquer algorithm based on Spark's > BlockMatrix can be implemented using only *6* block > multiplications in each recursion level of the algorithm. The reference paper > can be found at > [https://arxiv.org/abs/1801.04723] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
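The recursion behind such divide-and-conquer inversion is the textbook Schur-complement identity on a 2x2 block partition (the cited paper reorganizes it so each level needs only 6 block multiplications). A scalar-block sketch of the identity, not the distributed implementation:

```python
def block_inverse(a, b, c, d, inv):
    """Invert a 2x2 block matrix [[a, b], [c, d]] via the Schur complement.
    `inv` inverts a single block; with scalar blocks it is just x -> 1/x,
    which is enough to illustrate the recursion applied to BlockMatrix tiles
    (there, * would be block multiplication and `inv` a recursive call).
    """
    a_inv = inv(a)
    s = d - c * a_inv * b          # Schur complement of the top-left block
    s_inv = inv(s)
    return (a_inv + a_inv * b * s_inv * c * a_inv,  # top-left
            -a_inv * b * s_inv,                     # top-right
            -s_inv * c * a_inv,                     # bottom-left
            s_inv)                                  # bottom-right

# Scalar "blocks": invert [[4, 3], [6, 3]] and compare with the closed form
# (1/det) * [[3, -3], [-6, 4]] with det = -6, i.e. [[-0.5, 0.5], [1, -2/3]].
result = block_inverse(4.0, 3.0, 6.0, 3.0, inv=lambda x: 1.0 / x)
print(result)  # the four blocks of the 2x2 inverse
```

Each level halves the matrix dimension and recurses on two block inversions (`a` and the Schur complement), which maps naturally onto distributed block multiplications.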
[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed
[ https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23547: Attachment: 2018-03-07_121010.png > Cleanup the .pipeout file when the Hive Session closed > -- > > Key: SPARK-23547 > URL: https://issues.apache.org/jira/browse/SPARK-23547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-07_121010.png > > > > > when the hive session closed, we should also cleanup the .pipeout file. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed
[ https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23547: Attachment: (was: 2018-03-07_121010.png) > Cleanup the .pipeout file when the Hive Session closed > -- > > Key: SPARK-23547 > URL: https://issues.apache.org/jira/browse/SPARK-23547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-07_121010.png > > > > > when the hive session closed, we should also cleanup the .pipeout file. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed
[ https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23547: Description: !2018-03-07_121010.png! when the hive session closed, we should also cleanup the .pipeout file. was: when the hive session closed, we should also cleanup the .pipeout file. > Cleanup the .pipeout file when the Hive Session closed > -- > > Key: SPARK-23547 > URL: https://issues.apache.org/jira/browse/SPARK-23547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-07_121010.png > > > !2018-03-07_121010.png! > when the hive session closed, we should also cleanup the .pipeout file. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression
[ https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388114#comment-16388114 ] Takeshi Yamamuro commented on SPARK-23595: -- [~DylanGuedes] oh, I'm already working on it. But, if you want to take over this for practice, I'm ok to leave this to you (cuz I have some pending other tickets). This is my incomplete work here: https://github.com/apache/spark/compare/master...maropu:SPARK-23595 > Add interpreted execution for ValidateExternalType expression > - > > Key: SPARK-23595 > URL: https://issues.apache.org/jira/browse/SPARK-23595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23559) add epoch ID to data writer factory
[ https://issues.apache.org/jira/browse/SPARK-23559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388122#comment-16388122 ] Apache Spark commented on SPARK-23559: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20752 > add epoch ID to data writer factory > --- > > Key: SPARK-23559 > URL: https://issues.apache.org/jira/browse/SPARK-23559 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 3.0.0 > > > To support the StreamWriter lifecycle described in SPARK-22910, epoch ID has > to be specifiable at DataWriter creation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression
[ https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388136#comment-16388136 ] Dylan Guedes commented on SPARK-23595: -- [~maropu] I checked your progress, and it looks like you are almost finished, so it's fine. Anyway, your solution was very enlightening, thank you! > Add interpreted execution for ValidateExternalType expression > - > > Key: SPARK-23595 > URL: https://issues.apache.org/jira/browse/SPARK-23595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23610) Cast of ArrayType of NullType to ArrayType of nullable material type does not work
[ https://issues.apache.org/jira/browse/SPARK-23610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Kießling updated SPARK-23610: - Description: Given a DataFrame that contains a column with _ArrayType of NullType_ casting this column into ArrayType of any material nullable type (e.g. _ArrayType(LongType, true)_ ) should be possible. {code} it("can cast arrays of null type into arrays of nullable material types") { val inputData = Seq( Row(Array()) ).asJava val schema = StructType(Seq( StructField("list", ArrayType(NullType, true), false) )) val data = caps.sparkSession.createDataFrame(inputData, schema) data.withColumn("longList",data.col("list").cast(ArrayType(LongType, true))).show } {code} This test fails with the message: {noformat} NullType (of class org.apache.spark.sql.types.NullType$) scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) at org.apache.spark.sql.catalyst.expressions.Cast.castToLong(Cast.scala:310) at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:516) at org.apache.spark.sql.catalyst.expressions.Cast.castArray(Cast.scala:455) at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:519) at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:531) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:531) at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327) {noformat} was: Given a DataFrame that contains a column with _ArrayType of NullType_ casting this column into ArrayType of any material nullable type (e.g. _ArrayType(LongType, true)_ ) should be possible. 
{code} it("can cast arrays of null type into arrays of nullable material types") { val inputData = Seq( Row(Array()) ).asJava val schema = StructType(Seq( StructField("list", ArrayType(NullType, true), false) )) val data = caps.sparkSession.createDataFrame(inputData, schema) data.withColumn("longList",data.col("list").cast(ArrayType(LongType, true))).show } {code} This test fails with the message: {noformat} NullType (of class org.apache.spark.sql.types.NullType$) scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) at org.apache.spark.sql.catalyst.expressions.Cast.castToLong(Cast.scala:310) at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:516) at org.apache.spark.sql.catalyst.expressions.Cast.castArray(Cast.scala:455) at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:519) at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:531) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:531) at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327) {noformat} > Cast of ArrayType of NullType to ArrayType of nullable material type does not > work > -- > > Key: SPARK-23610 > URL: https://issues.apache.org/jira/browse/SPARK-23610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Max Kießling >Priority: Minor > > Given a DataFrame that contains a column with _ArrayType of NullType_ > casting this column into ArrayType of any material nullable type (e.g. > _ArrayType(LongType, true)_ ) should be possible. 
> {code}
> it("can cast arrays of null type into arrays of nullable material types") {
>   val inputData = Seq(
>     Row(Array())
>   ).asJava
>   val schema = StructType(Seq(
>     StructField("list", ArrayType(NullType, true), false)
>   ))
>   val data = caps.sparkSession.createDataFrame(inputData, schema)
>   data.withColumn("longList", data.col("list").cast(ArrayType(LongType, true))).show
> }
> {code}
> This test fails with the message:
> {noformat}
> NullType (of class org.apache.spark.sql.types.NullType$)
> scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)
> at org.apache.spark.sql.catalyst.expressions.Cast.castToLong(Cast.scala:310)
> at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:516)
> at org.apache.spark.sql.catalyst.expressions.Cast.castArray(Cast.scala:455)
> at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:519)
> at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:531)
> at
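The stack trace points at a pattern match in Cast.castToLong that simply has no case for NullType. A minimal plain-Scala sketch (hypothetical type names, no Spark dependency) reproduces the same failure mode:

```scala
// Plain-Scala illustration of the reported failure mode (type names are
// hypothetical, not Spark's): a pattern match over source types that omits
// the null type throws scala.MatchError at runtime, which is what
// Cast.castToLong does at Cast.scala:310 when handed NullType.
sealed trait DType
case object LongT extends DType
case object NullT extends DType

def castToLong(from: DType): Any => Any = from match {
  case LongT => identity
  // no case for NullT: the same gap the stack trace shows; a fix would
  // presumably add a case mapping every input to null
}

val hitMatchError =
  try { castToLong(NullT); false }
  catch { case _: MatchError => true }
```

This is only an analogy for the control flow, not a claim about how the actual fix should look.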
[jira] [Commented] (SPARK-23537) Logistic Regression without standardization
[ https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387587#comment-16387587 ] Jordi commented on SPARK-23537: --- [~Teng Peng] standardization isn't required for L-BFGS, but it's recommended since it improves convergence. I've been reviewing the code and found some passages I don't fully understand. I added comments hoping that a developer can clarify them: [https://github.com/apache/spark/pull/7080/files#diff-3734f1689cb8a80b07974eb93de0795dR588] [https://github.com/apache/spark/pull/5967/files#diff-3734f1689cb8a80b07974eb93de0795dR201]
> Logistic Regression without standardization
> ---
>
> Key: SPARK-23537
> URL: https://issues.apache.org/jira/browse/SPARK-23537
> Project: Spark
> Issue Type: Bug
> Components: ML, Optimizer
> Affects Versions: 2.0.2, 2.2.1
> Reporter: Jordi
> Priority: Major
> Attachments: non-standardization.log, standardization.log
>
>
> I'm trying to train a Logistic Regression model using Spark 2.2.1. I prefer not to use standardization since all my features are binary, produced with the hashing trick (2^20 sparse vector).
> I trained two models to compare results, expecting to end up with two similar models, since it seems that internally the optimizer performs standardization and "de-standardization" (when it's deactivated) in order to improve convergence.
> Here is the code I used:
> {code:java}
> val lr = new org.apache.spark.ml.classification.LogisticRegression()
>   .setRegParam(0.05)
>   .setElasticNetParam(0.0)
>   .setFitIntercept(true)
>   .setMaxIter(5000)
>   .setStandardization(false)
> val model = lr.fit(data)
> {code}
> The results are puzzling: I end up with two significantly different models.
> *Standardization:*
> Training time: 8 min.
> Iterations: 37
> Intercept: -4.386090107224499
> Max weight: 4.724752299455218
> Min weight: -3.560570478164854
> Mean weight: -0.049325201841722795
> l1 norm: 116710.39522171849
> l2 norm: 402.2581552373957
> Non zero weights: 128084
> Non zero ratio: 0.12215042114257812
> Last 10 LBFGS Val and Grad Norms:
> {code:java}
> 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 0.000559057
> 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 0.000267527
> 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 0.000205888
> 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 0.000144173
> 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 0.000140296
> 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 0.000122709
> 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 3.08789e-05
> 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 2.23806e-05
> 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 1.47422e-05
> 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 2.37442e-05
> {code}
> *No standardization:*
> Training time: 7 h 14 min.
> Iterations: 4992
> Intercept: -4.216690468849263
> Max weight: 0.41930559767624725
> Min weight: -0.5949182537565524
> Mean weight: -1.2659769019012E-6
> l1 norm: 14.262025330648694
> l2 norm: 1.2508777025612263
> Non zero weights: 128955
> Non zero ratio: 0.12298107147216797
> Last 10 LBFGS Val and Grad Norms:
> {code:java}
> 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 0.217581
> 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 0.185812
> 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 0.214570
> 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 0.489464
> 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 0.178448
> 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 0.172527
> 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) 0.189389
> 18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) 0.480678
> 18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) 0.184529
> 18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) 0.154329
> {code}
> Am I missing something?
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
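One reason the two runs above can legitimately converge to different models when regParam > 0: the L2 penalty is applied to the weights in standardized coordinates in one case and in original coordinates in the other, which changes the optimum itself, not just the conditioning. A 1-D ridge-regression sketch (illustrative numbers; not Spark's exact logistic objective) shows the two closed-form solutions diverging whenever the feature scale differs from 1:

```scala
// 1-D ridge regression in closed form (illustrative only, not Spark's exact
// objective): penalizing the weight in original space vs. standardized space
// (x / sigma) gives different optima whenever lambda > 0 and sigma != 1.
val xs = Seq(0.0, 2.0, 4.0, 6.0)
val ys = Seq(0.0, 1.0, 2.0, 3.0)
val lambda = 0.5
val sigma = 2.0 // assumed feature standard deviation, for illustration

val sxx = xs.map(x => x * x).sum                      // sum of x^2
val sxy = xs.zip(ys).map { case (x, y) => x * y }.sum // sum of x*y

// argmin_w 0.5 * sum_i (w * x_i - y_i)^2 + 0.5 * lambda * w^2
val wOriginal = sxy / (sxx + lambda)
// same loss on standardized x/sigma, optimum mapped back to original space:
// w = w' / sigma  =>  w = sxy / (sxx + lambda * sigma^2)
val wStandardized = sxy / (sxx + lambda * sigma * sigma)
```

With lambda = 0 the two expressions coincide, which is consistent with the expectation that, absent regularization, standardization should only affect convergence speed, not the solution.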
[jira] [Created] (SPARK-23611) Extend ExpressionEvalHelper harness to also test failures
Herman van Hovell created SPARK-23611: - Summary: Extend ExpressionEvalHelper harness to also test failures Key: SPARK-23611 URL: https://issues.apache.org/jira/browse/SPARK-23611 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Herman van Hovell -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22450) Safely register class for mllib
[ https://issues.apache.org/jira/browse/SPARK-22450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387630#comment-16387630 ] Richard Wilkinson commented on SPARK-22450: --- Just as an FYI, the change to org.apache.spark.serializer.KryoSerializer#newKryo from (I think) this ticket is a performance hit compared to the implementation in 2.2.1. I am calling org.apache.spark.serializer.KryoSerializer#newInstance a lot, which is probably an issue in itself (hence not raising a bug report), but I'm not aware of how much it is called internally by Spark. I do not have the ML jars on my classpath.
> Safely register class for mllib
> ---
>
> Key: SPARK-22450
> URL: https://issues.apache.org/jira/browse/SPARK-22450
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Xianyang Liu
> Assignee: Xianyang Liu
> Priority: Major
> Fix For: 2.3.0
>
>
> There are still some algorithms based on mllib, such as KMeans. For now, many common mllib classes (such as Vector, DenseVector, SparseVector, Matrix, DenseMatrix, SparseMatrix) are not registered in Kryo, so there are performance issues when serializing or deserializing those objects.
> Previously discussed: https://github.com/apache/spark/pull/19586
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387969#comment-16387969 ] imran shaik commented on SPARK-18492: - truetradescast schema:
root
 |-- Event_Time: long (nullable = true)
 |-- Symbol: string (nullable = true)
 |-- Kline_Start_Time: long (nullable = true)
 |-- Kline_Close_Time: long (nullable = true)
 |-- Open_Price: float (nullable = true)
 |-- Close_Price: float (nullable = true)
 |-- High_Price: float (nullable = true)
 |-- Low_Price: float (nullable = true)
 |-- Base_Asset_Volume: float (nullable = true)
 |-- Number_Of_Trades: long (nullable = true)
 |-- TimeStamp: timestamp (nullable = true)
Can you solve this asap?
> GeneratedIterator grows beyond 64 KB
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
> Reporter: Norris Merritt
> Priority: Major
> Attachments: Screenshot from 2018-03-02 12-57-51.png
>
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "(I[Lscala/collection/Iterator;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF project_scalaUDF1;
> /* 037 */ private scala.Function1 project_catalystConverter1;
> /* 038 */ private scala.Function1 project_converter1;
> /* 039 */ private scala.Function1 project_converter2;
> /* 040 */ private scala.Function2 project_udf1;
> (many omitted lines) ...
> /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF project_scalaUDF1454;
> /* 6090 */ private scala.Function1 project_catalystConverter1454;
> /* 6091 */ private scala.Function1 project_converter1695;
> /* 6092 */ private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext), each of which has totally repetitive sequences of statements pertaining to each of the sequences of variables declared in the class. For example:
> /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) {
> The reason the 64 KB JVM limit on method code is exceeded is that the code generator uses an extremely naive strategy: it emits a sequence like the one shown below for each of the 1,454 groups of variables shown above.
> /* 6132 */ this.project_udf = (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
> many omitted lines ...
> Example of repetitive code sequences emitted for processNext method:
> /* 12253 */ boolean project_isNull247 = project_result244 == null;
> /* 12254 */ MapData project_value247 = null;
> /* 12255 */ if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */ }
> /* 12258 */ Object project_arg = sort_isNull5 ? null : project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */ ArrayData project_result249 = null;
> /* 12261 */ try {
> /* 12262 */ project_result249 =
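Stepping back from the dump, the arithmetic behind the failure is simple: a single generated init method accumulates one statement block per UDF until its bytecode exceeds the JVM's 64 KB per-method limit, while splitting the statements across helper methods keeps every method small. A back-of-the-envelope sketch (the bytes-per-statement figure is an assumption for illustration, not a measured value):

```scala
// Back-of-the-envelope sketch of the 64 KB failure. bytesPerStatement is an
// assumed average bytecode cost per emitted init statement; 1454 is the UDF
// group count from the reported generated code.
val bytesPerStatement = 300
val udfGroups = 1454
val jvmMethodLimit = 64 * 1024 // JVM per-method bytecode limit

// One giant init() holding every statement: far over the limit.
val singleInitMethod = bytesPerStatement * udfGroups

// Chunking the same statements into helper methods of 100 statements each
// keeps every individual method well under the limit.
val statementsPerHelper = 100
val helperMethod = bytesPerStatement * statementsPerHelper
```

This chunked-helper strategy is the kind of method-splitting fix the codegen applies elsewhere (e.g. CodegenContext.splitExpressions); whether it covers this particular UDF-heavy plan is exactly what the ticket is about.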
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387963#comment-16387963 ] Thomas Graves commented on SPARK-22683: --- I left comments on the open PR already, let's move the discussion there > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. 
> - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependent on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
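The sizing rule behind the tasksPerExecutorSlot proposal can be sketched as follows (method and parameter names here are illustrative, not the actual Spark config keys under discussion):

```scala
// Hedged sketch of the proposed executor-sizing rule (illustrative names):
// instead of requesting one task slot per pending task, size the request so
// each slot is expected to run `tasksPerSlot` tasks over the life of the job.
def targetExecutors(pendingTasks: Int,
                    executorCores: Int,
                    taskCpus: Int,
                    tasksPerSlot: Int): Int = {
  val slotsPerExecutor = executorCores / taskCpus
  val tasksPerExecutor = slotsPerExecutor * tasksPerSlot
  // ceiling division: enough executors to cover all pending tasks
  (pendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
}

// 600 pending tasks, 5-core executors, 1 cpu per task:
val current = targetExecutors(600, 5, 1, 1) // today's behaviour: 120 executors
val tuned   = targetExecutors(600, 5, 1, 6) // 6 tasks per slot: 20 executors
```

With tasksPerSlot = 1 the formula reduces to the current one-slot-per-task behaviour; larger values trade some latency for fewer, better-utilized containers, matching the resource-usage reductions reported above.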
[jira] [Assigned] (SPARK-23591) Add interpreted execution for EncodeUsingSerializer expression
[ https://issues.apache.org/jira/browse/SPARK-23591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23591: Assignee: (was: Apache Spark) > Add interpreted execution for EncodeUsingSerializer expression > -- > > Key: SPARK-23591 > URL: https://issues.apache.org/jira/browse/SPARK-23591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23591) Add interpreted execution for EncodeUsingSerializer expression
[ https://issues.apache.org/jira/browse/SPARK-23591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387998#comment-16387998 ] Apache Spark commented on SPARK-23591: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20751 > Add interpreted execution for EncodeUsingSerializer expression > -- > > Key: SPARK-23591 > URL: https://issues.apache.org/jira/browse/SPARK-23591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23591) Add interpreted execution for EncodeUsingSerializer expression
[ https://issues.apache.org/jira/browse/SPARK-23591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23591: Assignee: Apache Spark > Add interpreted execution for EncodeUsingSerializer expression > -- > > Key: SPARK-23591 > URL: https://issues.apache.org/jira/browse/SPARK-23591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388014#comment-16388014 ] Darek commented on SPARK-18673: --- HIVE tickets are closed already, can we close this ticket? > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression
[ https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387864#comment-16387864 ] Herman van Hovell commented on SPARK-23595: --- [~DylanGuedes] feel free to pick this up. I think it is a good starter task, since it is relatively self-contained and very well testable. If you need some inspiration, just take a look at the approach taken by other tickets in the umbrella. Let me know if you need help. > Add interpreted execution for ValidateExternalType expression > - > > Key: SPARK-23595 > URL: https://issues.apache.org/jira/browse/SPARK-23595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23581) Add an interpreted version of GenerateUnsafeProjection
[ https://issues.apache.org/jira/browse/SPARK-23581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387872#comment-16387872 ] Apache Spark commented on SPARK-23581: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/20750 > Add an interpreted version of GenerateUnsafeProjection > -- > > Key: SPARK-23581 > URL: https://issues.apache.org/jira/browse/SPARK-23581 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > > GenerateUnsafeProjection should have an interpreted cousin. See the parent > ticket for the motivation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23601) Remove .md5 files from release
[ https://issues.apache.org/jira/browse/SPARK-23601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23601. --- Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 Resolved by https://github.com/apache/spark/pull/20737 > Remove .md5 files from release > -- > > Key: SPARK-23601 > URL: https://issues.apache.org/jira/browse/SPARK-23601 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 2.3.1, 2.4.0 > > > Per email from Henk to PMCs: > {code} >The Release Distribution Policy[1] changed regarding checksum files. > See under "Cryptographic Signatures and Checksums Requirements" [2]. > MD5-file == a .md5 file > SHA-file == a .sha1, sha256 or .sha512 file >Old policy : > -- MUST provide a MD5-file > -- SHOULD provide a SHA-file [SHA-512 recommended] >New policy : > -- MUST provide a SHA- or MD5-file > -- SHOULD provide a SHA-file > -- SHOULD NOT provide a MD5-file > Providing MD5 checksum files is now discouraged for new releases, > but still allowed for past releases. >Why this change : > -- MD5 is broken for many purposes ; we should move away from it. > https://en.wikipedia.org/wiki/MD5#Overview_of_security_issues >Impact for PMCs : > -- for new releases : > -- please do provide a SHA-file (one or more, if you like) > -- do NOT provide a MD5-file > -- for past releases : > -- you are not required to change anything > -- for artifacts accompanied by a SHA-file /and/ a MD5-file, > it would be nice if you removed the MD5-file > -- if, at the moment, you provide MD5-files, > please adjust your release tooling. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
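For release tooling, the replacement checksum is straightforward to produce. A JDK-only sketch of the SHA-512 computation (actual release scripts would more likely shell out to a tool such as `shasum -a 512` or `gpg`; this just shows the equivalent computation behind a .sha512 file):

```scala
// JDK-only SHA-512 sketch. Real release tooling would typically invoke
// `shasum -a 512` or `gpg`; this is the equivalent in-process computation.
import java.security.MessageDigest

def sha512Hex(bytes: Array[Byte]): String =
  MessageDigest.getInstance("SHA-512")
    .digest(bytes)
    .map(b => f"$b%02x")
    .mkString

// Placeholder input standing in for the artifact's bytes.
val digest = sha512Hex("artifact bytes would go here".getBytes("UTF-8"))
// A .sha512 sidecar file would contain this hex digest plus the file name.
```

SHA-512 digests are 64 bytes, so the hex form is always 128 characters.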
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387938#comment-16387938 ] Thomas Graves commented on SPARK-22683: --- [~jcuquemelle] do you have time to update the PR, otherwise we should close that for now > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. 
> - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387947#comment-16387947 ] Julien Cuquemelle commented on SPARK-22683: --- Yes, I have time. I was waiting for suggestions for the parameter name. how about spark.dynamicAllocation.fullParallelismDivisor (if we agree that parameter could be a double) ? > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. 
> - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs, as already stated) > - as mentioned in another comment, tampering with the exponential ramp-up > might yield task imbalance, and old executors could become contention points > for other executors trying to remotely access blocks on the old executors > (not witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute, to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail:
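The arithmetic behind the tasksPerExecutorSlot proposal can be sketched in a few lines. This is an illustrative model only (the function and parameter names are not Spark internals): the current dynamic-allocation behaviour corresponds to one task per slot, and the proposal divides the executor target by the tasks-per-slot factor.

```python
import math

def target_executors(pending_tasks, executor_cores=5, task_cpus=1, tasks_per_slot=1):
    """Illustrative model of the proposed executor target.

    tasks_per_slot == 1 reproduces the current behaviour (one taskSlot per
    pending task); larger values let each slot absorb several tasks before
    more executors are requested, trading latency for resource usage.
    """
    task_slots_per_executor = executor_cores // task_cpus
    return math.ceil(pending_tasks / (task_slots_per_executor * tasks_per_slot))
```

With 1000 pending tasks and 5-core executors, the default behaviour requests 200 executors, while 6 tasks per slot (the value the reporter found optimal) requests 34.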
[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388017#comment-16388017 ] Darek commented on SPARK-23534: --- SPARK-18673 should be closed since HIVE-15016 and HIVE-18550 are closed. There should be no blockers at this point. > Spark run on Hadoop 3.0.0 > - > > Key: SPARK-23534 > URL: https://issues.apache.org/jira/browse/SPARK-23534 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > > Major Hadoop vendors have already stepped, or soon will step, into Hadoop > 3.0, so we should also make sure Spark can run with Hadoop 3.0. This Jira > tracks the work to make Spark run on Hadoop 3.0. > The work includes: > # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0. > # Test to see if there are dependency issues with Hadoop 3.0. > # Investigate the feasibility of using shaded client jars (HADOOP-11804). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression
[ https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387853#comment-16387853 ] Dylan Guedes commented on SPARK-23595: -- Hi, I would like to help with this issue, but since I am a newcomer I am not sure if it is a good way to start (maybe it is too hard and I don't want to be a bottleneck). I started reading code of the related issues, it is similar? What do you guys think? Thank you! > Add interpreted execution for ValidateExternalType expression > - > > Key: SPARK-23595 > URL: https://issues.apache.org/jira/browse/SPARK-23595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
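For readers new to this family of sub-tasks: ValidateExternalType checks at runtime that an external (user-space) object actually matches the type the schema expects, and "interpreted execution" means evaluating that check directly instead of through generated Java code. A toy model in Python of what the interpreted path has to do (the real expression lives in Catalyst and is written in Scala; the names here are illustrative only):

```python
def validate_external_type(value, expected_type):
    """Toy model of an interpreted ValidateExternalType: pass the value
    through unchanged if it matches the expected external type, otherwise
    fail with a descriptive error. None is allowed, mirroring nullable data.
    """
    if value is not None and not isinstance(value, expected_type):
        raise TypeError(
            f"{type(value).__name__} is not a valid external type for "
            f"a schema of {expected_type.__name__}")
    return value
```

The codegen version emits the equivalent `instanceof` check in generated Java; the interpreted version simply performs it in the eval method.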
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387873#comment-16387873 ] Xiaoju Wu commented on SPARK-17495: --- [~tejasp] I can see HiveHash was merged but never used. It seems the choice between the Spark and Hive hashes is still under discussion; is there any update on this topic? > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > > Spark internally uses Murmur3Hash for partitioning. This is different from > the one used by Hive. For queries which use bucketing this leads to different > results if one tries the same query on both engines. We want users to have > backward compatibility so that one can switch parts of applications across > the engines without observing regressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
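To make the incompatibility concrete: for strings, Hive's bucketing hash is essentially Java's `String.hashCode` applied to the bytes (`h = 31*h + b` with 32-bit wraparound; identical to `hashCode` for ASCII input), whereas Spark uses Murmur3, so the same key lands in different buckets on the two engines. A small sketch of the Hive-style hash:

```python
def java_string_hashcode(s: str) -> int:
    """Java String.hashCode over a string: h = 31*h + char, wrapping at 32
    bits and interpreted as a signed int, as the JVM does. For ASCII data
    this matches the byte-wise hash Hive uses for string bucketing; Spark's
    Murmur3Hash yields different values for the same input."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # reinterpret the unsigned 32-bit result as Java's signed int
    return h - 0x100000000 if h >= 0x80000000 else h
```

Because the bucket is typically `hash(key) mod numBuckets`, the divergent hash functions are exactly why the same bucketed query can return different results across the two engines.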
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387846#comment-16387846 ] Thomas Graves commented on SPARK-16630: --- Yes, something along these lines is what I was thinking. We would want a configurable number of failures (perhaps we can reuse one of the existing settings, but we would need to think about it more) at which point we would blacklist the node due to executor launch failures, and we could have a timeout at which point we could retry. We also want to take into account small clusters, and perhaps stop blacklisting if a certain percentage of the cluster is already blacklisted. > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major > > On YARN, it's possible that a node is broken or misconfigured such that a > container won't launch on it, for instance if the Spark external shuffle > handler didn't get loaded on it, or maybe it's some other hardware or Hadoop > configuration issue. > It would be nice if we could recognize this happening and stop trying to > launch executors on the node, since repeated failures could end up causing us > to hit our max number of executor failures and then kill the job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
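The scheme discussed in the comment (configurable failure threshold, retry timeout, and a cap for small clusters) can be sketched as a small tracker. This is an illustrative model, not Spark code; all names and defaults are hypothetical:

```python
import time

class NodeBlacklistTracker:
    """Sketch of the idea above: blacklist a node after max_failures
    executor-launch failures, let the entry expire after timeout_s so the
    node gets another chance, and never blacklist more than max_fraction
    of the cluster (important for small clusters)."""

    def __init__(self, max_failures=3, timeout_s=3600,
                 max_fraction=0.5, cluster_size=10):
        self.max_failures = max_failures
        self.timeout_s = timeout_s
        self.limit = int(cluster_size * max_fraction)
        self.failures = {}      # node -> launch-failure count
        self.blacklisted = {}   # node -> expiry timestamp

    def record_launch_failure(self, node, now=None):
        now = time.time() if now is None else now
        self.failures[node] = self.failures.get(node, 0) + 1
        if (self.failures[node] >= self.max_failures
                and len(self.blacklisted) < self.limit):
            self.blacklisted[node] = now + self.timeout_s

    def is_blacklisted(self, node, now=None):
        now = time.time() if now is None else now
        expiry = self.blacklisted.get(node)
        if expiry is not None and now >= expiry:
            # timeout elapsed: un-blacklist and reset the failure count
            del self.blacklisted[node]
            self.failures[node] = 0
            return False
        return expiry is not None
```

A real implementation would also need to distinguish launch failures caused by the node from failures caused by the application itself.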
[jira] [Comment Edited] (SPARK-23595) Add interpreted execution for ValidateExternalType expression
[ https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387853#comment-16387853 ] Dylan Guedes edited comment on SPARK-23595 at 3/6/18 2:24 PM: -- Hi, I would like to help with this issue, but since I am a newcomer I am not sure if it is a good way to start (maybe it is too hard and I don't want to be a bottleneck). I started reading code of the related issues, is this one similar? What do you guys think? Thank you! was (Author: dylanguedes): Hi, I would like to help with this issue, but since I am a newcomer I am not sure if it is a good way to start (maybe it is too hard and I don't want to be a bottleneck). I started reading code of the related issues, it is similar? What do you guys think? Thank you! > Add interpreted execution for ValidateExternalType expression > - > > Key: SPARK-23595 > URL: https://issues.apache.org/jira/browse/SPARK-23595 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21048) Add an option --merged-properties-file to distinguish the configuration loading behavior
[ https://issues.apache.org/jira/browse/SPARK-21048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388045#comment-16388045 ] Yaqub Alwan commented on SPARK-21048: --- Commenting here because SPARK-21023 was closed. Contrary to the remark "I think this is confusing relative to any value it adds", I have also found this behaviour of completely ignoring spark-defaults.conf when a properties file is supplied to be counterintuitive (if I am being honest, I would actually say I find it obnoxious, as they're hardly defaults if they get ignored when not explicitly overridden). I would like to see _some_ solution to this problem, as an application developer doesn't want or need to know about cluster-level configuration just to set some application-level properties. I would prefer to see something like --merge-properties-with-defaults in combination with --properties-file, but any implementation works. I am a little concerned by the resistance to having this implemented, when the current behaviour is not intuitive. > Add an option --merged-properties-file to distinguish the configuration > loading behavior > > > Key: SPARK-21048 > URL: https://issues.apache.org/jira/browse/SPARK-21048 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The problem description is the same as > [SPARK-21023|https://issues.apache.org/jira/browse/SPARK-21023], but the > purpose differs: it is not about making sure the default properties file is > always loaded, but about offering another option so users can choose what > they want. > {quote} > {{\-\-properties-file}} user-specified properties file which will replace the > default properties file. deprecated. > {{\-\-replaced-properties-file}} new option which equals the > {{\-\-properties-file}} but more friendly. 
> {{\-\-merged-properties-file}} user-specified properties file which will > merge with the default properties file. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
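The semantics being asked for are simple to state in code: start from the cluster defaults and let the user file override individual keys, rather than replacing the whole file. A minimal sketch, assuming the whitespace-separated `key value` format of spark-defaults.conf (names are illustrative, not spark-submit internals):

```python
def load_props(text):
    """Minimal parser for 'key value' lines as found in spark-defaults.conf;
    blank lines and '#' comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        props[key] = value.strip()
    return props

def merged_properties(defaults_text, user_text):
    """Sketch of the requested --merged-properties-file behaviour: user keys
    override defaults key-by-key instead of discarding the defaults file."""
    props = load_props(defaults_text)
    props.update(load_props(user_text))
    return props
```

Under today's `--properties-file` behaviour, only the user file would be consulted, which is exactly what the commenter objects to.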
[jira] [Created] (SPARK-23612) Specify formats for individual DateType and TimestampType columns in schemas
Patrick Young created SPARK-23612: - Summary: Specify formats for individual DateType and TimestampType columns in schemas Key: SPARK-23612 URL: https://issues.apache.org/jira/browse/SPARK-23612 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 2.3.0 Reporter: Patrick Young [https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200] It would be very helpful if it were possible to specify the format for individual columns in a schema when reading csv files, rather than one format: {code:title=Bar.python|borderStyle=solid} # Currently can only do something like: spark.read.option("dateFormat", "MMdd").csv(...) # Would like to be able to do something like: schema = StructType([ StructField("date1", DateType(format="MM/dd/"), True), StructField("date2", DateType(format="MMdd"), True) ] read.schema(schema).csv(...) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23612) Specify formats for individual DateType and TimestampType columns in schemas
[ https://issues.apache.org/jira/browse/SPARK-23612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Young updated SPARK-23612: -- Description: [https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200] It would be very helpful if it were possible to specify the format for individual columns in a schema when reading csv files, rather than one format: {code:java|title=Bar.python|borderStyle=solid} # Currently can only do something like: spark.read.option("dateFormat", "MMdd").csv(...) # Would like to be able to do something like: schema = StructType([ StructField("date1", DateType(format="MM/dd/"), True), StructField("date2", DateType(format="MMdd"), True) ] read.schema(schema).csv(...) {code} Thanks for any help, input! was: [https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200] It would be very helpful if it were possible to specify the format for individual columns in a schema when reading csv files, rather than one format: {code:title=Bar.python|borderStyle=solid} # Currently can only do something like: spark.read.option("**dateFormat", "MMdd").csv(...) # Would like to be able to do something like: schema = StructType([ StructField("date1", DateType(format="MM/dd/"), True), StructField("date2", DateType(format="MMdd"), True) ] read.schema(schema).csv(...) 
{{{code}}} > Specify formats for individual DateType and TimestampType columns in schemas > > > Key: SPARK-23612 > URL: https://issues.apache.org/jira/browse/SPARK-23612 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Patrick Young >Priority: Minor > Labels: DataType, date, sql > > [https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200] > It would be very helpful if it were possible to specify the format for > individual columns in a schema when reading csv files, rather than one format: > {code:java|title=Bar.python|borderStyle=solid} > # Currently can only do something like: > spark.read.option("dateFormat", "MMdd").csv(...) > # Would like to be able to do something like: > schema = StructType([ > StructField("date1", DateType(format="MM/dd/"), True), > StructField("date2", DateType(format="MMdd"), True) > ] > read.schema(schema).csv(...) > {code} > Thanks for any help, input! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
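Until something like this exists in Spark, the usual workaround is to read date columns as strings and convert each with its own format. The desired per-column semantics can be modelled in plain Python with `strptime` (this sketch is illustrative only; the `parse_row` helper and its format strings are not part of any Spark API):

```python
from datetime import datetime

def parse_row(row, column_formats):
    """Model of the requested behaviour: each date column carries its own
    format; non-date columns pass through unchanged. In Spark today, the
    equivalent workaround is reading those columns as strings and applying
    to_date(col, fmt) column by column."""
    return {
        col: (datetime.strptime(raw, column_formats[col]).date()
              if col in column_formats else raw)
        for col, raw in row.items()
    }
```

The feature request is essentially to attach this per-column format mapping to the schema itself, so the CSV reader applies it during parsing instead of in a post-processing step.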
[jira] [Assigned] (SPARK-23591) Add interpreted execution for EncodeUsingSerializer expression
[ https://issues.apache.org/jira/browse/SPARK-23591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell reassigned SPARK-23591: - Assignee: Marco Gaido > Add interpreted execution for EncodeUsingSerializer expression > -- > > Key: SPARK-23591 > URL: https://issues.apache.org/jira/browse/SPARK-23591 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Marco Gaido >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23590) Add interpreted execution for CreateExternalRow expression
[ https://issues.apache.org/jira/browse/SPARK-23590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-23590. --- Resolution: Fixed Fix Version/s: 2.4.0 > Add interpreted execution for CreateExternalRow expression > -- > > Key: SPARK-23590 > URL: https://issues.apache.org/jira/browse/SPARK-23590 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure
[ https://issues.apache.org/jira/browse/SPARK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388088#comment-16388088 ] Herman van Hovell commented on SPARK-23346: --- [~wuzhilon88] This is completely un-actionable. Describe what you are doing here, and add a reproduction. Otherwise I will close the ticket. > Failed tasks reported as success if the failure reason is not ExceptionFailure > -- > > Key: SPARK-23346 > URL: https://issues.apache.org/jira/browse/SPARK-23346 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0 >Reporter: 吴志龙 >Priority: Critical > Attachments: 企业微信截图_15179714603307.png, 企业微信截图_15179715023606.png > > > !企业微信截图_15179715023606.png! !企业微信截图_15179714603307.png! We have many other > failure reasons, such as TaskResultLost,but the status is success. In the web > ui, we count non-ExceptionFailure failures as successful tasks, which is > highly misleading. > detail message: > Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most > recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor > 27): TaskResultLost (result lost from block manager) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23592) Add interpreted execution for DecodeUsingSerializer expression
[ https://issues.apache.org/jira/browse/SPARK-23592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388075#comment-16388075 ] Marco Gaido commented on SPARK-23592: - I will submit a PR as soon as SPARK-23591 gets merged, thanks > Add interpreted execution for DecodeUsingSerializer expression > -- > > Key: SPARK-23592 > URL: https://issues.apache.org/jira/browse/SPARK-23592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23592) Add interpreted execution for DecodeUsingSerializer expression
[ https://issues.apache.org/jira/browse/SPARK-23592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell reassigned SPARK-23592: - Assignee: Marco Gaido > Add interpreted execution for DecodeUsingSerializer expression > -- > > Key: SPARK-23592 > URL: https://issues.apache.org/jira/browse/SPARK-23592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Marco Gaido >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-23580: -- Labels: release-notes (was: ) > Interpreted mode fallback should be implemented for all expressions & > projections > - > > Key: SPARK-23580 > URL: https://issues.apache.org/jira/browse/SPARK-23580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > Labels: release-notes > > Spark SQL currently does not support interpreted mode for all expressions and > projections. This is a problem for scenarios where code generation does not > work, or blows past the JVM class limits; we currently cannot gracefully > fall back. > This ticket is an umbrella to fix this class of problem in Spark SQL. The > work can be divided into two main areas: > - Add interpreted versions for all dataset related expressions. > - Add an interpreted version of {{GenerateUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
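The fallback scheme this umbrella describes is "try the generated version; if compilation fails (for example, the generated method exceeds the JVM's 64KB limit), run the interpreted version instead." A toy Python sketch of the pattern, with a source-size cap standing in for the JVM limit (all names here are illustrative, not Catalyst's):

```python
class CodegenError(Exception):
    pass

MAX_GENERATED_SOURCE = 100  # stand-in for the JVM's 64KB method-size limit

def compile_projection(exprs):
    """'Codegen' path: build one source string for the whole projection and
    compile it once; fail like real codegen does when the source is too big."""
    src = "lambda row: (" + ", ".join(exprs) + ",)"
    if len(src) > MAX_GENERATED_SOURCE:
        raise CodegenError("generated source too large")
    return eval(compile(src, "<generated>", "eval"))

def interpreted_projection(exprs):
    """Interpreted path: evaluate each expression per row, no compilation."""
    return lambda row: tuple(eval(e, {}, {"row": row}) for e in exprs)

def create_projection(exprs):
    """The graceful fallback the ticket asks for: prefer codegen, but fall
    back to the interpreted projection when codegen fails."""
    try:
        return compile_projection(exprs)
    except CodegenError:
        return interpreted_projection(exprs)
```

The point of the umbrella is that in Spark today the interpreted half of this pattern simply does not exist for many expressions, so there is nothing to fall back to.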
[jira] [Created] (SPARK-23609) Test code does not conform to the test title
caoxuewen created SPARK-23609: - Summary: Test code does not conform to the test title Key: SPARK-23609 URL: https://issues.apache.org/jira/browse/SPARK-23609 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 2.4.0 Reporter: caoxuewen Currently, in EnsureRequirements's test cases for eliminating ShuffleExchange, the test code does not match the stated purpose of the test. These test cases are as follows: 1、test("EnsureRequirements eliminates Exchange if child has same partitioning"): the checking condition is only that there is no ShuffleExchange in the physical plan, which is not accurate here. 2、test("EnsureRequirements does not eliminate Exchange with different partitioning"): the purpose of the test is to not eliminate ShuffleExchange, but its test code is the same as test("EnsureRequirements eliminates Exchange if child has same partitioning") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23609) Test code does not conform to the test title
[ https://issues.apache.org/jira/browse/SPARK-23609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23609: Assignee: Apache Spark > Test code does not conform to the test title > > > Key: SPARK-23609 > URL: https://issues.apache.org/jira/browse/SPARK-23609 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: caoxuewen >Assignee: Apache Spark >Priority: Minor > > Currently, In testing EnsureRequirements's test cases to eliminate > ShuffleExchange, The test code is not in conformity with the purpose of the > test.These test cases are as follows: > 1、test("EnsureRequirements eliminates Exchange if child has same > partitioning") > The checking condition is that there is no ShuffleExchange in the physical > plan. = = 2 It's not accurate here. > 2、test("EnsureRequirements does not eliminate Exchange with different > partitioning") > The purpose of the test is to not eliminate ShuffleExchange, but its test > code is the same as test("EnsureRequirements eliminates Exchange if child has > same partitioning") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23609) Test code does not conform to the test title
[ https://issues.apache.org/jira/browse/SPARK-23609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387470#comment-16387470 ] Apache Spark commented on SPARK-23609: -- User 'heary-cao' has created a pull request for this issue: https://github.com/apache/spark/pull/20747 > Test code does not conform to the test title > > > Key: SPARK-23609 > URL: https://issues.apache.org/jira/browse/SPARK-23609 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: caoxuewen >Priority: Minor > > Currently, In testing EnsureRequirements's test cases to eliminate > ShuffleExchange, The test code is not in conformity with the purpose of the > test.These test cases are as follows: > 1、test("EnsureRequirements eliminates Exchange if child has same > partitioning") > The checking condition is that there is no ShuffleExchange in the physical > plan. = = 2 It's not accurate here. > 2、test("EnsureRequirements does not eliminate Exchange with different > partitioning") > The purpose of the test is to not eliminate ShuffleExchange, but its test > code is the same as test("EnsureRequirements eliminates Exchange if child has > same partitioning") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23609) Test code does not conform to the test title
[ https://issues.apache.org/jira/browse/SPARK-23609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23609: Assignee: (was: Apache Spark) > Test code does not conform to the test title > > > Key: SPARK-23609 > URL: https://issues.apache.org/jira/browse/SPARK-23609 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: caoxuewen >Priority: Minor > > Currently, In testing EnsureRequirements's test cases to eliminate > ShuffleExchange, The test code is not in conformity with the purpose of the > test.These test cases are as follows: > 1、test("EnsureRequirements eliminates Exchange if child has same > partitioning") > The checking condition is that there is no ShuffleExchange in the physical > plan. = = 2 It's not accurate here. > 2、test("EnsureRequirements does not eliminate Exchange with different > partitioning") > The purpose of the test is to not eliminate ShuffleExchange, but its test > code is the same as test("EnsureRequirements eliminates Exchange if child has > same partitioning") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20162) Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18)
[ https://issues.apache.org/jira/browse/SPARK-20162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383471#comment-16383471 ] Caio Quirino da Silva edited comment on SPARK-20162 at 3/6/18 11:53 AM: I have reproduced the problem using Spark 2.2.0 with this snippet: {code:java} case class MyEntity(field: BigDecimal) private val avroFileDir = "abc.avro" def test(): Unit = { val sp = sparkSession import sp.implicits._ val rdd = sparkSession.sparkContext.parallelize(List(MyEntity(BigDecimal(1.23)))) val df = sp.createDataFrame(rdd) df.write.mode(SaveMode.Append).avro(avroFileDir) sp.read.avro(avroFileDir).as[MyEntity].head }{code} So I think that we can reopen this issue... org.apache.spark.sql.AnalysisException: Cannot up cast lambdavariable from string to decimal(38,18) as it may truncate was (Author: caioquirino): I have reproduced the problem using Spark 2.2.0 with this snippet: {code:java} case class MyEntity(field: BigDecimal) val df = ss.createDataFrame(Seq(MyEntity(BigDecimal(1.23)))) df.write.mode(SaveMode.Append).avro("dir.avro") ss.read.avro("dir.avro").as[MyEntity].head {code} So I think that we can reopen this issue... org.apache.spark.sql.AnalysisException: Cannot up cast lambdavariable from string to decimal(38,18) as it may truncate > Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18) > - > > Key: SPARK-20162 > URL: https://issues.apache.org/jira/browse/SPARK-20162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Miroslav Spehar >Priority: Major > > While reading data from MySQL, type conversion doesn't work for the Decimal > type when the decimal in the database has lower precision/scale than the one > Spark expects. 
> Error: > Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up > cast `DECIMAL_AMOUNT` from decimal(30,6) to decimal(38,18) as it may truncate > The type path of the target object is: > - field (class: "org.apache.spark.sql.types.Decimal", name: "DECIMAL_AMOUNT") > - root class: "com.misp.spark.Structure" > You can either add an explicit cast to the input data or choose a higher > precision type of the field in the target object; > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2119) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2141) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2136) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:360) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:358) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:248) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:258) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > at >
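For context on the "may truncate" error in the stack trace above: a decimal(p1, s1) value fits losslessly into decimal(p2, s2) only if the target has at least as many integral digits (p2 - s2 >= p1 - s1) and at least as many fractional digits (s2 >= s1); Spark refuses the upcast when either condition fails. A sketch of that rule (illustrative; the real check is the wideness comparison inside Catalyst's cast analysis):

```python
def upcast_is_safe(from_precision, from_scale, to_precision, to_scale):
    """Sketch of the rule behind 'Cannot up cast ... as it may truncate':
    the target type must hold both the integral digits and the fractional
    digits of the source without loss."""
    integral_ok = (to_precision - to_scale) >= (from_precision - from_scale)
    fractional_ok = to_scale >= from_scale
    return integral_ok and fractional_ok
```

The reported case, decimal(30,6) to decimal(38,18), fails because the source has 24 integral digits and the target only 20, even though both precision and scale are nominally larger.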
[jira] [Resolved] (SPARK-23594) Add interpreted execution for GetExternalRowField expression
[ https://issues.apache.org/jira/browse/SPARK-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-23594. --- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.4.0 > Add interpreted execution for GetExternalRowField expression > > > Key: SPARK-23594 > URL: https://issues.apache.org/jira/browse/SPARK-23594 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23590) Add interpreted execution for CreateExternalRow expression
[ https://issues.apache.org/jira/browse/SPARK-23590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387721#comment-16387721 ] Marco Gaido commented on SPARK-23590: - I am working on this > Add interpreted execution for CreateExternalRow expression > -- > > Key: SPARK-23590 > URL: https://issues.apache.org/jira/browse/SPARK-23590 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20162) Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18)
[ https://issues.apache.org/jira/browse/SPARK-20162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387661#comment-16387661 ]

Caio Quirino da Silva edited comment on SPARK-20162 at 3/6/18 12:08 PM:
------------------------------------------------------------------------

Yes! And I can say that it started failing in version 2.2.x; Spark 2.1.2 is fine.

I have updated my last code snippet to produce a cleaner stack trace:

{code:java}
18/03/06 11:51:10 INFO DAGScheduler: Job 0 finished: save at package.scala:26, took 0.941392 s
18/03/06 11:51:10 INFO FileFormatWriter: Job null committed.
Cannot up cast `field` from string to decimal(38,18) as it may truncate
The type path of the target object is:
- field (class: "scala.math.BigDecimal", name: "field")
- root class: "org.farfetch.bigdata.streaming.MyEntity"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
org.apache.spark.sql.AnalysisException: Cannot up cast `field` from string to decimal(38,18) as it may truncate
The type path of the target object is:
- field (class: "scala.math.BigDecimal", name: "field")
- root class: "org.farfetch.bigdata.streaming.MyEntity"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2123)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2153)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2140)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:336)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:285)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:334)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:245)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:245)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:285)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:245)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:236)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2140)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2136)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
	at
{code}
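The issue title looks paradoxical: decimal(38,18) has a higher precision than decimal(30,6), yet Spark refuses the up-cast as lossy. The reason is that a safe decimal widening must not shrink *either* side of the number: decimal(30,6) reserves 30 - 6 = 24 integer digits, while decimal(38,18) only reserves 38 - 18 = 20, so large values could be truncated. The sketch below illustrates that rule in plain Python; `can_up_cast` is a hypothetical helper for illustration, not Spark's actual implementation (Spark's analyzer is believed to apply an equivalent "is wider than" check on DecimalType).

```python
def can_up_cast(src_precision, src_scale, dst_precision, dst_scale):
    """Loss-free decimal widening check (illustrative, not Spark's code).

    A cast is safe only if the target type keeps at least as many
    integer digits (precision - scale) AND at least as many fractional
    digits (scale) as the source type.
    """
    src_int_digits = src_precision - src_scale
    dst_int_digits = dst_precision - dst_scale
    return dst_int_digits >= src_int_digits and dst_scale >= src_scale

# decimal(30,6) has 24 integer digits; decimal(38,18) keeps only 20,
# so the cast "may truncate" even though 38 > 30 overall.
print(can_up_cast(30, 6, 38, 18))  # -> False

# decimal(20,6) has 14 integer digits, which fits in decimal(38,18)'s 20.
print(can_up_cast(20, 6, 38, 18))  # -> True
```

This also matches the workaround the AnalysisException suggests: either add an explicit cast on the input column (accepting possible truncation) or declare the target field with a type wide enough on both the integer and fractional side.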
[jira] [Assigned] (SPARK-23582) Add interpreted execution to StaticInvoke expression
[ https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell reassigned SPARK-23582:
-----------------------------------------

    Assignee: Kazuaki Ishizaki

> Add interpreted execution to StaticInvoke expression
> ----------------------------------------------------
>
>                 Key: SPARK-23582
>                 URL: https://issues.apache.org/jira/browse/SPARK-23582
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Assignee: Kazuaki Ishizaki
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23611) Extend ExpressionEvalHelper harness to also test failures
[ https://issues.apache.org/jira/browse/SPARK-23611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23611:
------------------------------------

    Assignee: Apache Spark

> Extend ExpressionEvalHelper harness to also test failures
> ---------------------------------------------------------
>
>                 Key: SPARK-23611
>                 URL: https://issues.apache.org/jira/browse/SPARK-23611
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Assignee: Apache Spark
>            Priority: Major
>