[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.

2018-03-06 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388399#comment-16388399
 ] 

Jose Torres commented on SPARK-23325:
-

How hard would it be to just declare that InternalRow is stable? The file has 
been touched about once per year for the past 3 years, and I doubt we'd be able 
to change it to any significant degree without risking serious regressions.

From my perspective, and I think (but correct me if I'm wrong) the perspective 
of the SPIP, a stable interface which can match the performance of Spark's 
internal data sources is one of the core goals of DataSourceV2. If 
high-performance sources must implement InternalRow reads and writes, then 
DataSourceV2 isn't stable until InternalRow is stable anyway.

> DataSourceV2 readers should always produce InternalRow.
> ---
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 row-oriented implementations are limited to producing either 
> {{Row}} instances or {{UnsafeRow}} instances by implementing 
> {{SupportsScanUnsafeRow}}. Instead, I think that implementations should 
> always produce {{InternalRow}}.
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither 
> one is appropriate for implementers.
> File formats don't produce {{Row}} instances or the data values used by 
> {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation 
> that uses {{Row}} instances must produce data that is immediately translated 
> from the representation that was just produced by Spark. In my experience, it 
> made little sense to translate a timestamp in microseconds to a 
> (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass 
> that instance to Spark for immediate translation back.
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is 
> already held in memory. Even the Parquet support built into Spark 
> deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce 
> unsafe rows. When I went to build an implementation that deserializes Parquet 
> or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be 
> done without first deserializing into memory because the size of an array 
> must be known before any values are written.
> I ended up deciding to deserialize to {{InternalRow}} and use 
> {{UnsafeProjection}} to convert to unsafe. There are two problems with this: 
> first, this is Scala and was difficult to call from Java (it required 
> reflection), and second, this causes double projection in the physical plan 
> (a copy for unsafe to unsafe) if there is a projection that wasn't fully 
> pushed to the data source.
> I think the solution is to have a single interface for readers that expects 
> {{InternalRow}}. Then, a projection should be added in the Spark plan to 
> convert to unsafe and avoid projection in the plan and in the data source. If 
> the data source already produces unsafe rows by deserializing directly, this 
> still minimizes the number of copies because the unsafe projection will check 
> whether the incoming data is already {{UnsafeRow}}.
> Using {{InternalRow}} would also match the interface on the write side.
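As a rough illustration of the conversion described above (the schema and values here are made up, not taken from the issue), this is the kind of single-copy conversion an added unsafe projection performs on a reader-produced {{InternalRow}}:
{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// What a reader would emit: internal representations (Long, UTF8String), not Row values.
val internal: InternalRow = InternalRow(1L, UTF8String.fromString("a"))

// What the projection added by Spark would do: one copy into the unsafe format.
val toUnsafe = UnsafeProjection.create(schema)
val unsafe: UnsafeRow = toUnsafe(internal)
{code}
As the description notes, when the incoming row is already an {{UnsafeRow}}, this projection step is where the extra copy can be avoided.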



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23613) Different Analyzed logical plan data types for the same table in different queries

2018-03-06 Thread Ramandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388494#comment-16388494
 ] 

Ramandeep Singh commented on SPARK-23613:
-

To add to it, the query works fine with subquery factoring.

 

with b1 as (select b.* from b)

select * from jq ( select a.col1, b.col2 from a,b1 where a.col3=b1.col3)

 

 

> Different Analyzed logical plan data types for the same table in different 
> queries
> --
>
> Key: SPARK-23613
> URL: https://issues.apache.org/jira/browse/SPARK-23613
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0
> Hive: 2
>Reporter: Ramandeep Singh
>Priority: Blocker
>  Labels: SparkSQL
>
> Hi,
> The column datatypes are correctly analyzed for simple select query. Note 
> that the problematic column is not selected anywhere in the complicated 
> scenario.
> Let's say Select * from a;
> Now let's say there is a query involving temporary view on another table and 
> its join with this table. 
> Let's call that table b (temporary view on a dataframe); 
> select * from jq ( select a.col1, b.col2 from a,b where a.col3=b.col3)
> Fails with an exception on a column that is not part of the projection in the 
> join query:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
> cast `a`.col5 from decimal(8,0) to col5#1234: decimal(6,2) as it may 
> truncate.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance

2018-03-06 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388230#comment-16388230
 ] 

Marcelo Vanzin commented on SPARK-23607:


I think this is a nice trick to speed things up, even though it only works for 
HDFS. I have some ideas on how to have a more generic speed up in this code, 
just haven't had the time to sit down and try them out, but this could help in 
the meantime.

> Use HDFS extended attributes to store application summary to improve the 
> Spark History Server performance
> -
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently in the Spark History Server, the checkForLogs thread creates 
> replaying tasks for log files whose file size has changed. The replaying task 
> filters out most of the log file content and keeps the application summary, 
> including applicationId, user, attemptACL, start time, and end time. The 
> application summary data is then updated into listing.ldb and serves the 
> application list on the SHS home page. For a long-running application, its log 
> file, whose name ends with "inprogress", gets replayed multiple times to obtain 
> this application summary. This wastes computing and data-reading resources in 
> the SHS, which delays applications showing up on the home page. Internally we 
> have a patch which utilizes HDFS extended attributes to improve the performance 
> of getting the application summary in the SHS. With this patch, the driver 
> writes the application summary information into extended attributes as 
> key/value pairs. The SHS tries to read from the extended attributes; if that 
> fails, it falls back to reading the log file content as usual. This feature can 
> be enabled/disabled through configuration.
> We have been running fine with this patch internally for 4 months, and the 
> last-updated timestamp on the SHS stays within 1 minute, as we configure the 
> interval to 1 minute. Originally we had delays as long as 30 minutes at our 
> scale, where we have a large number of Spark applications running per day.
> We want to see whether this kind of approach is also acceptable to the 
> community. Please comment. If so, I will post a pull request for the changes. 
> Thanks.
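For context, a hedged sketch of the extended-attribute round trip described above, using the stock Hadoop FileSystem xattr API; the attribute name, path, and JSON payload here are assumptions for illustration and are not taken from the internal patch:
{code}
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val logPath = new Path("/spark-history/application_1520380800000_0001.inprogress")

// Driver side: attach the summary to the event log as an xattr in the "user." namespace.
val summaryJson =
  """{"appId":"application_1520380800000_0001","user":"alice","startTime":1520380800000}"""
fs.setXAttr(logPath, "user.spark.appSummary", summaryJson.getBytes(StandardCharsets.UTF_8))

// History server side: try the xattr first and fall back to replaying the log on failure.
val summary: Option[String] =
  try {
    Some(new String(fs.getXAttr(logPath, "user.spark.appSummary"), StandardCharsets.UTF_8))
  } catch {
    case _: Exception => None // e.g. attribute missing, or the filesystem does not support xattrs
  }
{code}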



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.

2018-03-06 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388322#comment-16388322
 ] 

Wenchen Fan commented on SPARK-23325:
-

The problem is that `Row` is a stable class that Spark promises won't change 
across versions, while `InternalRow` is not. I agree it's hard to output either 
`Row` or `UnsafeRow`, so we should allow users to produce `InternalRow` 
directly. I missed this because I was only considering performance at the time. 
But I think we should keep the interface producing `Row` until we can make 
`InternalRow` stable.

> DataSourceV2 readers should always produce InternalRow.
> ---
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 row-oriented implementations are limited to producing either 
> {{Row}} instances or {{UnsafeRow}} instances by implementing 
> {{SupportsScanUnsafeRow}}. Instead, I think that implementations should 
> always produce {{InternalRow}}.
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither 
> one is appropriate for implementers.
> File formats don't produce {{Row}} instances or the data values used by 
> {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation 
> that uses {{Row}} instances must produce data that is immediately translated 
> from the representation that was just produced by Spark. In my experience, it 
> made little sense to translate a timestamp in microseconds to a 
> (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass 
> that instance to Spark for immediate translation back.
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is 
> already held in memory. Even the Parquet support built into Spark 
> deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce 
> unsafe rows. When I went to build an implementation that deserializes Parquet 
> or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be 
> done without first deserializing into memory because the size of an array 
> must be known before any values are written.
> I ended up deciding to deserialize to {{InternalRow}} and use 
> {{UnsafeProjection}} to convert to unsafe. There are two problems with this: 
> first, this is Scala and was difficult to call from Java (it required 
> reflection), and second, this causes double projection in the physical plan 
> (a copy for unsafe to unsafe) if there is a projection that wasn't fully 
> pushed to the data source.
> I think the solution is to have a single interface for readers that expects 
> {{InternalRow}}. Then, a projection should be added in the Spark plan to 
> convert to unsafe and avoid projection in the plan and in the data source. If 
> the data source already produces unsafe rows by deserializing directly, this 
> still minimizes the number of copies because the unsafe projection will check 
> whether the incoming data is already {{UnsafeRow}}.
> Using {{InternalRow}} would also match the interface on the write side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-03-06 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388223#comment-16388223
 ] 

Marcelo Vanzin commented on SPARK-18673:


We can't close this because Spark is not using the latest version of Hive. So 
even if Hive is fixed, Spark is still not.

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.

2018-03-06 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388411#comment-16388411
 ] 

Ryan Blue commented on SPARK-23325:
---

I agree that we should declare {{InternalRow}} stable. It is effectively 
stable, as [~joseph.torres] argues. And by _far_ the easiest way to produce 
{{UnsafeRow}} is to produce {{InternalRow}} first and use Spark to convert to 
unsafe. If we're already relying on it there, we may as well have Spark handle 
the unsafe projection!

> DataSourceV2 readers should always produce InternalRow.
> ---
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 row-oriented implementations are limited to producing either 
> {{Row}} instances or {{UnsafeRow}} instances by implementing 
> {{SupportsScanUnsafeRow}}. Instead, I think that implementations should 
> always produce {{InternalRow}}.
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither 
> one is appropriate for implementers.
> File formats don't produce {{Row}} instances or the data values used by 
> {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation 
> that uses {{Row}} instances must produce data that is immediately translated 
> from the representation that was just produced by Spark. In my experience, it 
> made little sense to translate a timestamp in microseconds to a 
> (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass 
> that instance to Spark for immediate translation back.
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is 
> already held in memory. Even the Parquet support built into Spark 
> deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce 
> unsafe rows. When I went to build an implementation that deserializes Parquet 
> or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be 
> done without first deserializing into memory because the size of an array 
> must be known before any values are written.
> I ended up deciding to deserialize to {{InternalRow}} and use 
> {{UnsafeProjection}} to convert to unsafe. There are two problems with this: 
> first, this is Scala and was difficult to call from Java (it required 
> reflection), and second, this causes double projection in the physical plan 
> (a copy for unsafe to unsafe) if there is a projection that wasn't fully 
> pushed to the data source.
> I think the solution is to have a single interface for readers that expects 
> {{InternalRow}}. Then, a projection should be added in the Spark plan to 
> convert to unsafe and avoid projection in the plan and in the data source. If 
> the data source already produces unsafe rows by deserializing directly, this 
> still minimizes the number of copies because the unsafe projection will check 
> whether the incoming data is already {{UnsafeRow}}.
> Using {{InternalRow}} would also match the interface on the write side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23613) Different Analyzed logical plan data types for the same table in different queries

2018-03-06 Thread Ramandeep Singh (JIRA)
Ramandeep Singh created SPARK-23613:
---

 Summary: Different Analyzed logical plan data types for the same 
table in different queries
 Key: SPARK-23613
 URL: https://issues.apache.org/jira/browse/SPARK-23613
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
 Environment: Spark 2.2.0

Hive: 2
Reporter: Ramandeep Singh


Hi,

The column datatypes are correctly analyzed for simple select query. Note that 
the problematic column is not selected anywhere in the complicated scenario.

Let's say Select * from a;

Now let's say there is a query involving temporary view on another table and 
its join with this table. 

Let's call that table b (temporary view on a dataframe); 

select * from jq ( select a.col1, b.col2 from a,b where a.col3=b.col3)

Fails with an exception on a column that is not part of the projection in the join query:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
cast `a`.col5 from decimal(8,0) to col5#1234: decimal(6,2) as it may 
truncate.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23614) Union produces incorrect results when caching is used

2018-03-06 Thread Morten Hornbech (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Morten Hornbech updated SPARK-23614:

Description: 
We just upgraded from 2.2 to 2.3 and our test suite caught this error:
{code:java}
case class TestData(x: Int, y: Int, z: Int)

val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 
6))).cache()
val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
group1.union(group2).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    2|
// |  4|    5|
// |  1|    2|
// |  4|    5|
// +---+-----+
group2.union(group1).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    3|
// |  4|    6|
// |  1|    3|
// |  4|    6|
// +---+-----+
{code}
The error disappears if the first data frame is not cached or if the two group 
by's use separate copies. I'm not sure exactly what happens on the insides of 
Spark, but errors that produce incorrect results rather than exceptions always 
concern me.

  was:
We just upgraded from 2.2 to 2.3 and our test suite caught this error:

{code:java}
val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 
6))).cache()
val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
group1.union(group2).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    2|
// |  4|    5|
// |  1|    2|
// |  4|    5|
// +---+-----+
group2.union(group1).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    3|
// |  4|    6|
// |  1|    3|
// |  4|    6|
// +---+-----+
{code}

The error disappears if the first data frame is not cached or if the two group 
by's use separate copies. I'm not sure exactly what happens on the insides of 
Spark, but errors that produce incorrect results rather than exceptions always 
concern me.


> Union produces incorrect results when caching is used
> -
>
> Key: SPARK-23614
> URL: https://issues.apache.org/jira/browse/SPARK-23614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Morten Hornbech
>Priority: Major
>
> We just upgraded from 2.2 to 2.3 and our test suite caught this error:
> {code:java}
> case class TestData(x: Int, y: Int, z: Int)
> val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 
> 6))).cache()
> val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
> val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
> group1.union(group2).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    2|
> // |  4|    5|
> // |  1|    2|
> // |  4|    5|
> // +---+-----+
> group2.union(group1).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    3|
> // |  4|    6|
> // |  1|    3|
> // |  4|    6|
> // +---+-----+
> {code}
> The error disappears if the first data frame is not cached or if the two 
> group by's use separate copies. I'm not sure exactly what happens on the 
> insides of Spark, but errors that produce incorrect results rather than 
> exceptions always concern me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23582) Add interpreted execution to StaticInvoke expression

2018-03-06 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388742#comment-16388742
 ] 

Herman van Hovell commented on SPARK-23582:
---

That is a good start! I am just wondering if method handles won't be more 
performant.

> Add interpreted execution to StaticInvoke expression
> 
>
> Key: SPARK-23582
> URL: https://issues.apache.org/jira/browse/SPARK-23582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23609) Test EnsureRequirements's test cases to eliminate ShuffleExchange while is not expected

2018-03-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-23609:
--
Summary: Test EnsureRequirements's test cases to eliminate ShuffleExchange 
while is not expected  (was: Test code does not conform to the test title)

> Test EnsureRequirements's test cases to eliminate ShuffleExchange while is 
> not expected
> ---
>
> Key: SPARK-23609
> URL: https://issues.apache.org/jira/browse/SPARK-23609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Minor
>
> Currently, in EnsureRequirements's test cases for eliminating ShuffleExchange, 
> the test code does not match the purpose of the test. These test cases are as 
> follows:
> 1. test("EnsureRequirements eliminates Exchange if child has same 
> partitioning")
>    The condition to check is that there is no ShuffleExchange in the physical 
> plan, but the test only fails when the ShuffleExchange count == 2, which is not 
> accurate here.
> 2. test("EnsureRequirements does not eliminate Exchange with different 
> partitioning")
>    The purpose of the test is to not eliminate ShuffleExchange, but its test 
> code is the same as test("EnsureRequirements eliminates Exchange if child has 
> same partitioning").



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.

2018-03-06 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388520#comment-16388520
 ] 

Wenchen Fan commented on SPARK-23325:
-

Making `InternalRow` stable is not only about stabilizing the interfaces, but 
also the semantics of the data types and their data structures. E.g. the 
timestamp type is microseconds from the Unix epoch in Spark, the string type is 
a UTF-8 encoded string via the `UTF8String` class, the map type is a 
combination of 2 arrays, etc.
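As a concrete illustration (values made up), these are the internal representations a source producing `InternalRow` directly would have to commit to:
{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData}
import org.apache.spark.unsafe.types.UTF8String

// timestamp type: microseconds from the Unix epoch, stored as a Long
val tsMicros: Long = 1520380800000000L
// string type: a UTF-8 encoded string via UTF8String, not java.lang.String
val name = UTF8String.fromString("spark")
// map type: a combination of 2 arrays (keys and values)
val scores = new ArrayBasedMapData(
  new GenericArrayData(Array[Any](UTF8String.fromString("k"))),
  new GenericArrayData(Array[Any](1)))

val row: InternalRow = InternalRow(tsMicros, name, scores)
{code}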

 

cc [~rxin] and [~marmbrus] for broader discussions.

> DataSourceV2 readers should always produce InternalRow.
> ---
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 row-oriented implementations are limited to producing either 
> {{Row}} instances or {{UnsafeRow}} instances by implementing 
> {{SupportsScanUnsafeRow}}. Instead, I think that implementations should 
> always produce {{InternalRow}}.
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither 
> one is appropriate for implementers.
> File formats don't produce {{Row}} instances or the data values used by 
> {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation 
> that uses {{Row}} instances must produce data that is immediately translated 
> from the representation that was just produced by Spark. In my experience, it 
> made little sense to translate a timestamp in microseconds to a 
> (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass 
> that instance to Spark for immediate translation back.
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is 
> already held in memory. Even the Parquet support built into Spark 
> deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce 
> unsafe rows. When I went to build an implementation that deserializes Parquet 
> or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be 
> done without first deserializing into memory because the size of an array 
> must be known before any values are written.
> I ended up deciding to deserialize to {{InternalRow}} and use 
> {{UnsafeProjection}} to convert to unsafe. There are two problems with this: 
> first, this is Scala and was difficult to call from Java (it required 
> reflection), and second, this causes double projection in the physical plan 
> (a copy for unsafe to unsafe) if there is a projection that wasn't fully 
> pushed to the data source.
> I think the solution is to have a single interface for readers that expects 
> {{InternalRow}}. Then, a projection should be added in the Spark plan to 
> convert to unsafe and avoid projection in the plan and in the data source. If 
> the data source already produces unsafe rows by deserializing directly, this 
> still minimizes the number of copies because the unsafe projection will check 
> whether the incoming data is already {{UnsafeRow}}.
> Using {{InternalRow}} would also match the interface on the write side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23582) Add interpreted execution to StaticInvoke expression

2018-03-06 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388729#comment-16388729
 ] 

Kazuaki Ishizaki commented on SPARK-23582:
--

I see. Now, I have a prototype using old-school reflection.

> Add interpreted execution to StaticInvoke expression
> 
>
> Key: SPARK-23582
> URL: https://issues.apache.org/jira/browse/SPARK-23582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-03-06 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388757#comment-16388757
 ] 

Darek commented on SPARK-18673:
---

When running the pyspark tests using Hadoop 3.0.0 I am not getting the 
java.lang.IllegalArgumentException but I am getting ClassNotFoundException: 
org.apache.hadoop.hive.sql.metadata.HiveException.

Who can help to move this ticket forward?

Thanks

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-06 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388761#comment-16388761
 ] 

Saisai Shao commented on SPARK-23534:
-

I don't think so. Spark uses its own fork hive version (hive-1.2.1.spark2), 
which doesn't include HIVE-15016 and HIVE-18550, these two patches only landed 
in Hive community's Hive, not Spark's Hive. Unless we shift to use Hive 
community's Hive, or patch our own forked hive, then this will not be a blocker.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23615) Add maxDF Parameter to Python CountVectorizer

2018-03-06 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23615:


 Summary: Add maxDF Parameter to Python CountVectorizer
 Key: SPARK-23615
 URL: https://issues.apache.org/jira/browse/SPARK-23615
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Bryan Cutler


The maxDF parameter is for filtering out frequently occurring terms.  This 
param was recently added to the Scala CountVectorizer and needs to be added to 
Python also.
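For reference, a small sketch of the Scala-side usage the Python API would mirror (column names and thresholds are made up; this assumes the setMaxDF setter from the recent Scala change):
{code}
import org.apache.spark.ml.feature.CountVectorizer

val cv = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setMinDF(2)    // keep terms that appear in at least 2 documents
  .setMaxDF(0.8)  // drop terms that appear in more than 80% of documents

// cv.fit(wordsDataFrame) would then build the vocabulary with both bounds applied.
{code}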



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23616) Streaming self-join using SQL throws resolution exceptions

2018-03-06 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-23616:
-

 Summary: Streaming self-join using SQL throws resolution exceptions
 Key: SPARK-23616
 URL: https://issues.apache.org/jira/browse/SPARK-23616
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das


Reported on the dev list.
{code}
import org.apache.spark.sql.streaming.Trigger 
val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
"localhost:9092").option("subscribe", "join_test").option("startingOffsets", 
"earliest").load();

jdf.createOrReplaceTempView("table")

val resultdf = spark.sql("select * from table as x inner join table as y on 
x.offset=y.offset")

resultdf.writeStream.outputMode("update").format("console").option("truncate", 
false).trigger(Trigger.ProcessingTime(1000)).start()
{code}

This is giving the following error
{code}
org.apache.spark.sql.AnalysisException: cannot resolve '`x.offset`' given input 
columns: [x.value, x.offset, x.key, x.timestampType, x.topic, x.timestamp, 
x.partition]; line 1 pos 50;
'Project [*]
+- 'Join Inner, ('x.offset = 'y.offset)
 :- SubqueryAlias x
 : +- SubqueryAlias table
 : +- StreamingRelation 
DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets
 -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> 
localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, 
offset#32L, timestamp#33, timestampType#34]
 +- SubqueryAlias y
 +- SubqueryAlias table
 +- StreamingRelation 
DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets
 -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> 
localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, 
offset#32L, timestamp#33, timestampType#34]
{code}

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23614) Union produces incorrect results when caching is used

2018-03-06 Thread Morten Hornbech (JIRA)
Morten Hornbech created SPARK-23614:
---

 Summary: Union produces incorrect results when caching is used
 Key: SPARK-23614
 URL: https://issues.apache.org/jira/browse/SPARK-23614
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Morten Hornbech


We just upgraded from 2.2 to 2.3 and our test suite caught this error:

{code:java}
val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 
6))).cache()
val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
group1.union(group2).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    2|
// |  4|    5|
// |  1|    2|
// |  4|    5|
// +---+-----+
group2.union(group1).show()
// +---+-----+
// |  x|value|
// +---+-----+
// |  1|    3|
// |  4|    6|
// |  1|    3|
// |  4|    6|
// +---+-----+
{code}

The error disappears if the first data frame is not cached or if the two group 
by's use separate copies. I'm not sure exactly what happens on the insides of 
Spark, but errors that produce incorrect results rather than exceptions always 
concern me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression

2018-03-06 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388707#comment-16388707
 ] 

Takeshi Yamamuro commented on SPARK-23595:
--

ok, If you need help in other tickets, please let me know, too. Thanks!

> Add interpreted execution for ValidateExternalType expression
> -
>
> Key: SPARK-23595
> URL: https://issues.apache.org/jira/browse/SPARK-23595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23615) Add maxDF Parameter to Python CountVectorizer

2018-03-06 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23615:
-
Component/s: ML

> Add maxDF Parameter to Python CountVectorizer
> -
>
> Key: SPARK-23615
> URL: https://issues.apache.org/jira/browse/SPARK-23615
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> The maxDF parameter is for filtering out frequently occurring terms.  This 
> param was recently added to the Scala CountVectorizer and needs to be added 
> to Python also.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23616) Streaming self-join using SQL throws resolution exceptions

2018-03-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23616.
---
Resolution: Duplicate

This is the same underlying issue as SPARK-23406. However the error is 
different due to the use of pure SQL join instead of Dataset join.

> Streaming self-join using SQL throws resolution exceptions
> --
>
> Key: SPARK-23616
> URL: https://issues.apache.org/jira/browse/SPARK-23616
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Reported on the dev list.
> {code}
> import org.apache.spark.sql.streaming.Trigger 
> val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:9092").option("subscribe", "join_test").option("startingOffsets", 
> "earliest").load();
> jdf.createOrReplaceTempView("table")
> val resultdf = spark.sql("select * from table as x inner join table as y on 
> x.offset=y.offset")
> resultdf.writeStream.outputMode("update").format("console").option("truncate",
>  false).trigger(Trigger.ProcessingTime(1000)).start()
> {code}
> This is giving the following error
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '`x.offset`' given 
> input columns: [x.value, x.offset, x.key, x.timestampType, x.topic, 
> x.timestamp, x.partition]; line 1 pos 50;
> 'Project [*]
> +- 'Join Inner, ('x.offset = 'y.offset)
>  :- SubqueryAlias x
>  : +- SubqueryAlias table
>  : +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets
>  -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> 
> localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, 
> offset#32L, timestamp#33, timestampType#34]
>  +- SubqueryAlias y
>  +- SubqueryAlias table
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@15f3f9cf,kafka,List(),None,List(),None,Map(startingOffsets
>  -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> 
> localhost:9092),None), kafka, [key#28, value#29, topic#30, partition#31, 
> offset#32L, timestamp#33, timestampType#34]
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-06 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388761#comment-16388761
 ] 

Saisai Shao edited comment on SPARK-23534 at 3/7/18 12:37 AM:
--

I don't think so. Spark uses its own fork hive version (hive-1.2.1.spark2), 
which doesn't include HIVE-15016 and HIVE-18550, these two patches only landed 
in Hive community's Hive, not Spark's Hive. Unless we shift to use Hive 
community's Hive, or patch our own forked hive, then this will not be a blocker.


was (Author: jerryshao):
I don't think so. Spark uses its own fork hive version (hive-1.2.1.spark2), 
which doesn't include HIVE-15016 and HIVE-18550, these two patches only landed 
in Hive community's Hive, not Spark's Hive. Unless we shift to use Hive 
community's Hive, or path our own forked hive, then this will not be a blocker.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23582) Add interpreted execution to StaticInvoke expression

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23582:


Assignee: Kazuaki Ishizaki  (was: Apache Spark)

> Add interpreted execution to StaticInvoke expression
> 
>
> Key: SPARK-23582
> URL: https://issues.apache.org/jira/browse/SPARK-23582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23582) Add interpreted execution to StaticInvoke expression

2018-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388792#comment-16388792
 ] 

Apache Spark commented on SPARK-23582:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20753

> Add interpreted execution to StaticInvoke expression
> 
>
> Key: SPARK-23582
> URL: https://issues.apache.org/jira/browse/SPARK-23582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23582) Add interpreted execution to StaticInvoke expression

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23582:


Assignee: Apache Spark  (was: Kazuaki Ishizaki)

> Add interpreted execution to StaticInvoke expression
> 
>
> Key: SPARK-23582
> URL: https://issues.apache.org/jira/browse/SPARK-23582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23406) Stream-stream self joins does not work

2018-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388898#comment-16388898
 ] 

Apache Spark commented on SPARK-23406:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/20755

> Stream-stream self joins does not work
> --
>
> Key: SPARK-23406
> URL: https://issues.apache.org/jira/browse/SPARK-23406
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently stream-stream self join throws the following error
> {code}
> val df = spark.readStream.format("rate").option("numRowsPerSecond", 
> "1").option("numPartitions", "1").load()
> display(df.withColumn("key", $"value" / 10).join(df.withColumn("key", 
> $"value" / 5), "key"))
> {code}
> error:
> {code}
> Failure when resolving conflicting references in Join:
> 'Join UsingJoin(Inner,List(key))
> :- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(10 
> as double)) AS key#855]
> : +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> +- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(5 
> as double)) AS key#860]
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> Conflicting attributes: timestamp#850,value#851L
> ;;
> 'Join UsingJoin(Inner,List(key))
> :- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(10 
> as double)) AS key#855]
> : +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> +- Project [timestamp#850, value#851L, (cast(value#851L as double) / cast(5 
> as double)) AS key#860]
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@7f1d2a68,rate,List(),None,List(),None,Map(numPartitions
>  -> 1, numRowsPerSecond -> 1),None), rate, [timestamp#850, value#851L]
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:101)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:378)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:98)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:148)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:98)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:101)
>  at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:71)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:73)
>  at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3063)
>  at org.apache.spark.sql.Dataset.join(Dataset.scala:787)
>  at org.apache.spark.sql.Dataset.join(Dataset.scala:756)
>  at org.apache.spark.sql.Dataset.join(Dataset.scala:731)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23287) Spark scheduler does not remove initial executor if not one job submitted

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23287:


Assignee: Apache Spark

> Spark scheduler does not remove initial executor if not one job submitted
> -
>
> Key: SPARK-23287
> URL: https://issues.apache.org/jira/browse/SPARK-23287
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 2.2.1
> Environment: Cluster manager - Mesos 1.4.1
> Spark 2.2.1
> spark app configuration:
> spark.dynamicAllocation.minExecutors=0
> spark.dynamicAllocation.executorIdleTimeout=25s
> spark.dynamicAllocation.initialExecutors=1
> spark.dynamicAllocation.schedulerBacklogTimeout=4s
> spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s
>Reporter: Pavel Plotnikov
>Assignee: Apache Spark
>Priority: Minor
>
> When a Spark application is submitted, it deploys the initial number of 
> executors. If no job has been submitted to the application, Spark doesn't 
> remove the initial executor.
>  
> Cluster manager - Mesos 1.4.1
> Spark 2.2.1
> spark app configuration:
> spark.dynamicAllocation.minExecutors=0
> spark.dynamicAllocation.executorIdleTimeout=25s
> spark.dynamicAllocation.initialExecutors=1
> spark.dynamicAllocation.schedulerBacklogTimeout=4s
> spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-06 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-23547:

Description: 
  !2018-03-07_121010.png!

 

When the Hive session is closed, we should also clean up the .pipeout file.

 

  was:
  !2018-03-07_121010.png!

When the Hive session is closed, we should also clean up the .pipeout file.

 

 


> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-03-07_121010.png
>
>
>   !2018-03-07_121010.png!
>  
> When the Hive session is closed, we should also clean up the .pipeout file.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23584) Add interpreted execution to NewInstance expression

2018-03-06 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389140#comment-16389140
 ] 

Takeshi Yamamuro commented on SPARK-23584:
--

I'm working on it.

> Add interpreted execution to NewInstance expression
> ---
>
> Key: SPARK-23584
> URL: https://issues.apache.org/jira/browse/SPARK-23584
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23617) Register a Function without params with Spark SQL Java API

2018-03-06 Thread Paul Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Wu updated SPARK-23617:

Description: 
One can register a function using Scala:

{{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString)}}

Now, if I use Java API:

{{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString());}}

The code does not compile. Define UDF0 for Java API?

  was:
One can register a function using Scala:

spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) 

Now, if I use Java API:

spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); 

The code does not compile. Define UDF0 for Java API?


> Register a Function without params with Spark SQL Java API
> --
>
> Key: SPARK-23617
> URL: https://issues.apache.org/jira/browse/SPARK-23617
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, SQL
>Affects Versions: 2.2.1
>Reporter: Paul Wu
>Priority: Major
>
> One can register a function using Scala:
> {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString)}}
> Now, if I use Java API:
> {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString());}}
> The code does not compile. Define UDF0 for Java API?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23618) docker-image-tool.sh Fails While Building Image

2018-03-06 Thread Ninad Ingole (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ninad Ingole updated SPARK-23618:
-
Description: 
I am trying to build kubernetes image for version 2.3.0, using 
{code:java}
./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build
{code}
giving me an issue for docker build 

error:
{code:java}
"docker build" requires exactly 1 argument.
See 'docker build --help'.
Usage: docker build [OPTIONS] PATH | URL | - [flags]
Build an image from a Dockerfile
{code}
 

Executing the command within the spark distribution directory. Please let me 
know what's the issue.

 

  was:
I am trying to build kubernetes image for version 2.3.0, using 

 
{code:java}
./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build
{code}
giving me an issue for docker build 

error:

 
{code:java}
"docker build" requires exactly 1 argument.
See 'docker build --help'.
Usage: docker build [OPTIONS] PATH | URL | - [flags]
Build an image from a Dockerfile
{code}
 

Executing the command within the spark distribution directory. Please let me 
know what's the issue.

 


> docker-image-tool.sh Fails While Building Image
> ---
>
> Key: SPARK-23618
> URL: https://issues.apache.org/jira/browse/SPARK-23618
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ninad Ingole
>Priority: Major
>
> I am trying to build kubernetes image for version 2.3.0, using 
> {code:java}
> ./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build
> {code}
> giving me an issue for docker build 
> error:
> {code:java}
> "docker build" requires exactly 1 argument.
> See 'docker build --help'.
> Usage: docker build [OPTIONS] PATH | URL | - [flags]
> Build an image from a Dockerfile
> {code}
>  
> Executing the command within the spark distribution directory. Please let me 
> know what's the issue.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23617) Register a Function without params with Spark SQL Java API

2018-03-06 Thread Paul Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Wu updated SPARK-23617:

Description: 
One can register a function using Scala:

spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) 

Now, if I use Java API:

spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); 

The code does not compile. Define UDF0 for Java API?

  was:
One can register a function using Scala:

{{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) }}

Now, if I use Java API:

{{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); }}

The code does not compile. Define UDF0 for Java API?


> Register a Function without params with Spark SQL Java API
> --
>
> Key: SPARK-23617
> URL: https://issues.apache.org/jira/browse/SPARK-23617
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, SQL
>Affects Versions: 2.2.1
>Reporter: Paul Wu
>Priority: Major
>
> One can register a function using Scala:
> spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) 
> Now, if I use Java API:
> spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); 
> The code does not compile. Define UDF0 for Java API?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23617) Register a Function without params with Spark SQL Java API

2018-03-06 Thread Paul Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Wu updated SPARK-23617:

Description: 
One can register a function using Scala:

{{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) }}

Now, if I use Java API:

{{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); }}

The code does not compile. Define UDF0 for Java API?

  was:
One can register a function using Scala:

{{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) }}

Now, if I use Java API:

{{ spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); }}

The code does not compile. Define UDF0 for Java API?


> Register a Function without params with Spark SQL Java API
> --
>
> Key: SPARK-23617
> URL: https://issues.apache.org/jira/browse/SPARK-23617
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, SQL
>Affects Versions: 2.2.1
>Reporter: Paul Wu
>Priority: Major
>
> One can register a function using Scala:
> {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) }}
> Now, if I use Java API:
> {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); }}
> The code does not compile. Define UDF0 for Java API?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-03-06 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389150#comment-16389150
 ] 

Franck Tago commented on SPARK-23519:
-

Any updates on this?

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Critical
>
> 1. Create and populate a Hive table. I did this in a Hive CLI session [not 
> that this matters]:
> create table atable (col1 int);
> insert into atable values (10), (100);
> 2. Create a view from the table. [I did this from a spark-shell]
> spark.sql("create view default.aview (int1, int2) as select col1, col1 
> from atable ")
> java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)
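
Regarding the repro above: a possible (untested here) workaround sketch is to give 
the repeated column a distinct alias in the underlying query, so that the analyzed 
plan's output names are unique while the view-level column list still exposes them 
as int1 and int2. The class and session setup below are illustrative only.
{code:java}
import org.apache.spark.sql.SparkSession;

public class CreateViewWorkaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("create-view-workaround")
        .enableHiveSupport()
        .getOrCreate();
    // Untested sketch: alias the duplicated column so the underlying query's
    // output names are distinct; the view column list still names them int1, int2.
    spark.sql("CREATE VIEW default.aview (int1, int2) AS "
        + "SELECT col1, col1 AS col1_copy FROM atable");
    spark.stop();
  }
}
{code}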



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23287) Spark scheduler does not remove initial executor if not one job submitted

2018-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1635#comment-1635
 ] 

Apache Spark commented on SPARK-23287:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/20754

> Spark scheduler does not remove initial executor if not one job submitted
> -
>
> Key: SPARK-23287
> URL: https://issues.apache.org/jira/browse/SPARK-23287
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 2.2.1
> Environment: Cluster manager - Mesos 1.4.1
> Spark 2.2.1
> spark app configuration:
> spark.dynamicAllocation.minExecutors=0
> spark.dynamicAllocation.executorIdleTimeout=25s
> spark.dynamicAllocation.initialExecutors=1
> spark.dynamicAllocation.schedulerBacklogTimeout=4s
> spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s
>Reporter: Pavel Plotnikov
>Priority: Minor
>
> When a Spark application is submitted, it deploys the initial number of 
> executors. If no job has been submitted to the application, Spark doesn't 
> remove the initial executor.
>  
> Cluster manager - Mesos 1.4.1
> Spark 2.2.1
> spark app configuration:
> spark.dynamicAllocation.minExecutors=0
> spark.dynamicAllocation.executorIdleTimeout=25s
> spark.dynamicAllocation.initialExecutors=1
> spark.dynamicAllocation.schedulerBacklogTimeout=4s
> spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23287) Spark scheduler does not remove initial executor if not one job submitted

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23287:


Assignee: (was: Apache Spark)

> Spark scheduler does not remove initial executor if not one job submitted
> -
>
> Key: SPARK-23287
> URL: https://issues.apache.org/jira/browse/SPARK-23287
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 2.2.1
> Environment: Cluster manager - Mesos 1.4.1
> Spark 2.2.1
> spark app configuration:
> spark.dynamicAllocation.minExecutors=0
> spark.dynamicAllocation.executorIdleTimeout=25s
> spark.dynamicAllocation.initialExecutors=1
> spark.dynamicAllocation.schedulerBacklogTimeout=4s
> spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s
>Reporter: Pavel Plotnikov
>Priority: Minor
>
> When a Spark application is submitted, it deploys the initial number of 
> executors. If no job has been submitted to the application, Spark doesn't 
> remove the initial executor.
>  
> Cluster manager - Mesos 1.4.1
> Spark 2.2.1
> spark app configuration:
> spark.dynamicAllocation.minExecutors=0
> spark.dynamicAllocation.executorIdleTimeout=25s
> spark.dynamicAllocation.initialExecutors=1
> spark.dynamicAllocation.schedulerBacklogTimeout=4s
> spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23495) Creating a json file using a dataframe Generates an issue

2018-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23495:
--
Target Version/s:   (was: 2.1.0)
   Fix Version/s: (was: 2.1.0)

> Creating a json file using a dataframe Generates an issue
> -
>
> Key: SPARK-23495
> URL: https://issues.apache.org/jira/browse/SPARK-23495
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: AIT OUFKIR
>Priority: Major
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> The issue happens when trying to create a json file using a dataframe (see 
> code below):
> from pyspark.sql import SQLContext
>  a = ["a1","a2"]
>  b = ["b1","b2","b3"]
>  c = ["c1","c2","c3", "c4"]
>  d = {'d1':1, 'd2':2}
>  e = {'e1':1, 'e2':2, 'e3':3}
>  f = ['f1','f2','f3']
>  g = ['g1','g2','g3','g4']
> metadata_dump = dict(asi=a, basi=b, casi = c, dasi=d, fasi=f, gasi=g, easi=e)
>  md = sqlContext.createDataFrame([metadata_dump]).collect()
>  metadata = sqlContext.createDataFrame(md,['asi', 'basi', 
> 'casi','dasi','fasi', 'gasi', 'easi'])
> metadata_path = "/folder/fileNameErr"
>  metadata.write.mode('overwrite').json(metadata_path)
> {"asi":["a1","a2"],"basi":["b1","b2","b3"],"casi":["c1","c2","c3","c4"],"dasi":{"d1":1,"d2":2},"fasi":{"e1":1,"e2":2,"e3":3},"gasi":["f1","f2","f3"],"easi":["g1","g2","g3","g4"]}
>  
> when switching the position of dictionary e:
>  
> metadata_dump = dict(asi=a, basi=b, casi = c, dasi=d, easi=e, fasi=f, gasi=g)
>  md = sqlContext.createDataFrame([metadata_dump]).collect()
>  metadata = sqlContext.createDataFrame(md,['asi', 'basi', 'casi','dasi', 
> 'easi','fasi', 'gasi'])
>  metadata_path = "/folder/fileNameCorr"
>  metadata.write.mode('overwrite').json(metadata_path)
> {"asi":["a1","a2"],"basi":["b1","b2","b3"],"casi":["c1","c2","c3","c4"],"dasi":{"d1":1,"d2":2},"easi":{"e1":1,"e2":2,"e3":3},"fasi":["f1","f2","f3"],"gasi":["g1","g2","g3","g4"]}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance

2018-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23607:
--
Target Version/s:   (was: 2.4.0)
Priority: Minor  (was: Major)
   Fix Version/s: (was: 2.4.0)

> Use HDFS extended attributes to store application summary to improve the 
> Spark History Server performance
> -
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
>
> Currently in the Spark History Server, the checkForLogs thread creates replay 
> tasks for log files whose size has changed. A replay task filters out most of 
> the log file content and keeps the application summary, including the 
> applicationId, user, attempt ACLs, start time, and end time. The application 
> summary data is then written into listing.ldb and serves the application list 
> on the SHS home page. For a long-running application, its log file (whose name 
> ends with "inprogress") gets replayed multiple times to obtain this summary. 
> This wastes computing and data-reading resources in the SHS and delays 
> applications showing up on the home page. Internally we have a patch which 
> uses HDFS extended attributes to improve the performance of getting the 
> application summary in the SHS. With this patch, the driver writes the 
> application summary information into extended attributes as key/value pairs 
> (a sketch follows this description). The SHS first tries to read from the 
> extended attributes; if that fails, it falls back to reading the log file 
> content as usual. The feature can be enabled/disabled through configuration.
> This patch has been running fine internally for 4 months, and the last-updated 
> timestamp on the SHS stays within 1 minute, as we configure the interval to 
> 1 minute. Originally we had delays that could be as long as 30 minutes at our 
> scale, where a large number of Spark applications run per day.
> We want to see whether this kind of approach is also acceptable to the 
> community. Please comment. If so, I will post a pull request for the changes. 
> Thanks.
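
For illustration only, a minimal sketch of the read/write halves described above, 
using the stock Hadoop FileSystem extended-attribute API; the attribute key name 
and JSON encoding are assumptions, not necessarily what the internal patch uses.
{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EventLogSummaryXAttr {
  // Hypothetical attribute key; user-namespace xattrs must start with "user.".
  private static final String SUMMARY_XATTR = "user.spark.appSummary";

  // Writer side (driver): attach the application summary to the event log file.
  public static void writeSummary(FileSystem fs, Path eventLog, String summaryJson)
      throws Exception {
    fs.setXAttr(eventLog, SUMMARY_XATTR, summaryJson.getBytes(StandardCharsets.UTF_8));
  }

  // Reader side (history server): try the extended attribute first and let the
  // caller fall back to replaying the log content if it is missing.
  public static String readSummary(FileSystem fs, Path eventLog) {
    try {
      byte[] raw = fs.getXAttr(eventLog, SUMMARY_XATTR);
      return new String(raw, StandardCharsets.UTF_8);
    } catch (Exception e) {
      return null; // fall back to parsing the log as usual
    }
  }
}
{code}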



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23495) Creating a json file using a dataframe Generates an issue

2018-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23495.
---
Resolution: Invalid

You've just listed some code and output and not described a problem.

> Creating a json file using a dataframe Generates an issue
> -
>
> Key: SPARK-23495
> URL: https://issues.apache.org/jira/browse/SPARK-23495
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: AIT OUFKIR
>Priority: Major
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> The issue happens when trying to create a json file using a dataframe (see 
> code below):
> from pyspark.sql import SQLContext
>  a = ["a1","a2"]
>  b = ["b1","b2","b3"]
>  c = ["c1","c2","c3", "c4"]
>  d = {'d1':1, 'd2':2}
>  e = {'e1':1, 'e2':2, 'e3':3}
>  f = ['f1','f2','f3']
>  g = ['g1','g2','g3','g4']
> metadata_dump = dict(asi=a, basi=b, casi = c, dasi=d, fasi=f, gasi=g, easi=e)
>  md = sqlContext.createDataFrame([metadata_dump]).collect()
>  metadata = sqlContext.createDataFrame(md,['asi', 'basi', 
> 'casi','dasi','fasi', 'gasi', 'easi'])
> metadata_path = "/folder/fileNameErr"
>  metadata.write.mode('overwrite').json(metadata_path)
> {"asi":["a1","a2"],"basi":["b1","b2","b3"],"casi":["c1","c2","c3","c4"],"dasi":{"d1":1,"d2":2},"fasi":{"e1":1,"e2":2,"e3":3},"gasi":["f1","f2","f3"],"easi":["g1","g2","g3","g4"]}
>  
> when switching the position of dictionary e:
>  
> metadata_dump = dict(asi=a, basi=b, casi = c, dasi=d, easi=e, fasi=f, gasi=g)
>  md = sqlContext.createDataFrame([metadata_dump]).collect()
>  metadata = sqlContext.createDataFrame(md,['asi', 'basi', 'casi','dasi', 
> 'easi','fasi', 'gasi'])
>  metadata_path = "/folder/fileNameCorr"
>  metadata.write.mode('overwrite').json(metadata_path)
> {"asi":["a1","a2"],"basi":["b1","b2","b3"],"casi":["c1","c2","c3","c4"],"dasi":{"d1":1,"d2":2},"easi":{"e1":1,"e2":2,"e3":3},"fasi":["f1","f2","f3"],"gasi":["g1","g2","g3","g4"]}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2018-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23499:
--
Affects Version/s: (was: 2.4.0)
 Target Version/s:   (was: 2.2.1, 2.2.2, 2.3.0, 2.3.1)
Fix Version/s: (was: 2.4.0)

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Pascal GILLET
>Priority: Major
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
>
> As for Yarn, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent and overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
> {code:java}
> # Conf on the dispatcher side
> spark.mesos.dispatcher.queue.URGENT=1.0
> # Conf on the driver side
> spark.mesos.dispatcher.queue=URGENT
> spark.mesos.role=URGENT
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21795) Broadcast hint ignored when dataframe is cached

2018-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21795.
---
Resolution: Duplicate

> Broadcast hint ignored when dataframe is cached
> ---
>
> Key: SPARK-21795
> URL: https://issues.apache.org/jira/browse/SPARK-21795
> Project: Spark
>  Issue Type: Question
>  Components: Documentation, SQL
>Affects Versions: 2.2.0
>Reporter: Lior Chaga
>Priority: Minor
>
> Not sure if it's a bug or by design, but if a DF is cached, the broadcast 
> hint is ignored, and Spark uses SortMergeJoin.
> {code}
> val largeDf = ...
> val smallDf = ...
> smallDf.cache()
> largeDf.join(broadcast(smallDf))
> {code}
> It makes sense that there's no need to use cache when using a broadcast join; 
> however, I wonder if it's the correct behavior for Spark to ignore the 
> broadcast hint just because the DF is cached. Consider a case where a DF 
> should be cached for several queries, and in different queries it should be 
> broadcast.
> If this is the correct behavior, it's at least worth documenting that a cached 
> DF cannot be broadcast (a way to check the chosen strategy is sketched below).
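
As a quick way to reproduce or verify the reported behavior on a given Spark 
version, the following self-contained sketch (illustrative data and a column 
named "id") prints the physical plan, which shows whether BroadcastHashJoin or 
SortMergeJoin was selected:
{code:java}
import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BroadcastHintCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("broadcast-hint-check")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> largeDf = spark.range(1_000_000).toDF("id");
    Dataset<Row> smallDf = spark.range(100).toDF("id").cache();
    smallDf.count(); // materialize the cache so the join sees a cached relation

    // explain() prints the chosen join strategy for this plan.
    largeDf.join(broadcast(smallDf), "id").explain();

    spark.stop();
  }
}
{code}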



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23617) Register a Function without params with Spark SQL Java API

2018-03-06 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388954#comment-16388954
 ] 

Hyukjin Kwon commented on SPARK-23617:
--

Is this a duplicate of SPARK-19285? and does this work in Spark 2.3.0?

> Register a Function without params with Spark SQL Java API
> --
>
> Key: SPARK-23617
> URL: https://issues.apache.org/jira/browse/SPARK-23617
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, SQL
>Affects Versions: 2.2.1
>Reporter: Paul Wu
>Priority: Major
>
> One can register a function using Scala:
> {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString)}}
> Now, if I use Java API:
> {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString());}}
> The code does not compile. Define UDF0 for Java API?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-06 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-23547:

Description: 
 

 

when the Hive session is closed, we should also clean up the .pipeout file.

 

 

  was:
!2018-03-01_202415.png!

 

when the hive session closed, we should also cleanup the .pipeout file.

 

 


> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-03-07_121010.png
>
>
>  
>  
> when the hive session closed, we should also cleanup the .pipeout file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-06 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-23547:

Attachment: (was: 2018-03-01_202415.png)

> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-03-07_121010.png
>
>
> !2018-03-01_202415.png!
>  
> when the hive session closed, we should also cleanup the .pipeout file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-06 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-23547:

Attachment: 2018-03-07_121010.png

> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-03-07_121010.png
>
>
> !2018-03-01_202415.png!
>  
> when the hive session closed, we should also cleanup the .pipeout file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23617) Register a Function without params with Spark SQL Java API

2018-03-06 Thread Paul Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Wu resolved SPARK-23617.
-
   Resolution: Duplicate
Fix Version/s: 2.3.0

As commented by Hyukjin Kwon, this is a duplicate and has been fixed in 2.3.0.
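
For reference, a minimal sketch of how a zero-argument UDF can be registered 
through the Java API once the UDF0 interface is available (it was added for 
2.3.0 under SPARK-19285); the return type must be passed explicitly, and the 
session setup here is illustrative:
{code:java}
import java.util.UUID;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF0;
import org.apache.spark.sql.types.DataTypes;

public class Udf0Example {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("udf0-example")
        .master("local[*]")
        .getOrCreate();

    // Zero-argument UDF registered via UDF0; the return DataType is given explicitly.
    spark.udf().register("uuid",
        (UDF0<String>) () -> UUID.randomUUID().toString(),
        DataTypes.StringType);

    spark.sql("SELECT uuid()").show(false);
    spark.stop();
  }
}
{code}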

> Register a Function without params with Spark SQL Java API
> --
>
> Key: SPARK-23617
> URL: https://issues.apache.org/jira/browse/SPARK-23617
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, SQL
>Affects Versions: 2.2.1
>Reporter: Paul Wu
>Priority: Major
> Fix For: 2.3.0
>
>
> One can register a function using Scala:
> {{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString)}}
> Now, if I use Java API:
> {{spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString());}}
> The code does not compile. Define UDF0 for Java API?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.

2018-03-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389135#comment-16389135
 ] 

Reynold Xin commented on SPARK-23325:
-

Yes, perhaps we should do that. It is a lot more work than you might think, 
though, because as Wenchen said we need to properly define the semantics of all 
the data, similar to all of Hadoop IO (Text, etc.) but broader, because we have 
more data types.

I'd probably prefer that we define the columnar format first, since anyone 
going after high performance would probably prefer to use that one...

 

> DataSourceV2 readers should always produce InternalRow.
> ---
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 row-oriented implementations are limited to producing either 
> {{Row}} instances or {{UnsafeRow}} instances by implementing 
> {{SupportsScanUnsafeRow}}. Instead, I think that implementations should 
> always produce {{InternalRow}}.
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither 
> one is appropriate for implementers.
> File formats don't produce {{Row}} instances or the data values used by 
> {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation 
> that uses {{Row}} instances must produce data that is immediately translated 
> from the representation that was just produced by Spark. In my experience, it 
> made little sense to translate a timestamp in microseconds to a 
> (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass 
> that instance to Spark for immediate translation back.
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is 
> already held in memory. Even the Parquet support built into Spark 
> deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce 
> unsafe rows. When I went to build an implementation that deserializes Parquet 
> or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be 
> done without first deserializing into memory because the size of an array 
> must be known before any values are written.
> I ended up deciding to deserialize to {{InternalRow}} and use 
> {{UnsafeProjection}} to convert to unsafe. There are two problems with this: 
> first, this is Scala and was difficult to call from Java (it required 
> reflection), and second, this causes double projection in the physical plan 
> (a copy for unsafe to unsafe) if there is a projection that wasn't fully 
> pushed to the data source.
> I think the solution is to have a single interface for readers that expects 
> {{InternalRow}}. Then, a projection should be added in the Spark plan to 
> convert to unsafe and avoid projection in the plan and in the data source. If 
> the data source already produces unsafe rows by deserializing directly, this 
> still minimizes the number of copies because the unsafe projection will check 
> whether the incoming data is already {{UnsafeRow}}.
> Using {{InternalRow}} would also match the interface on the write side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23593) Add interpreted execution for InitializeJavaBean expression

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23593:


Assignee: Apache Spark

> Add interpreted execution for InitializeJavaBean expression
> ---
>
> Key: SPARK-23593
> URL: https://issues.apache.org/jira/browse/SPARK-23593
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23593) Add interpreted execution for InitializeJavaBean expression

2018-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388938#comment-16388938
 ] 

Apache Spark commented on SPARK-23593:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20756

> Add interpreted execution for InitializeJavaBean expression
> ---
>
> Key: SPARK-23593
> URL: https://issues.apache.org/jira/browse/SPARK-23593
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23593) Add interpreted execution for InitializeJavaBean expression

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23593:


Assignee: (was: Apache Spark)

> Add interpreted execution for InitializeJavaBean expression
> ---
>
> Key: SPARK-23593
> URL: https://issues.apache.org/jira/browse/SPARK-23593
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23595) Add interpreted execution for ValidateExternalType expression

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23595:


Assignee: (was: Apache Spark)

> Add interpreted execution for ValidateExternalType expression
> -
>
> Key: SPARK-23595
> URL: https://issues.apache.org/jira/browse/SPARK-23595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression

2018-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389058#comment-16389058
 ] 

Apache Spark commented on SPARK-23595:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20757

> Add interpreted execution for ValidateExternalType expression
> -
>
> Key: SPARK-23595
> URL: https://issues.apache.org/jira/browse/SPARK-23595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23595) Add interpreted execution for ValidateExternalType expression

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23595:


Assignee: Apache Spark

> Add interpreted execution for ValidateExternalType expression
> -
>
> Key: SPARK-23595
> URL: https://issues.apache.org/jira/browse/SPARK-23595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23618) docker-image-tool.sh Fails While Building Image

2018-03-06 Thread Ninad Ingole (JIRA)
Ninad Ingole created SPARK-23618:


 Summary: docker-image-tool.sh Fails While Building Image
 Key: SPARK-23618
 URL: https://issues.apache.org/jira/browse/SPARK-23618
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Ninad Ingole


I am trying to build the Kubernetes image for version 2.3.0 using
{code:java}
./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build
{code}
which fails with the following docker build error:
{code:java}
"docker build" requires exactly 1 argument.
See 'docker build --help'.
Usage: docker build [OPTIONS] PATH | URL | - [flags]
Build an image from a Dockerfile
{code}
 

I am executing the command from within the Spark distribution directory. Please let me 
know what the issue is.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23617) Register a Function without params with Spark SQL Java API

2018-03-06 Thread Paul Wu (JIRA)
Paul Wu created SPARK-23617:
---

 Summary: Register a Function without params with Spark SQL Java API
 Key: SPARK-23617
 URL: https://issues.apache.org/jira/browse/SPARK-23617
 Project: Spark
  Issue Type: Improvement
  Components: Java API, SQL
Affects Versions: 2.2.1
Reporter: Paul Wu


One can register a function using Scala:

{{spark.udf.register("uuid", ()=>java.util.UUID.randomUUID.toString) }}

Now, if I use Java API:

{{ spark.udf().register("uuid", ()=>java.util.UUID.randomUUID().toString()); }}

The code does not compile. Define UDF0 for Java API?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23266) Matrix Inversion on BlockMatrix

2018-03-06 Thread Chandan Misra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389061#comment-16389061
 ] 

Chandan Misra commented on SPARK-23266:
---

I have implemented matrix inversion using Spark version 2.2.0, though the 
implementation can be run on Spark 2.0.0 onwards. It would be really helpful if 
inversion were added in the next Spark version. As already mentioned, I have an 
implementation of the inversion and am happy to contribute.

> Matrix Inversion on BlockMatrix
> ---
>
> Key: SPARK-23266
> URL: https://issues.apache.org/jira/browse/SPARK-23266
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Chandan Misra
>Priority: Minor
>
> Matrix inversion is the basic building block for many other algorithms like 
> regression, classification, geostatistical analysis using ordinary kriging 
> etc. A simple Spark BlockMatrix based efficient distributed 
> divide-and-conquer algorithm can be implemented using only *6* 
> multiplications in each recursion level of the algorithm. The reference paper 
> can be found in
> [https://arxiv.org/abs/1801.04723]
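
For context, the standard Schur-complement block-inversion identity that such 
divide-and-conquer schemes typically build on is sketched below; the paper's 
exact recursion may organize the work differently.
{noformat}
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix},  S = D - C A^{-1} B  (Schur complement of A)

M^{-1} = \begin{pmatrix}
  A^{-1} + A^{-1} B S^{-1} C A^{-1}  &  -A^{-1} B S^{-1} \\
  -S^{-1} C A^{-1}                   &  S^{-1}
\end{pmatrix}
{noformat}
With T1 = A^{-1}B and T2 = C A^{-1}, the block products C*T1 (inside S), T1*S^{-1}, 
(T1 S^{-1})*T2 and S^{-1}*T2 account for the six block multiplications per recursion 
level, while A^{-1} and S^{-1} are obtained recursively.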



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-06 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-23547:

Attachment: 2018-03-07_121010.png

> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-03-07_121010.png
>
>
>  
>  
> when the hive session closed, we should also cleanup the .pipeout file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-06 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-23547:

Attachment: (was: 2018-03-07_121010.png)

> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-03-07_121010.png
>
>
>  
>  
> when the hive session closed, we should also cleanup the .pipeout file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23547) Cleanup the .pipeout file when the Hive Session closed

2018-03-06 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-23547:

Description: 
  !2018-03-07_121010.png!

when the Hive session is closed, we should also clean up the .pipeout file.

 

 

  was:
 

 

when the hive session closed, we should also cleanup the .pipeout file.

 

 


> Cleanup the .pipeout file when the Hive Session closed
> --
>
> Key: SPARK-23547
> URL: https://issues.apache.org/jira/browse/SPARK-23547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zuotingbing
>Priority: Major
> Attachments: 2018-03-07_121010.png
>
>
>   !2018-03-07_121010.png!
> when the hive session closed, we should also cleanup the .pipeout file.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression

2018-03-06 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388114#comment-16388114
 ] 

Takeshi Yamamuro commented on SPARK-23595:
--

[~DylanGuedes] oh, I'm already working on it. But if you want to take this over 
for practice, I'm OK with leaving it to you (since I have some other pending 
tickets). My incomplete work is here: 
https://github.com/apache/spark/compare/master...maropu:SPARK-23595

> Add interpreted execution for ValidateExternalType expression
> -
>
> Key: SPARK-23595
> URL: https://issues.apache.org/jira/browse/SPARK-23595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23559) add epoch ID to data writer factory

2018-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388122#comment-16388122
 ] 

Apache Spark commented on SPARK-23559:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20752

> add epoch ID to data writer factory
> ---
>
> Key: SPARK-23559
> URL: https://issues.apache.org/jira/browse/SPARK-23559
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>
> To support the StreamWriter lifecycle described in SPARK-22910, epoch ID has 
> to be specifiable at DataWriter creation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression

2018-03-06 Thread Dylan Guedes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388136#comment-16388136
 ] 

Dylan Guedes commented on SPARK-23595:
--

[~maropu] I checked your progress, and it looks like you are almost finished, 
so it is fine. In any case, your solution was very enlightening, thank you!

> Add interpreted execution for ValidateExternalType expression
> -
>
> Key: SPARK-23595
> URL: https://issues.apache.org/jira/browse/SPARK-23595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23610) Cast of ArrayType of NullType to ArrayType of nullable material type does not work

2018-03-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-23610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Kießling updated SPARK-23610:
-
Description: 
Given a DataFrame that contains a column with _ArrayType of NullType_, casting 
this column into an ArrayType of any material nullable type (e.g. 
_ArrayType(LongType, true)_) should be possible.
{code}
it("can cast arrays of null type into arrays of nullable material types") {
  val inputData = Seq(
Row(Array())
  ).asJava

  val schema = StructType(Seq(
StructField("list", ArrayType(NullType, true), false)
  ))

  val data = caps.sparkSession.createDataFrame(inputData, schema)

  data.withColumn("longList",data.col("list").cast(ArrayType(LongType, 
true))).show
}
{code}

This test fails with the message: 

{noformat}
NullType (of class org.apache.spark.sql.types.NullType$)
 scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)
 at org.apache.spark.sql.catalyst.expressions.Cast.castToLong(Cast.scala:310)
 at 
org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:516)
 at org.apache.spark.sql.catalyst.expressions.Cast.castArray(Cast.scala:455)
 at 
org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:519)
 at 
org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:531)
 at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:531)
 at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
 at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
{noformat}


  was:
Given a DataFrame that contains a column with _ArrayType of NullType_
casting this column into ArrayType of any material nullable type (e.g. 
_ArrayType(LongType, true)_ ) should be possible.
{code}
it("can cast arrays of null type into arrays of nullable material types") {
 val inputData = Seq(
 Row(Array())
 ).asJava

val schema = StructType(Seq(
 StructField("list", ArrayType(NullType, true), false)
 ))

val data = caps.sparkSession.createDataFrame(inputData, schema)

data.withColumn("longList",data.col("list").cast(ArrayType(LongType, 
true))).show
 }
{code}

This test fails with the message: 

{noformat}
NullType (of class org.apache.spark.sql.types.NullType$)
 scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)
 at org.apache.spark.sql.catalyst.expressions.Cast.castToLong(Cast.scala:310)
 at 
org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:516)
 at org.apache.spark.sql.catalyst.expressions.Cast.castArray(Cast.scala:455)
 at 
org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:519)
 at 
org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:531)
 at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:531)
 at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
 at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
{noformat}



> Cast of ArrayType of NullType to ArrayType of nullable material type does not 
> work
> --
>
> Key: SPARK-23610
> URL: https://issues.apache.org/jira/browse/SPARK-23610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Max Kießling
>Priority: Minor
>
> Given a DataFrame that contains a column with _ArrayType of NullType_
> casting this column into ArrayType of any material nullable type (e.g. 
> _ArrayType(LongType, true)_ ) should be possible.
> {code}
> it("can cast arrays of null type into arrays of nullable material types") {
>   val inputData = Seq(
> Row(Array())
>   ).asJava
>   val schema = StructType(Seq(
> StructField("list", ArrayType(NullType, true), false)
>   ))
>   val data = caps.sparkSession.createDataFrame(inputData, schema)
>   data.withColumn("longList",data.col("list").cast(ArrayType(LongType, 
> true))).show
> }
> {code}
> This test fails with the message: 
> {noformat}
> NullType (of class org.apache.spark.sql.types.NullType$)
>  scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)
>  at org.apache.spark.sql.catalyst.expressions.Cast.castToLong(Cast.scala:310)
>  at 
> org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:516)
>  at org.apache.spark.sql.catalyst.expressions.Cast.castArray(Cast.scala:455)
>  at 
> org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:519)
>  at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:531)
>  at 

[jira] [Commented] (SPARK-23537) Logistic Regression without standardization

2018-03-06 Thread Jordi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387587#comment-16387587
 ] 

Jordi commented on SPARK-23537:
---

[~Teng Peng] we don't need standardization for L-BFGS, but it's recommended 
since it improves convergence. I've been checking the code and found some 
excerpts that I don't fully understand. I added comments hoping that the 
developer can clarify them:

[https://github.com/apache/spark/pull/7080/files#diff-3734f1689cb8a80b07974eb93de0795dR588]

[https://github.com/apache/spark/pull/5967/files#diff-3734f1689cb8a80b07974eb93de0795dR201]

 

> Logistic Regression without standardization
> ---
>
> Key: SPARK-23537
> URL: https://issues.apache.org/jira/browse/SPARK-23537
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.2.1
>Reporter: Jordi
>Priority: Major
> Attachments: non-standardization.log, standardization.log
>
>
> I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer 
> to not use standardization since all my features are binary, using the 
> hashing trick (2^20 sparse vector).
> I trained two models to compare results. I expected to end up with two 
> similar models, since it seems that internally the optimizer performs 
> standardization and "de-standardization" (when it's deactivated) in order to 
> improve the convergence.
> Here you have the code I used:
> {code:java}
> val lr = new org.apache.spark.ml.classification.LogisticRegression()
> .setRegParam(0.05)
> .setElasticNetParam(0.0)
> .setFitIntercept(true)
> .setMaxIter(5000)
> .setStandardization(false)
> val model = lr.fit(data)
> {code}
> The results are disturbing me: I end up with two significantly different models.
> *Standardization:*
> Training time: 8min.
> Iterations: 37
> Intercept: -4.386090107224499
> Max weight: 4.724752299455218
> Min weight: -3.560570478164854
> Mean weight: -0.049325201841722795
> l1 norm: 116710.39522171849
> l2 norm: 402.2581552373957
> Non zero weights: 128084
> Non zero ratio: 0.12215042114257812
> Last 10 LBFGS Val and Grad Norms:
> {code:java}
> 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
> 0.000559057
> 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
> 0.000267527
> 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
> 0.000205888
> 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
> 0.000144173
> 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
> 0.000140296
> 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
> 0.000122709
> 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
> 3.08789e-05
> 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
> 2.23806e-05
> 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
> 1.47422e-05
> 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
> 2.37442e-05
> {code}
> *No standardization:*
> Training time: 7h 14 min.
> Iterations: 4992
> Intercept: -4.216690468849263
> Max weight: 0.41930559767624725
> Min weight: -0.5949182537565524
> Mean weight: -1.2659769019012E-6
> l1 norm: 14.262025330648694
> l2 norm: 1.2508777025612263
> Non zero weights: 128955
> Non zero ratio: 0.12298107147216797
> Last 10 LBFGS Val and Grad Norms:
> {code:java}
> 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
> 0.217581
> 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
> 0.185812
> 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
> 0.214570
> 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
> 0.489464
> 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
> 0.178448
> 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
> 0.172527
> 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.38e-07) 
> 0.189389
> 18/02/28 00:29:31 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.13e-07) 
> 0.480678
> 18/02/28 00:29:36 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.75e-07) 
> 0.184529
> 18/02/28 00:29:41 INFO LBFGS: Val and Grad Norm: 0.559319 (rel: 8.90e-08) 
> 0.154329
> {code}
> Am I missing something?
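
One way to see why the two settings can legitimately differ once regParam > 0 
(a sketch; the intercept and elastic-net terms of Spark's actual objective are 
omitted): without regularization the two fits are related by a per-feature 
rescaling, but an L2 penalty applied before and after that rescaling penalizes 
different quantities, so different optima are expected.
{noformat}
no regularization:       w_j^{orig} = w_j^{std} / \sigma_j   (same decision function after rescaling)

penalty, standardized:   \lambda \sum_j (w_j^{std})^2  = \lambda \sum_j \sigma_j^2 (w_j^{orig})^2
penalty, unstandardized: \lambda \sum_j (w_j^{orig})^2
{noformat}
Whenever the feature scales \sigma_j are not all equal, the two penalties differ, 
so the two optimizations converge to genuinely different models.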



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23611) Extend ExpressionEvalHelper harness to also test failures

2018-03-06 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-23611:
-

 Summary: Extend ExpressionEvalHelper harness to also test failures
 Key: SPARK-23611
 URL: https://issues.apache.org/jira/browse/SPARK-23611
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Herman van Hovell






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22450) Safely register class for mllib

2018-03-06 Thread Richard Wilkinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387630#comment-16387630
 ] 

Richard Wilkinson commented on SPARK-22450:
---

Just as an FYI, the change to 
org.apache.spark.serializer.KryoSerializer#newKryo from (I think) this ticket 
is a performance hit compared to the one in 2.2.1. I am calling 
org.apache.spark.serializer.KryoSerializer#newInstance a lot, which is probably 
an issue in itself (hence not raising a bug report), but I'm not aware of how 
much this is called internally by Spark. I do not have the ML jars on my 
classpath.

> Safely register class for mllib
> ---
>
> Key: SPARK-22450
> URL: https://issues.apache.org/jira/browse/SPARK-22450
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>Assignee: Xianyang Liu
>Priority: Major
> Fix For: 2.3.0
>
>
> There are still some algorithms based on mllib, such as KMeans.  For now, 
> many mllib common class (such as: Vector, DenseVector, SparseVector, Matrix, 
> DenseMatrix, SparseMatrix) are not registered in Kryo. So there are some 
> performance issues for those object serialization or deserialization.
> Previously dicussed: https://github.com/apache/spark/pull/19586



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2018-03-06 Thread imran shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387969#comment-16387969
 ] 

imran shaik commented on SPARK-18492:
-

truetradescast schema
root
 |-- Event_Time: long (nullable = true)
 |-- Symbol: string (nullable = true)
 |-- Kline_Start_Time: long (nullable = true)
 |-- Kline_Close_Time: long (nullable = true)
 |-- Open_Price: float (nullable = true)
 |-- Close_Price: float (nullable = true)
 |-- High_Price: float (nullable = true)
 |-- Low_Price: float (nullable = true)
 |-- Base_Asset_Volume: float (nullable = true)
 |-- Number_Of_Trades: long (nullable = true)
 |-- TimeStamp: timestamp (nullable = true)

Can you solve this asap?


> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>Priority: Major
> Attachments: Screenshot from 2018-03-02 12-57-51.png
>
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for code for a method is exceeded is 
> because the code generator is using an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> 

[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-03-06 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387963#comment-16387963
 ] 

Thomas Graves commented on SPARK-22683:
---

I left comments on the open PR already, lets move the discussion there

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands of splits in the data 
> partitioning and between 400 and 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance, and the resulting long-lived executors could become 
> contention points for other executors trying to remotely access blocks on them 
> (not witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881
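To make the arithmetic of the proposal concrete, a hypothetical sketch (not the actual patch in the PR above) of how a tasks-per-slot knob would change the executor target computed by dynamic allocation:

{code:java}
// Hypothetical sketch only; names and defaults are illustrative, not Spark's.
object TargetExecutorsSketch {
  def target(pendingPlusRunningTasks: Int,
             executorCores: Int,
             taskCpus: Int,
             tasksPerSlot: Int): Int = {
    val slotsPerExecutor = executorCores / taskCpus
    math.ceil(pendingPlusRunningTasks.toDouble / (tasksPerSlot * slotsPerExecutor)).toInt
  }
}

// e.g. 10000 pending tasks with 5 cores and 1 cpu per task:
//   tasksPerSlot = 1 -> 2000 executors requested, tasksPerSlot = 6 -> 334
{code}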



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23591) Add interpreted execution for EncodeUsingSerializer expression

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23591:


Assignee: (was: Apache Spark)

> Add interpreted execution for EncodeUsingSerializer expression
> --
>
> Key: SPARK-23591
> URL: https://issues.apache.org/jira/browse/SPARK-23591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23591) Add interpreted execution for EncodeUsingSerializer expression

2018-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387998#comment-16387998
 ] 

Apache Spark commented on SPARK-23591:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20751

> Add interpreted execution for EncodeUsingSerializer expression
> --
>
> Key: SPARK-23591
> URL: https://issues.apache.org/jira/browse/SPARK-23591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23591) Add interpreted execution for EncodeUsingSerializer expression

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23591:


Assignee: Apache Spark

> Add interpreted execution for EncodeUsingSerializer expression
> --
>
> Key: SPARK-23591
> URL: https://issues.apache.org/jira/browse/SPARK-23591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-03-06 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388014#comment-16388014
 ] 

Darek commented on SPARK-18673:
---

The HIVE tickets are already closed; can we close this ticket?

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression

2018-03-06 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387864#comment-16387864
 ] 

Herman van Hovell commented on SPARK-23595:
---

[~DylanGuedes] feel free to pick this up. I think it is a good starter task, 
since it is relatively self-contained and very well testable. If you need some 
inspiration, just take a look at the approach taken by other tickets in the 
umbrella. Let me know if you need help.
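For orientation, a rough, self-contained sketch of what the interpreted path essentially has to do (placeholder names only, not Spark's actual code):

{code:java}
// Placeholder sketch, not Spark internals: an interpreted ValidateExternalType
// checks that the incoming external object has the JVM type the schema expects,
// and otherwise fails with an error similar to what the generated code throws.
object InterpretedValidateExternalTypeSketch {
  def validate(value: Any, expected: Class[_], dataTypeName: String): Any =
    if (value == null || expected.isInstance(value)) value
    else throw new RuntimeException(
      s"${value.getClass.getName} is not a valid external type for schema of $dataTypeName")
}
{code}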

> Add interpreted execution for ValidateExternalType expression
> -
>
> Key: SPARK-23595
> URL: https://issues.apache.org/jira/browse/SPARK-23595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23581) Add an interpreted version of GenerateUnsafeProjection

2018-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387872#comment-16387872
 ] 

Apache Spark commented on SPARK-23581:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/20750

> Add an interpreted version of GenerateUnsafeProjection
> --
>
> Key: SPARK-23581
> URL: https://issues.apache.org/jira/browse/SPARK-23581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
>
> GenerateUnsafeProjection should have an interpreted cousin. See the parent 
> ticket for the motivation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23601) Remove .md5 files from release

2018-03-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23601.
---
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.1

Resolved by https://github.com/apache/spark/pull/20737

> Remove .md5 files from release
> --
>
> Key: SPARK-23601
> URL: https://issues.apache.org/jira/browse/SPARK-23601
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> Per email from Henk to PMCs:
> {code}
>The Release Distribution Policy[1] changed regarding checksum files.
> See under "Cryptographic Signatures and Checksums Requirements" [2].
>   MD5-file == a .md5 file
>   SHA-file == a .sha1, sha256 or .sha512 file
>Old policy :
>   -- MUST provide a MD5-file
>   -- SHOULD provide a SHA-file [SHA-512 recommended]
>New policy :
>   -- MUST provide a SHA- or MD5-file
>   -- SHOULD provide a SHA-file
>   -- SHOULD NOT provide a MD5-file
>   Providing MD5 checksum files is now discouraged for new releases,
>   but still allowed for past releases.
>Why this change :
>   -- MD5 is broken for many purposes ; we should move away from it.
>  https://en.wikipedia.org/wiki/MD5#Overview_of_security_issues
>Impact for PMCs :
>   -- for new releases :
>  -- please do provide a SHA-file (one or more, if you like)
>  -- do NOT provide a MD5-file
>   -- for past releases :
>  -- you are not required to change anything
>  -- for artifacts accompanied by a SHA-file /and/ a MD5-file,
> it would be nice if you removed the MD5-file
>   -- if, at the moment, you provide MD5-files,
>  please adjust your release tooling.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-03-06 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387938#comment-16387938
 ] 

Thomas Graves commented on SPARK-22683:
---

[~jcuquemelle] do you have time to update the PR? Otherwise we should close 
it for now.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands of splits in the data 
> partitioning and between 400 and 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance, and the resulting long-lived executors could become 
> contention points for other executors trying to remotely access blocks on them 
> (not witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-03-06 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387947#comment-16387947
 ] 

Julien Cuquemelle commented on SPARK-22683:
---

Yes, I have time.
I was waiting for suggestions for the parameter name.

How about spark.dynamicAllocation.fullParallelismDivisor (if we agree that the 
parameter can be a double)?

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands of splits in the data 
> partitioning and between 400 and 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance, and the resulting long-lived executors could become 
> contention points for other executors trying to remotely access blocks on them 
> (not witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-06 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388017#comment-16388017
 ] 

Darek commented on SPARK-23534:
---

SPARK-18673 should be closed since HIVE-15016 and HIVE-18550 are closed.

There should be no blockers at this point.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23595) Add interpreted execution for ValidateExternalType expression

2018-03-06 Thread Dylan Guedes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387853#comment-16387853
 ] 

Dylan Guedes commented on SPARK-23595:
--

Hi,

I would like to help with this issue, but since I am a newcomer I am not sure 
if it is a good way to start (maybe it is too hard and I don't want to be a 
bottleneck). I started reading code of the related issues, it is similar? What 
do you guys think?

Thank you!

> Add interpreted execution for ValidateExternalType expression
> -
>
> Key: SPARK-23595
> URL: https://issues.apache.org/jira/browse/SPARK-23595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2018-03-06 Thread Xiaoju Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387873#comment-16387873
 ] 

Xiaoju Wu commented on SPARK-17495:
---

[~tejasp] I can see HiveHash merged but never used. It seems the choice between 
the Spark and Hive hash is still under discussion; is there any update on this topic?

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.

2018-03-06 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387846#comment-16387846
 ] 

Thomas Graves commented on SPARK-16630:
---

Yes, something along these lines is what I was thinking. We would want a 
configurable number of failures (perhaps we can reuse one of the existing 
settings, but we would need to think about that more) at which point we 
blacklist the node due to executor launch failures, and we could have a timeout 
after which we retry. We also want to take small clusters into account and 
perhaps stop blacklisting if a certain percent of the cluster is already 
blacklisted.
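A rough, self-contained sketch of the kind of tracking described above (names, thresholds, and structure are assumptions, not a design):

{code:java}
// Assumed-name sketch: count executor launch failures per node, blacklist a node
// after a configurable number of failures, expire entries after a timeout, and
// never blacklist more than a fraction of the cluster.
class LaunchFailureTrackerSketch(maxFailures: Int,
                                 blacklistTimeoutMs: Long,
                                 maxBlacklistFraction: Double,
                                 clusterSize: () => Int) {
  private val failures = scala.collection.mutable.Map.empty[String, Int]
  private val blacklistedUntil = scala.collection.mutable.Map.empty[String, Long]

  def onLaunchFailure(node: String, now: Long): Unit = {
    val n = failures.getOrElse(node, 0) + 1
    failures(node) = n
    val limit = (clusterSize() * maxBlacklistFraction).toInt
    if (n >= maxFailures && activeBlacklist(now).size < limit)
      blacklistedUntil(node) = now + blacklistTimeoutMs
  }

  def activeBlacklist(now: Long): Set[String] =
    blacklistedUntil.collect { case (node, until) if until > now => node }.toSet
}
{code}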

> Blacklist a node if executors won't launch on it.
> -
>
> Key: SPARK-16630
> URL: https://issues.apache.org/jira/browse/SPARK-16630
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Priority: Major
>
> On YARN, it's possible that a node is messed up or misconfigured such that a 
> container won't launch on it, for instance if the Spark external shuffle 
> handler didn't get loaded on it, or maybe it's just some other hardware or 
> Hadoop configuration issue.
> It would be nice if we could recognize this happening and stop trying to launch 
> executors on it, since that could end up causing us to hit our max number of 
> executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23595) Add interpreted execution for ValidateExternalType expression

2018-03-06 Thread Dylan Guedes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387853#comment-16387853
 ] 

Dylan Guedes edited comment on SPARK-23595 at 3/6/18 2:24 PM:
--

Hi,

I would like to help with this issue, but since I am a newcomer I am not sure 
if it is a good way to start (maybe it is too hard and I don't want to be a 
bottleneck). I started reading code of the related issues, is this one similar? 
What do you guys think?

Thank you!


was (Author: dylanguedes):
Hi,

I would like to help with this issue, but since I am a newcomer I am not sure 
if it is a good way to start (maybe it is too hard and I don't want to be a 
bottleneck). I started reading code of the related issues, it is similar? What 
do you guys think?

Thank you!

> Add interpreted execution for ValidateExternalType expression
> -
>
> Key: SPARK-23595
> URL: https://issues.apache.org/jira/browse/SPARK-23595
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21048) Add an option --merged-properties-file to distinguish the configuration loading behavior

2018-03-06 Thread Yaqub Alwan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388045#comment-16388045
 ] 

Yaqub Alwan commented on SPARK-21048:
-

Commenting here because SPARK-21023 was closed. Contrary to the remark "I 
think this is confusing relative to any value it adds", I have also found the 
behaviour of completely ignoring spark-defaults.conf when a properties file is 
supplied to be counterintuitive (if I am being honest, I would say I find it 
obnoxious, as they're hardly defaults if they get ignored when not explicitly 
overridden). I would also like to see _some_ solution to this problem, as an 
application developer doesn't want or need to know about cluster-level 
configuration just to set some application-level properties.

I would prefer to see something like --merge-properties-with-defaults in 
combination with --properties-file, but any implementation works. I am a little 
concerned to see the resistance to having this implemented, when the current 
behaviour is not intuitive.
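To make the requested behaviour concrete, a minimal sketch of the merge semantics being asked for (paths and the helper name are illustrative; this is not what spark-submit currently does):

{code:java}
import java.io.FileInputStream
import java.util.Properties

// Hedged sketch of merged semantics: load cluster defaults first, then overlay
// the user-supplied file so only explicitly set keys are overridden.
def mergedProperties(defaultsPath: String, userPath: String): Properties = {
  val merged = new Properties()
  val defaultsIn = new FileInputStream(defaultsPath)
  try merged.load(defaultsIn) finally defaultsIn.close()
  val userIn = new FileInputStream(userPath)
  try merged.load(userIn) finally userIn.close()   // later load wins per key
  merged
}
{code}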

> Add an option --merged-properties-file to distinguish the configuration 
> loading behavior
> 
>
> Key: SPARK-21048
> URL: https://issues.apache.org/jira/browse/SPARK-21048
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The problem description is the same as 
> [SPARK-21023|https://issues.apache.org/jira/browse/SPARK-21023], but this 
> ticket differs from it: the purpose is not to make sure the default 
> properties file is always loaded, but to offer another option so that 
> users can choose what they want.
> {quote}
> {{\-\-properties-file}} a user-specified properties file that replaces the 
> default properties file; deprecated.
> {{\-\-replaced-properties-file}} a new option equivalent to 
> {{\-\-properties-file}}, but with a friendlier name.
> {{\-\-merged-properties-file}} a user-specified properties file that is 
> merged with the default properties file.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23612) Specify formats for individual DateType and TimestampType columns in schemas

2018-03-06 Thread Patrick Young (JIRA)
Patrick Young created SPARK-23612:
-

 Summary: Specify formats for individual DateType and TimestampType 
columns in schemas
 Key: SPARK-23612
 URL: https://issues.apache.org/jira/browse/SPARK-23612
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Patrick Young


[https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200]

It would be very helpful if it were possible to specify the format for 
individual columns in a schema when reading csv files, rather than one format:

{code:title=Bar.python|borderStyle=solid}

# Currently can only do something like:

spark.read.option("**dateFormat", "MMdd").csv(...) 

# Would like to be able to do something like:

schema = StructType([

    StructField("date1", DateType(format="MM/dd/"), True),

    StructField("date2", DateType(format="MMdd"), True)

]

read.schema(schema).csv(...)

{{{code}}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23612) Specify formats for individual DateType and TimestampType columns in schemas

2018-03-06 Thread Patrick Young (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Young updated SPARK-23612:
--
Description: 
[https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200]

It would be very helpful if it were possible to specify the format for 
individual columns in a schema when reading csv files, rather than one format:
{code:java|title=Bar.python|borderStyle=solid}
# Currently can only do something like:

spark.read.option("dateFormat", "MMdd").csv(...) 

# Would like to be able to do something like:

schema = StructType([

    StructField("date1", DateType(format="MM/dd/"), True),

    StructField("date2", DateType(format="MMdd"), True)

]

read.schema(schema).csv(...)

{code}
Thanks for any help, input!

  was:
[https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200]

It would be very helpful if it were possible to specify the format for 
individual columns in a schema when reading csv files, rather than one format:

{code:title=Bar.python|borderStyle=solid}

# Currently can only do something like:

spark.read.option("**dateFormat", "MMdd").csv(...) 

# Would like to be able to do something like:

schema = StructType([

    StructField("date1", DateType(format="MM/dd/"), True),

    StructField("date2", DateType(format="MMdd"), True)

]

read.schema(schema).csv(...)

{{{code}}}


> Specify formats for individual DateType and TimestampType columns in schemas
> 
>
> Key: SPARK-23612
> URL: https://issues.apache.org/jira/browse/SPARK-23612
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Patrick Young
>Priority: Minor
>  Labels: DataType, date, sql
>
> [https://github.com/apache/spark/blob/407f67249639709c40c46917700ed6dd736daa7d/python/pyspark/sql/types.py#L162-L200]
> It would be very helpful if it were possible to specify the format for 
> individual columns in a schema when reading csv files, rather than one format:
> {code:java|title=Bar.python|borderStyle=solid}
> # Currently can only do something like:
> spark.read.option("dateFormat", "MMdd").csv(...) 
> # Would like to be able to do something like:
> schema = StructType([
>     StructField("date1", DateType(format="MM/dd/"), True),
>     StructField("date2", DateType(format="MMdd"), True)
> ]
> read.schema(schema).csv(...)
> {code}
> Thanks for any help, input!
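Until per-column formats exist, a common workaround, shown as a hedged sketch (assumes Spark 2.2+, a SparkSession named spark, and illustrative column names, formats, and path), is to read the date columns as strings and convert each one with to_date:

{code:java}
import org.apache.spark.sql.functions.to_date

// Hedged workaround sketch: parse each date column with its own format after the read.
val raw = spark.read.option("header", "true").csv("/path/to/input.csv")  // illustrative path
import spark.implicits._
val parsed = raw
  .withColumn("date1", to_date($"date1", "MM/dd/yyyy"))
  .withColumn("date2", to_date($"date2", "yyyyMMdd"))
{code}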



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23591) Add interpreted execution for EncodeUsingSerializer expression

2018-03-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-23591:
-

Assignee: Marco Gaido

> Add interpreted execution for EncodeUsingSerializer expression
> --
>
> Key: SPARK-23591
> URL: https://issues.apache.org/jira/browse/SPARK-23591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Marco Gaido
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23590) Add interpreted execution for CreateExternalRow expression

2018-03-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23590.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> Add interpreted execution for CreateExternalRow expression
> --
>
> Key: SPARK-23590
> URL: https://issues.apache.org/jira/browse/SPARK-23590
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure

2018-03-06 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388088#comment-16388088
 ] 

Herman van Hovell commented on SPARK-23346:
---

[~wuzhilon88] This is completely un-actionable. Describe what you are doing 
here, and add a reproduction. Otherwise I will close the ticket.

> Failed tasks reported as success if the failure reason is not ExceptionFailure
> --
>
> Key: SPARK-23346
> URL: https://issues.apache.org/jira/browse/SPARK-23346
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0
>Reporter: 吴志龙
>Priority: Critical
> Attachments: 企业微信截图_15179714603307.png, 企业微信截图_15179715023606.png
>
>
>  !企业微信截图_15179715023606.png!  !企业微信截图_15179714603307.png! We have many other 
> failure reasons, such as TaskResultLost, but the status is success. In the web 
> UI, we count non-ExceptionFailure failures as successful tasks, which is 
> highly misleading.
> detail message:
> Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most 
> recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor 
> 27): TaskResultLost (result lost from block manager)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23592) Add interpreted execution for DecodeUsingSerializer expression

2018-03-06 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388075#comment-16388075
 ] 

Marco Gaido commented on SPARK-23592:
-

I will submit a PR as soon as SPARK-23591 gets merged, thanks

> Add interpreted execution for DecodeUsingSerializer expression
> --
>
> Key: SPARK-23592
> URL: https://issues.apache.org/jira/browse/SPARK-23592
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23592) Add interpreted execution for DecodeUsingSerializer expression

2018-03-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-23592:
-

Assignee: Marco Gaido

> Add interpreted execution for DecodeUsingSerializer expression
> --
>
> Key: SPARK-23592
> URL: https://issues.apache.org/jira/browse/SPARK-23592
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Marco Gaido
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections

2018-03-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-23580:
--
Labels: release-notes  (was: )

> Interpreted mode fallback should be implemented for all expressions & 
> projections
> -
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and 
> projections. This is a problem for scenarios where code generation does not 
> work or blows past the JVM class limits; we currently cannot gracefully fall 
> back.
> This ticket is an umbrella to fix this class of problem in Spark SQL. The 
> work can be divided into two main areas:
> - Add interpreted versions for all dataset related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.
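As a self-contained illustration of the fallback pattern this umbrella enables (placeholder names only, not Spark internals):

{code:java}
// Placeholder sketch: prefer the code-generated path, fall back to an
// interpreted one when code generation fails (e.g. a generated method grows
// beyond the JVM's 64 KB limit).
object CodegenFallbackSketch {
  final class CodegenFailed(msg: String) extends RuntimeException(msg)

  // stand-ins for "compile generated code" and "walk the expression tree"
  def compileGenerated(exprCount: Int): Seq[Int] => Seq[Int] =
    if (exprCount > 1000) throw new CodegenFailed("Code grows beyond 64 KB")
    else row => row.map(_ + 1)

  def interpreted: Seq[Int] => Seq[Int] = row => row.map(_ + 1)

  def createProjection(exprCount: Int): Seq[Int] => Seq[Int] =
    try compileGenerated(exprCount)
    catch { case _: CodegenFailed => interpreted }
}
{code}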



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23609) Test code does not conform to the test title

2018-03-06 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23609:
-

 Summary: Test code does not conform to the test title
 Key: SPARK-23609
 URL: https://issues.apache.org/jira/browse/SPARK-23609
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.4.0
Reporter: caoxuewen


Currently, in the EnsureRequirements test cases covering elimination of 
ShuffleExchange, the test code does not match the purpose stated in the test 
title. The affected test cases are as follows:
1、test("EnsureRequirements eliminates Exchange if child has same partitioning")
   The check is only that the number of ShuffleExchange nodes in the physical 
plan is not == 2, which is not an accurate check here.
2、test("EnsureRequirements does not eliminate Exchange with different 
partitioning")
   The purpose of this test is that the ShuffleExchange is not eliminated, but 
its test code is the same as test("EnsureRequirements eliminates Exchange if 
child has same partitioning")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23609) Test code does not conform to the test title

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23609:


Assignee: Apache Spark

> Test code does not conform to the test title
> 
>
> Key: SPARK-23609
> URL: https://issues.apache.org/jira/browse/SPARK-23609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, in the EnsureRequirements test cases covering elimination of 
> ShuffleExchange, the test code does not match the purpose stated in the test 
> title. The affected test cases are as follows:
> 1、test("EnsureRequirements eliminates Exchange if child has same 
> partitioning")
>    The check is only that the number of ShuffleExchange nodes in the physical 
> plan is not == 2, which is not an accurate check here.
> 2、test("EnsureRequirements does not eliminate Exchange with different 
> partitioning")
>    The purpose of this test is that the ShuffleExchange is not eliminated, but 
> its test code is the same as test("EnsureRequirements eliminates Exchange if 
> child has same partitioning")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23609) Test code does not conform to the test title

2018-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387470#comment-16387470
 ] 

Apache Spark commented on SPARK-23609:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/20747

> Test code does not conform to the test title
> 
>
> Key: SPARK-23609
> URL: https://issues.apache.org/jira/browse/SPARK-23609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Minor
>
> Currently, in the EnsureRequirements test cases covering elimination of 
> ShuffleExchange, the test code does not match the purpose stated in the test 
> title. The affected test cases are as follows:
> 1、test("EnsureRequirements eliminates Exchange if child has same 
> partitioning")
>    The check is only that the number of ShuffleExchange nodes in the physical 
> plan is not == 2, which is not an accurate check here.
> 2、test("EnsureRequirements does not eliminate Exchange with different 
> partitioning")
>    The purpose of this test is that the ShuffleExchange is not eliminated, but 
> its test code is the same as test("EnsureRequirements eliminates Exchange if 
> child has same partitioning")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23609) Test code does not conform to the test title

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23609:


Assignee: (was: Apache Spark)

> Test code does not conform to the test title
> 
>
> Key: SPARK-23609
> URL: https://issues.apache.org/jira/browse/SPARK-23609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Minor
>
> Currently, in the EnsureRequirements test cases covering elimination of 
> ShuffleExchange, the test code does not match the purpose stated in the test 
> title. The affected test cases are as follows:
> 1、test("EnsureRequirements eliminates Exchange if child has same 
> partitioning")
>    The check is only that the number of ShuffleExchange nodes in the physical 
> plan is not == 2, which is not an accurate check here.
> 2、test("EnsureRequirements does not eliminate Exchange with different 
> partitioning")
>    The purpose of this test is that the ShuffleExchange is not eliminated, but 
> its test code is the same as test("EnsureRequirements eliminates Exchange if 
> child has same partitioning")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20162) Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18)

2018-03-06 Thread Caio Quirino da Silva (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383471#comment-16383471
 ] 

Caio Quirino da Silva edited comment on SPARK-20162 at 3/6/18 11:53 AM:


I have reproduced the problem using Spark 2.2.0 with that snippet:

 
{code:java}
case class MyEntity(field: BigDecimal)

private val avroFileDir = "abc.avro"
def test(): Unit = {
  val sp = sparkSession
  import sp.implicits._
  
  val rdd = 
sparkSession.sparkContext.parallelize(List(MyEntity(BigDecimal(1.23))))
  
  val df = sp.createDataFrame(rdd)
  df.write.mode(SaveMode.Append).avro(avroFileDir)
  sp.read.avro(avroFileDir).as[MyEntity].head
}{code}
 

 

So I think that we can reopen this issue...

 

org.apache.spark.sql.AnalysisException: Cannot up cast lambdavariable  
from string to decimal(38,18) as it may truncate


was (Author: caioquirino):
I have reproduced the problem using Spark 2.2.0 with that snippet:

 
{code:java}
case class MyEntity(field: BigDecimal)
val df = ss.createDataframe(Seq(MyEntity(BigDecimal(1.23
df.write.mode(SaveMode.Append).avro("dir.avro")
ss.read.avro("dir.avro").as[MyEntity].head
{code}
 

 

So I think that we can reopen this issue...

 

org.apache.spark.sql.AnalysisException: Cannot up cast lambdavariable  
from string to decimal(38,18) as it may truncate
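For anyone hitting this, the workaround the error message itself points at, as a hedged sketch (jdbcDf and the target case class Structure are assumptions taken from the original report quoted below):

{code:java}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Hedged workaround sketch: widen the column to the precision/scale the Dataset
// encoder expects before calling .as[...].
val widened = jdbcDf.withColumn("DECIMAL_AMOUNT",
  col("DECIMAL_AMOUNT").cast(DecimalType(38, 18)))
// widened.as[Structure] should then resolve without the up-cast error.
{code}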

> Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18)
> -
>
> Key: SPARK-20162
> URL: https://issues.apache.org/jira/browse/SPARK-20162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Miroslav Spehar
>Priority: Major
>
> While reading data from MySQL, type conversion doesn't work for Decimal type 
> when the decimal in database is of lower precision/scale than the one spark 
> expects.
> Error:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
> cast `DECIMAL_AMOUNT` from decimal(30,6) to decimal(38,18) as it may truncate
> The type path of the target object is:
> - field (class: "org.apache.spark.sql.types.Decimal", name: "DECIMAL_AMOUNT")
> - root class: "com.misp.spark.Structure"
> You can either add an explicit cast to the input data or choose a higher 
> precision type of the field in the target object;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2119)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2141)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2136)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:360)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:248)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:258)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> 

[jira] [Resolved] (SPARK-23594) Add interpreted execution for GetExternalRowField expression

2018-03-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23594.
---
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> Add interpreted execution for GetExternalRowField expression
> 
>
> Key: SPARK-23594
> URL: https://issues.apache.org/jira/browse/SPARK-23594
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23590) Add interpreted execution for CreateExternalRow expression

2018-03-06 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387721#comment-16387721
 ] 

Marco Gaido commented on SPARK-23590:
-

I am working on this

> Add interpreted execution for CreateExternalRow expression
> --
>
> Key: SPARK-23590
> URL: https://issues.apache.org/jira/browse/SPARK-23590
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20162) Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18)

2018-03-06 Thread Caio Quirino da Silva (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387661#comment-16387661
 ] 

Caio Quirino da Silva edited comment on SPARK-20162 at 3/6/18 12:08 PM:


Yes! And I can say that it started to fail from version 2.2.x.
 For Spark 2.1.2 it's fine.

I have updated my last code snippet to create a cleaner stacktrace:
{code:java}
18/03/06 11:51:10 INFO DAGScheduler: Job 0 finished: save at package.scala:26, 
took 0.941392 s
18/03/06 11:51:10 INFO FileFormatWriter: Job null committed.

Cannot up cast `field` from string to decimal(38,18) as it may truncate
The type path of the target object is:
- field (class: "scala.math.BigDecimal", name: "field")
- root class: "org.farfetch.bigdata.streaming.MyEntity"
You can either add an explicit cast to the input data or choose a higher 
precision type of the field in the target object;
org.apache.spark.sql.AnalysisException: Cannot up cast `field` from string to 
decimal(38,18) as it may truncate
The type path of the target object is:
- field (class: "scala.math.BigDecimal", name: "field")
- root class: "org.farfetch.bigdata.streaming.MyEntity"
You can either add an explicit cast to the input data or choose a higher 
precision type of the field in the target object;
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2123)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2153)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2140)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:336)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:334)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:245)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:245)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:285)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:245)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:236)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2140)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2136)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 

[jira] [Assigned] (SPARK-23582) Add interpreted execution to StaticInvoke expression

2018-03-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-23582:
-

Assignee: Kazuaki Ishizaki

> Add interpreted execution to StaticInvoke expression
> 
>
> Key: SPARK-23582
> URL: https://issues.apache.org/jira/browse/SPARK-23582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Kazuaki Ishizaki
>Priority: Major
>
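
For context, an interpreted path for StaticInvoke essentially resolves the static method by reflection and invokes it with the already-evaluated argument values instead of emitting a generated call. A deliberately simplified sketch, not the actual change; it ignores the nullability, boxing, and return-type handling the real expression deals with:

{code:scala}
// Sketch only: resolve a static method by name and arity, then invoke it with
// already-evaluated argument values.
import java.lang.reflect.Method

object StaticInvokeSketch {
  def eval(staticObject: Class[_], functionName: String, args: Seq[AnyRef]): Any = {
    val method: Method = staticObject.getMethods
      .find(m => m.getName == functionName && m.getParameterCount == args.length)
      .getOrElse(throw new NoSuchMethodException(s"$functionName on ${staticObject.getName}"))
    method.invoke(null, args: _*)
  }
}
{code}

For example, StaticInvokeSketch.eval(classOf[java.lang.Integer], "parseInt", Seq("42")) returns 42.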




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23611) Extend ExpressionEvalHelper harness to also test failures

2018-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23611:


Assignee: Apache Spark

> Extend ExpressionEvalHelper harness to also test failures
> -
>
> Key: SPARK-23611
> URL: https://issues.apache.org/jira/browse/SPARK-23611
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>
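
For context, one shape such an extension could take (the helper and its name are invented here, not the actual patch) is an assertion that evaluating an expression throws with an expected message fragment, rather than comparing a result value:

{code:scala}
// Illustrative only: assert that eval() fails and that the failure message
// contains an expected fragment.
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Expression

object FailureEvalSketch {
  def checkExceptionInEval(expr: Expression, expectedMsgFragment: String,
                           input: InternalRow = InternalRow.empty): Unit = {
    Try(expr.eval(input)) match {
      case Failure(e) =>
        assert(e.getMessage != null && e.getMessage.contains(expectedMsgFragment),
          s"unexpected failure message: ${e.getMessage}")
      case Success(v) =>
        throw new AssertionError(s"expected evaluation of $expr to fail, but got: $v")
    }
  }
}
{code}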




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


