[jira] [Commented] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939867#comment-16939867
 ] 

Yuming Wang commented on SPARK-29274:
-

I don't know how Vertica handles this, but it returns an incorrect result: the join matches 1001636981212 with 1001636981213.
{noformat}
create table t1 (incdata_id decimal(21,0), v VARCHAR);
create table t2 (incdata_id VARCHAR, v VARCHAR);
insert into t1 values(1001636981212, '1');
insert into t2 values(1001636981213, '2');


dbadmin=> select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
  incdata_id   | v |  incdata_id   | v
---+---+---+---
 1001636981212 | 1 | 1001636981213 | 2
(1 row)

dbadmin=> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);

   QUERY PLAN
-
 --
 QUERY PLAN DESCRIPTION:
 --

 explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);

 Access Path:
 +-JOIN HASH [Cost: 4, Rows: 1 (NO STATISTICS)] (PATH ID: 1)
 |  Join Cond: (t1.incdata_id = t2.incdata_id)
 |  Materialize at Output: t1.v
 | +-- Outer -> STORAGE ACCESS for t1 [Cost: 1, Rows: 1 (NO STATISTICS)] (PATH 
ID: 2)
 | |  Projection: public.t1_super
 | |  Materialize: t1.incdata_id
 | |  Runtime Filter: (SIP1(HashJoin): t1.incdata_id)
 | +-- Inner -> STORAGE ACCESS for t2 [Cost: 2, Rows: 1 (NO STATISTICS)] (PATH 
ID: 3)
 | |  Projection: public.t2_super
 | |  Materialize: t2.incdata_id, t2.v


 --
 ---
 PLAN: BASE QUERY PLAN (GraphViz Format)
 ---
 digraph G {
 graph [rankdir=BT, label = "BASE QUERY PLAN\nQuery: explain select * from t1 
join t2 on (t1.incdata_id = t2.incdata_id);\n\nAll Nodes Vector: \n\n  
node[0]=v_docker_node0001 (initiator) Up\n", labelloc=t, labeljust=l 
ordering=out]
 0[label = "Root \nOutBlk=[UncTuple(4)]", color = "green", shape = "house"];
 1[label = "NewEENode \nOutBlk=[UncTuple(4)]", color = "green", shape = "box"];
 2[label = "StorageUnionStep: t1_super\nUnc: Numeric(21,0)\nUnc: 
Varchar(80)\nUnc: Varchar(80)\nUnc: Varchar(80)", color = "purple", shape = 
"box"];
 3[label = "Join: Hash-Join: \n(public.t1 x public.t2) using t1_super and 
t2_super (PATH ID: 1)\n (t1.incdata_id = t2.incdata_id)\n\nUnc: 
Numeric(21,0)\nUnc: Varchar(80)\nUnc: Varchar(80)\nUnc: Varchar(80)", color = 
"brown", shape = "box"];
 4[label = "FilterStep: \n(t1.incdata_id IS NOT NULL)\nUnc: Numeric(21,0)", 
color = "brown", shape = "box"];
 5[label = "ScanStep: t1_super\nSIP1(HashJoin): t1.incdata_id\nincdata_id\nUnc: 
Numeric(21,0)", color = "brown", shape = "box"];
 6[label = "FilterStep: \n(t2.incdata_id IS NOT NULL)\nUnc: Varchar(80)\nUnc: 
Varchar(80)", color = "green", shape = "box"];
 7[label = "StorageUnionStep: t2_super\nUnc: Varchar(80)\nUnc: Varchar(80)", 
color = "purple", shape = "box"];
 8[label = "ScanStep: t2_super\nincdata_id\nv\nUnc: Varchar(80)\nUnc: 
Varchar(80)", color = "brown", shape = "box"];
 1->0 [label = "V[0] C=4",color = "black",style="bold", arrowtail="inv"];
 2->1 [label = "0",color = "blue"];
 3->2 [label = "0",color = "blue"];
 4->3 [label = "0",color = "blue"];
 5->4 [label = "0",color = "blue"];
 6->3 [label = "1",color = "blue"];
 7->6 [label = "0",color = "blue"];
 8->7 [label = "0",color = "blue"];
 }
(43 rows)
{noformat}
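
For illustration, a minimal Scala sketch of why coercing a wide decimal join key to double can match distinct keys: a double keeps only about 15-17 significant decimal digits, so distinct decimal(21,0) values can collapse to the same double (the 21-digit values below are made up for the example):
{code:scala}
// Two distinct 21-digit decimals become equal once both sides are coerced
// to double, because a double cannot represent 21 significant digits.
val a = BigDecimal("100163698121200000001")  // hypothetical key from t1
val b = BigDecimal("100163698121200000002")  // hypothetical key from t2

println(a == b)                    // false: the decimals differ
println(a.toDouble == b.toDouble)  // true: both round to the same double
{code}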

> Can not coerce decimal type to double type when it's join key
> -
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Pengfei Chang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- Exchange 
> 

[jira] [Commented] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939865#comment-16939865
 ] 

Yuming Wang commented on SPARK-29274:
-

Oracle casts the varchar column to the number type (note the TO_NUMBER in the access predicate below):
{noformat}
create table t1 (incdata_id number(21,0), v VARCHAR(21));
create table t2 (incdata_id VARCHAR(210), v VARCHAR(21));
insert into t1 values(1001636981212, '1');
insert into t2 values(1001636981213, '2');

SQL> select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);

no rows selected

SQL> explain plan for select * from t1 join t2 on (t1.incdata_id = 
t2.incdata_id);

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT

Plan hash value: 1838229974

---
| Id  | Operation  | Name | Rows  | Bytes | Cost (%CPU)| Time |
---
|   0 | SELECT STATEMENT   |  | 1 |   144 | 4   (0)| 00:00:01 |
|*  1 |  HASH JOIN |  | 1 |   144 | 4   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   | 1 |25 | 2   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T2   | 1 |   119 | 2   (0)| 00:00:01 |
---


PLAN_TABLE_OUTPUT

Predicate Information (identified by operation id):
---

   1 - access("T1"."INCDATA_ID"=TO_NUMBER("T2"."INCDATA_ID"))

Note
-
   - dynamic statistics used: dynamic sampling (level=2)

19 rows selected.
{noformat}

> Can not coerce decimal type to double type when it's join key
> -
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Pengfei Chang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
>  as double))), 200), true, [id=#104]
> : +- *(1) Filter isnotnull(incdata_id#31)
> :+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double))) ASC NULLS FIRST], false, 0
>+- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
>  as double))), 200), true, [id=#112]
>   +- *(3) Filter isnotnull(incdata_id#33)
>  +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
> {code}
> {code:sql}
> select cast(v1 as double) as v3, cast(v2 as double) as v4,
>   cast(v1 as double) = cast(v2 as double), v1 = v2 
> from (select cast('1001636981212' as decimal(21, 0)) as v1,
>   cast('1001636981213' as decimal(21, 0)) as v2) t;
> 1.00163697E20 1.00163697E20   truefalse
> {code}
>  
> It's a real case from our production:
> !image-2019-09-27-20-20-24-238.png|width=100%!






[jira] [Updated] (SPARK-29280) DataFrameReader should support a compression option

2019-09-27 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-29280:
-
Description: 
[DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
 supports a {{compression}} option, but 
[DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
 doesn't. The lack of a {{compression}} option in the reader causes some 
friction in the following cases:
 # You want to read some data compressed with a codec that Spark does not [load 
by 
default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
 # You want to read some data with a codec that overrides one of the built-in 
codecs that Spark supports.
 # You want to explicitly instruct Spark on what codec to use on read when it 
will not be able to correctly auto-detect it (e.g. because the file extension 
is [missing,|https://stackoverflow.com/q/52011697/877069] 
[non-standard|https://stackoverflow.com/q/44372995/877069], or 
[incorrect|https://stackoverflow.com/q/49110384/877069]).

Case #2 came up in SPARK-29102. There is a very handy library called 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
load a single gzipped file using multiple concurrent tasks. (You can see the 
details of how it works and why it's useful in the project README and in 
SPARK-29102.)

To use this codec, I had to set {{io.compression.codecs}}. I guess this is a 
Hadoop filesystem API setting, since it [doesn't appear to be documented by 
Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, 
there is also a setting called {{spark.io.compression.codec}}, which seems to 
be for a different purpose.

It would be much clearer for the user and more consistent with the writer 
interface if the reader let you directly specify the codec.

For example, I think all of the following should be possible:
{code:python}
spark.read.option('compression', 'lz4').csv(...)
spark.read.csv(..., 
compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec')
spark.read.json(..., compression='none')
{code}

  was:
[DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
 supports a {{compression}} option, but 
[DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
 doesn't. The lack of a {{compression}} option in the reader causes some 
friction in the following cases:
 # You want to read some data compressed with a codec that Spark does not [load 
by 
default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
 # You want to read some data with a codec that overrides one of the built-in 
codecs that Spark supports.
 # You want to explicitly instruct Spark on what codec to use on read when it 
will not be able to correctly auto-detect it (e.g. because the file extension 
is [missing,|https://stackoverflow.com/q/52011697/877069] 
[non-standard|https://stackoverflow.com/q/44372995/877069], or 
[incorrect|https://stackoverflow.com/q/49110384/877069]).

Case #2 came up in SPARK-29102. There is a very handy library called 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
load a single gzipped file using multiple concurrent tasks. (You can see the 
details of how it works and why it's useful in the project README and in 
SPARK-29102.)

To use this codec, I had to set {{io.compression.codecs}}. I guess this is a 
Hadoop filesystem API setting, since it [doesn't appear to be documented by 
Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, 
there is also a setting called {{spark.io.compression.codec}}, which seems to 
be for a different purpose.

It would be much clearer for the user and more consistent with the writer 
interface if the reader let you directly specify the codec.

For example:
{code:java}
spark.read.option('compression', 'lz4').csv(...)
spark.read.csv(..., 
compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec') {code}


> DataFrameReader should support a compression option
> ---
>
> Key: SPARK-29280
> URL: https://issues.apache.org/jira/browse/SPARK-29280
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> [DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
>  supports a {{compression}} option, but 
> [DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
>  doesn't. The lack of a {{compression}} option in the reader causes some 
> 

[jira] [Comment Edited] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-27 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939856#comment-16939856
 ] 

Nicholas Chammas edited comment on SPARK-29102 at 9/28/19 5:35 AM:
---

I figured it out. It looks like the correct setting is {{io.compression.codecs}},
as instructed in the
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] repo, not
{{spark.hadoop.io.compression.codecs}}. I mistakenly tried the latter after
seeing others use it elsewhere.

So in summary, to read gzip files in Spark using this codec, you need to:
 # Start up Spark with "{{--packages nl.basjes.hadoop:splittablegzip:1.2}}".
 # Then, enable the new codec with "{{spark.conf.set('io.compression.codecs', 
'nl.basjes.hadoop.io.compress.SplittableGzipCodec')}}".
 # From there, you can read gzipped CSVs as you would normally, via 
"{{spark.read.csv(...)}}".

I've confirmed that, using this codec, Spark loads a single gzipped file with 
multiple concurrent tasks (and without the codec it only runs one task). I 
haven't done any further testing to see what performance benefits there are in 
a realistic use case, but if this codec works as described in its README then 
that should be good enough for me!

I've filed SPARK-29280 about adding a {{compression}} option to 
{{DataFrameReader}} to match {{DataFrameWriter}} and make this kind of workflow 
a bit more straightforward.
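
Put together as a Scala (spark-shell) sketch of the steps above, with the input path as a placeholder:
{code:scala}
// Started with: spark-shell --packages nl.basjes.hadoop:splittablegzip:1.2
// Register the codec as described above, then read as usual.
spark.conf.set(
  "io.compression.codecs",
  "nl.basjes.hadoop.io.compress.SplittableGzipCodec")

val df = spark.read
  .option("header", "true")
  .csv("/path/to/big-file.csv.gz")   // placeholder path

// With the codec applied, the single gzipped file is read by multiple
// concurrent tasks rather than one.
println(df.rdd.getNumPartitions)
{code}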


was (Author: nchammas):
I figured it out. Looks like the correct setting is {{io.compression.codecs,}} 
as instructed in the 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] repo, not 
{{spark.hadoop.io.compression.codecs}}. I mistakenly tried that after seeing 
others use it elsewhere.

So in summary, to read gzip files in Spark using this codec, you need to:
 # Start up Spark with "{{--packages nl.basjes.hadoop:splittablegzip:1.2}}".
 # Then, enable the new codec with "{{spark.conf.set('io.compression.codecs', 
'nl.basjes.hadoop.io.compress.SplittableGzipCodec')}}".
 # From there, you can read gzipped CSVs as you would normally, via 
"{{spark.read.csv(...)}}".

I've confirmed that, using this codec, Spark loads a single gzipped file with 
multiple concurrent tasks (and without the codec it only runs one task). I 
haven't done any further testing to see what performance benefits there are in 
a realistic use case, but if this codec works as described in its README then 
that should be good enough for me!

At this point I think all that remains is for me to file a Jira about adding a 
{{compression}} option to {{DataFrameReader}} to match {{DataFrameWriter}} and 
make this kind of workflow a bit more straightforward.

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless 

[jira] [Commented] (SPARK-29280) DataFrameReader should support a compression option

2019-09-27 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939864#comment-16939864
 ] 

Nicholas Chammas commented on SPARK-29280:
--

cc [~hyukjin.kwon], [~cloud_fan]

> DataFrameReader should support a compression option
> ---
>
> Key: SPARK-29280
> URL: https://issues.apache.org/jira/browse/SPARK-29280
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> [DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
>  supports a {{compression}} option, but 
> [DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
>  doesn't. The lack of a {{compression}} option in the reader causes some 
> friction in the following cases:
>  # You want to read some data compressed with a codec that Spark does not 
> [load by 
> default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
>  # You want to read some data with a codec that overrides one of the built-in 
> codecs that Spark supports.
>  # You want to explicitly instruct Spark on what codec to use on read when it 
> will not be able to correctly auto-detect it (e.g. because the file extension 
> is [missing,|https://stackoverflow.com/q/52011697/877069] 
> [non-standard|https://stackoverflow.com/q/44372995/877069], or 
> [incorrect|https://stackoverflow.com/q/49110384/877069]).
> Case #2 came up in SPARK-29102. There is a very handy library called 
> [SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
> load a single gzipped file using multiple concurrent tasks. (You can see the 
> details of how it works and why it's useful in the project README and in 
> SPARK-29102.)
> To use this codec, I had to set {{io.compression.codecs}}. I guess this is a 
> Hadoop filesystem API setting, since it [doesn't appear to be documented by 
> Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, 
> there is also a setting called {{spark.io.compression.codec}}, which seems to 
> be for a different purpose.
> It would be much clearer for the user and more consistent with the writer 
> interface if the reader let you directly specify the codec.
> For example:
> {code:java}
> spark.read.option('compression', 'lz4').csv(...)
> spark.read.csv(..., 
> compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec') {code}






[jira] [Created] (SPARK-29280) DataFrameReader should support a compression option

2019-09-27 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-29280:


 Summary: DataFrameReader should support a compression option
 Key: SPARK-29280
 URL: https://issues.apache.org/jira/browse/SPARK-29280
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.4.4
Reporter: Nicholas Chammas


[DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
 supports a {{compression}} option, but 
[DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
 doesn't. The lack of a {{compression}} option in the reader causes some 
friction in the following cases:
 # You want to read some data compressed with a codec that Spark does not [load 
by 
default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
 # You want to read some data with a codec that overrides one of the built-in 
codecs that Spark supports.
 # You want to explicitly instruct Spark on what codec to use on read when it 
will not be able to correctly auto-detect it (e.g. because the file extension 
is [missing,|https://stackoverflow.com/q/52011697/877069] 
[non-standard|https://stackoverflow.com/q/44372995/877069], or 
[incorrect|https://stackoverflow.com/q/49110384/877069]).

Case #2 came up in SPARK-29102. There is a very handy library called 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
load a single gzipped file using multiple concurrent tasks. (You can see the 
details of how it works and why it's useful in the project README and in 
SPARK-29102.)

To use this codec, I had to set {{io.compression.codecs}}. I guess this is a 
Hadoop filesystem API setting, since it [doesn't appear to be documented by 
Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, 
there is also a setting called {{spark.io.compression.codec}}, which seems to 
be for a different purpose.

It would be much clearer for the user and more consistent with the writer 
interface if the reader let you directly specify the codec.

For example:
{code:java}
spark.read.option('compression', 'lz4').csv(...)
spark.read.csv(..., 
compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec') {code}






[jira] [Commented] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939862#comment-16939862
 ] 

Yuming Wang commented on SPARK-29274:
-

SQL Server casts the string column to the decimal type (note the CONVERT_IMPLICIT to decimal(21,0) in the plan below):

{noformat}
create table t1 (incdata_id decimal(21,0), v VARCHAR(21))
create table t2 (incdata_id VARCHAR(210), v VARCHAR(21))
insert into t1 values(1001636981212, '1')
insert into t2 values(1001636981213, '2')

1> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id)
2> go
StmtText (showplan tree only; the remaining showplan columns -- StmtId, NodeId,
Parent, PhysicalOp, LogicalOp, row/IO/CPU estimates, costs, output lists -- are
omitted here for readability)
--------------------------------------------------------------------------------
select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
  |--Hash Match(Inner Join,
       HASH:([master].[dbo].[t1].[incdata_id])=([Expr1006]),
       RESIDUAL:([master].[dbo].[t1].[incdata_id]=[Expr1006]))
     |--Table Scan(OBJECT:([master].[dbo].[t1]))
     |--Compute Scalar(DEFINE:([Expr1006]=
          CONVERT_IMPLICIT(decimal(21,0),[master].[dbo].[t2].[incdata_id],0)))
        |--Table ...

[jira] [Comment Edited] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939862#comment-16939862
 ] 

Yuming Wang edited comment on SPARK-29274 at 9/28/19 5:26 AM:
--

SQL Server casts the string column to the decimal type (note the CONVERT_IMPLICIT to decimal(21,0) in the plan below):
{noformat}
create table t1 (incdata_id decimal(21,0), v VARCHAR(21))
create table t2 (incdata_id VARCHAR(210), v VARCHAR(21))
insert into t1 values(1001636981212, '1')
insert into t2 values(1001636981213, '2')

1> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id)
2> go
StmtText (showplan tree only; the remaining showplan columns -- StmtId, NodeId,
Parent, PhysicalOp, LogicalOp, row/IO/CPU estimates, costs, output lists -- are
omitted here for readability)
--------------------------------------------------------------------------------
select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
  |--Hash Match(Inner Join,
       HASH:([master].[dbo].[t1].[incdata_id])=([Expr1006]),
       RESIDUAL:([master].[dbo].[t1].[incdata_id]=[Expr1006]))
     |--Table Scan(OBJECT:([master].[dbo].[t1]))
     |--Compute Scalar(DEFINE:([Expr1006]=
          CONVERT_IMPLICIT(decimal(21,0),[master].[dbo].[t2].[incdata_id],0)))

[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-27 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939856#comment-16939856
 ] 

Nicholas Chammas commented on SPARK-29102:
--

I figured it out. It looks like the correct setting is {{io.compression.codecs}},
as instructed in the
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] repo, not
{{spark.hadoop.io.compression.codecs}}. I mistakenly tried the latter after
seeing others use it elsewhere.

So in summary, to read gzip files in Spark using this codec, you need to:
 # Start up Spark with "{{--packages nl.basjes.hadoop:splittablegzip:1.2}}".
 # Then, enable the new codec with "{{spark.conf.set('io.compression.codecs', 
'nl.basjes.hadoop.io.compress.SplittableGzipCodec')}}".
 # From there, you can read gzipped CSVs as you would normally, via 
"{{spark.read.csv(...)}}".

I've confirmed that, using this codec, Spark loads a single gzipped file with 
multiple concurrent tasks (and without the codec it only runs one task). I 
haven't done any further testing to see what performance benefits there are in 
a realistic use case, but if this codec works as described in its README then 
that should be good enough for me!

At this point I think all that remains is for me to file a Jira about adding a 
{{compression}} option to {{DataFrameReader}} to match {{DataFrameWriter}} and 
make this kind of workflow a bit more straightforward.

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.
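
As a rough standalone illustration of the "decompress and throw away the leading data" idea described above, here is a Scala sketch of what one task's range read could look like (the path and offsets are placeholders; a real implementation would live in the Hadoop input-format/codec layer rather than in user code):
{code:scala}
import java.io.{ByteArrayOutputStream, FileInputStream}
import java.util.zip.GZIPInputStream

object GzipRangeRead {
  // Decompress from the start of the file but keep only the bytes of the
  // uncompressed stream that fall in [start, end) -- the work one task
  // would do for its assigned range.
  def readRange(path: String, start: Long, end: Long): Array[Byte] = {
    val in  = new GZIPInputStream(new FileInputStream(path))
    val out = new ByteArrayOutputStream()
    try {
      var pos = 0L                      // position in the uncompressed stream
      val buf = new Array[Byte](8192)
      var n = in.read(buf)
      while (n > 0 && pos < end) {
        if (pos + n > start) {          // buffer overlaps the target range
          val from = math.max(start - pos, 0L).toInt
          val to   = math.min(end - pos, n.toLong).toInt
          out.write(buf, from, to - from)
        }
        pos += n
        n = in.read(buf)
      }
      out.toByteArray
    } finally in.close()
  }
}
{code}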






[jira] [Created] (SPARK-29279) DataSourceV2: merge SHOW NAMESPACES and SHOW DATABASES code path

2019-09-27 Thread Terry Kim (Jira)
Terry Kim created SPARK-29279:
-

 Summary: DataSourceV2: merge SHOW NAMESPACES and SHOW DATABASES 
code path
 Key: SPARK-29279
 URL: https://issues.apache.org/jira/browse/SPARK-29279
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Terry Kim


Currently, SHOW NAMESPACES and SHOW DATABASES follow separate code paths. These
should be merged.






[jira] [Comment Edited] (SPARK-29273) Spark peakExecutionMemory metrics is zero

2019-09-27 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939827#comment-16939827
 ] 

angerszhu edited comment on SPARK-29273 at 9/28/19 3:07 AM:


[~UncleHuang]


{code}
  /**
   * Peak memory used by internal data structures created during shuffles, 
aggregations and
   * joins. The value of this accumulator should be approximately the sum of 
the peak sizes
   * across all such data structures created in this task. For SQL jobs, this 
only tracks all
   * unsafe operators and ExternalSort.
   */
  def peakExecutionMemory: Long = _peakExecutionMemory.sum

 {code}
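
A minimal Scala sketch of one way to observe the value per task through the public listener API (names below are just an example; per the scaladoc above, tasks that run neither unsafe operators nor ExternalSort will legitimately report 0):
{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Log each task's peakExecutionMemory as reported by TaskMetrics.
class PeakMemoryListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"peakExecutionMemory=${m.peakExecutionMemory}")
    }
  }
}

// e.g. in spark-shell:
// spark.sparkContext.addSparkListener(new PeakMemoryListener)
{code}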


was (Author: angerszhuuu):
[~UncleHuang]
PS: I work for exa spark now.

{code}
  /**
   * Peak memory used by internal data structures created during shuffles, 
aggregations and
   * joins. The value of this accumulator should be approximately the sum of 
the peak sizes
   * across all such data structures created in this task. For SQL jobs, this 
only tracks all
   * unsafe operators and ExternalSort.
   */
  def peakExecutionMemory: Long = _peakExecutionMemory.sum

 {code}

> Spark peakExecutionMemory metrics is zero
> -
>
> Key: SPARK-29273
> URL: https://issues.apache.org/jira/browse/SPARK-29273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> With Spark 2.4.3 in our production environment, I want to get the
> peakExecutionMemory that is exposed by TaskMetrics, but I always get a
> zero value.






[jira] [Comment Edited] (SPARK-29273) Spark peakExecutionMemory metrics is zero

2019-09-27 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939827#comment-16939827
 ] 

angerszhu edited comment on SPARK-29273 at 9/28/19 3:06 AM:


[~UncleHuang]
PS: I work for exa spark now.

{code}
  /**
   * Peak memory used by internal data structures created during shuffles, 
aggregations and
   * joins. The value of this accumulator should be approximately the sum of 
the peak sizes
   * across all such data structures created in this task. For SQL jobs, this 
only tracks all
   * unsafe operators and ExternalSort.
   */
  def peakExecutionMemory: Long = _peakExecutionMemory.sum

 {code}


was (Author: angerszhuuu):
[~UncleHuang]

{code}
  /**
   * Peak memory used by internal data structures created during shuffles, 
aggregations and
   * joins. The value of this accumulator should be approximately the sum of 
the peak sizes
   * across all such data structures created in this task. For SQL jobs, this 
only tracks all
   * unsafe operators and ExternalSort.
   */
  def peakExecutionMemory: Long = _peakExecutionMemory.sum

 {code}

> Spark peakExecutionMemory metrics is zero
> -
>
> Key: SPARK-29273
> URL: https://issues.apache.org/jira/browse/SPARK-29273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> With Spark 2.4.3 in our production environment, I want to get the
> peakExecutionMemory that is exposed by TaskMetrics, but I always get a
> zero value.






[jira] [Commented] (SPARK-29273) Spark peakExecutionMemory metrics is zero

2019-09-27 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939827#comment-16939827
 ] 

angerszhu commented on SPARK-29273:
---

[~UncleHuang]

{code}
  /**
   * Peak memory used by internal data structures created during shuffles, 
aggregations and
   * joins. The value of this accumulator should be approximately the sum of 
the peak sizes
   * across all such data structures created in this task. For SQL jobs, this 
only tracks all
   * unsafe operators and ExternalSort.
   */
  def peakExecutionMemory: Long = _peakExecutionMemory.sum

 {code}

> Spark peakExecutionMemory metrics is zero
> -
>
> Key: SPARK-29273
> URL: https://issues.apache.org/jira/browse/SPARK-29273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> With Spark 2.4.3 in our production environment, I want to get the
> peakExecutionMemory that is exposed by TaskMetrics, but I always get a
> zero value.






[jira] [Commented] (SPARK-29186) SubqueryAlias name value is null in Spark 2.4.3 Logical plan.

2019-09-27 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939825#comment-16939825
 ] 

angerszhu commented on SPARK-29186:
---

In both branch-2.4 and the master branch, I ran this SQL:

{code:sql}
create table customerTempView(CustomerID int);

create table orderTempView(CustomerID int, OrderID int);

explain extended SELECT OID, MAX(MID + CID) AS MID_new, ROW_NUMBER() OVER 
(ORDER BY CID) AS rn
FROM (
SELECT OID_1 AS OID, CID_1 AS CID, OID_1 + CID_1 AS MID
FROM (
SELECT MIN(ot.OrderID) AS OID_1, ct.CustomerID AS CID_1
FROM orderTempView ot
INNER JOIN customerTempView ct ON ot.CustomerID = 
ct.CustomerID
GROUP BY CID_1
)
)
GROUP BY OID, CID

{code}


and got this analyzed plan:

{code}
== Analyzed Logical Plan ==
OID: int, MID_new: int, rn: int
Project [OID#2, MID_new#5, rn#6]
+- Project [OID#2, MID_new#5, CID#3, rn#6, rn#6]
   +- Window [row_number() windowspecdefinition(CID#3 ASC NULLS FIRST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#6], 
[CID#3 ASC NULLS FIRST]
  +- Aggregate [OID#2, CID#3], [OID#2, max((MID#4 + CID#3)) AS MID_new#5, 
CID#3]
 +- SubqueryAlias `__auto_generated_subquery_name`
+- Project [OID_1#0 AS OID#2, CID_1#1 AS CID#3, (OID_1#0 + CID_1#1) 
AS MID#4]
   +- SubqueryAlias `__auto_generated_subquery_name`
  +- Aggregate [CustomerID#12], [min(OrderID#11) AS OID_1#0, 
CustomerID#12 AS CID_1#1]
 +- Join Inner, (CustomerID#10 = CustomerID#12)
:- SubqueryAlias `ot`
:  +- SubqueryAlias `default`.`ordertempview`
: +- HiveTableRelation `default`.`ordertempview`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [CustomerID#10, OrderID#11]
+- SubqueryAlias `ct`
   +- SubqueryAlias `default`.`customertempview`
  +- HiveTableRelation 
`default`.`customertempview`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [CustomerID#12]

{code}
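
To check the alias values programmatically, a small Scala sketch against the analyzed plan (here {{query}} stands for the SQL text above):
{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias

// Print just the SubqueryAlias nodes from the analyzed plan -- the first line
// of each node shows its alias (ot, ct, __auto_generated_subquery_name, ...),
// so a missing or null name is easy to spot.
val plan = spark.sql(query).queryExecution.analyzed
plan.collect { case s: SubqueryAlias => s.toString.split("\n").head }
    .foreach(println)
{code}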

> SubqueryAlias name value is null in Spark 2.4.3 Logical plan.
> -
>
> Key: SPARK-29186
> URL: https://issues.apache.org/jira/browse/SPARK-29186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: I have tried this on AWS Glue with Spark 2.4.3
> and on Windows 10 with Spark 2.4.4,
> facing the same issue on both
>Reporter: Tarun Khaneja
>Priority: Blocker
> Fix For: 2.2.1
>
> Attachments: image-2019-09-25-12-17-53-552.png, 
> image-2019-09-25-12-21-52-136.png
>
>
> I am writing a program to analyze SQL queries, so I am using the Spark logical
> plan.
> Below is the code I am using:
>     
> {code:java}
> object QueryAnalyzer
> {   
> val LOG = LoggerFactory.getLogger(this.getClass)     //Spark Conf 
>    
> val conf = new     
> SparkConf().setMaster("local[2]").setAppName("LocalEdlExecutor")     
> //Spark Context    
> val sc = new SparkContext(conf)
> //sql Context    
> val sqlContext = new SQLContext(sc)   
>   
> //Spark Session    
> val sparkSession = SparkSession      
> .builder()      
> .appName("Spark User Data")      .config("spark.app.name", "LocalEdl")      
> .getOrCreate()     
> def main(args: Array[String])
> {          
> var inputDfColumns = Map[String,List[String]]() 
> val dfSession =  sparkSession.read.format("csv").      option("header", 
> "true").      option("inferschema", "true").      option("delimiter", 
> ",").option("decoding", "utf8").option("multiline", true) 
>        
> var oDF = dfSession.      load("C:\\Users\\tarun.khaneja\\data\\order.csv")   
>      
> println("smaple data in oDF>")
>       
> oDF.show()           
> var cusDF = dfSession.        
> load("C:\\Users\\tarun.khaneja\\data\\customer.csv")          
> println("smaple data in cusDF>")      cusDF.show() 
>              oDF.createOrReplaceTempView("orderTempView")      
> cusDF.createOrReplaceTempView("customerTempView")
>             
> //get input columns from all dataframe      
> inputDfColumns += 
> ("orderTempView"->oDF.columns.toList) 
>      
> inputDfColumns += 
> ("customerTempView"->cusDF.columns.toList) 
>            
> val res = sqlContext.sql("""select OID, max(MID+CID) as MID_new,ROW_NUMBER() 
> OVER (                      
> ORDER BY CID) as rn from                             (select OID_1 as OID, 
> CID_1 as CID, OID_1+CID_1 as MID from (select min(ot.OrderID) as OID_1, 
> ct.CustomerID as CID_1 from orderTempView as ot inner join customerTempView 
> as ct                          on ot.CustomerID = 

[jira] [Updated] (SPARK-29278) Implement CATALOG/NAMESPACE related SQL commands for Data Source V2

2019-09-27 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-29278:
--
Summary: Implement CATALOG/NAMESPACE related SQL commands for Data Source 
V2  (was: Implement NAMESPACE related SQL commands for Data Source V2)

> Implement CATALOG/NAMESPACE related SQL commands for Data Source V2
> ---
>
> Key: SPARK-29278
> URL: https://issues.apache.org/jira/browse/SPARK-29278
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> Introduce the following SQL commands for Data Source V2
> {code:sql}
> CREATE NAMESPACE mycatalog.ns1.ns2
> SHOW CURRENT CATALOG
> SHOW CURRENT NAMESPACE
> {code}
>  






[jira] [Created] (SPARK-29278) Implement NAMESPACE related SQL commands for Data Source V2

2019-09-27 Thread Terry Kim (Jira)
Terry Kim created SPARK-29278:
-

 Summary: Implement NAMESPACE related SQL commands for Data Source 
V2
 Key: SPARK-29278
 URL: https://issues.apache.org/jira/browse/SPARK-29278
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Terry Kim


Introduce the following SQL commands for Data Source V2

{code:sql}
CREATE NAMESPACE mycatalog.ns1.ns2
SHOW CURRENT CATALOG
SHOW CURRENT NAMESPACE
{code}

 






[jira] [Resolved] (SPARK-29221) Flaky test: SQLQueryTestSuite.sql (subquery/scalar-subquery/scalar-subquery-select.sql)

2019-09-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-29221.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

This is resolved via [https://github.com/apache/spark/pull/25925]

> Flaky test: SQLQueryTestSuite.sql 
> (subquery/scalar-subquery/scalar-subquery-select.sql)
> ---
>
> Key: SPARK-29221
> URL: https://issues.apache.org/jira/browse/SPARK-29221
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
> Fix For: 3.0.0
>
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql/]
> (The page renders somewhat oddly; you may need to click "sql" on the page again
> to see the failure.)
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedException: 
> subquery/scalar-subquery/scalar-subquery-select.sql Expected 
> "struct<[min_t3d:bigint,max_t2h:timestamp]>", but got "struct<[]>" Schema did 
> not match for query #3 SELECT (SELECT min(t3d) FROM t3) min_t3d,
> (SELECT max(t2h) FROM t2) max_t2h FROM   t1 WHERE  t1a = 'val1c': 
> QueryOutput(SELECT (SELECT min(t3d) FROM t3) min_t3d,(SELECT max(t2h) 
> FROM t2) max_t2h FROM   t1 WHERE  t1a = 
> 'val1c',struct<>,java.lang.NullPointerException null)
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> subquery/scalar-subquery/scalar-subquery-select.sql
> Expected "struct<[min_t3d:bigint,max_t2h:timestamp]>", but got "struct<[]>" 
> Schema did not match for query #3
> SELECT (SELECT min(t3d) FROM t3) min_t3d,
>(SELECT max(t2h) FROM t2) max_t2h
> FROM   t1
> WHERE  t1a = 'val1c': QueryOutput(SELECT (SELECT min(t3d) FROM t3) min_t3d,
>(SELECT max(t2h) FROM t2) max_t2h
> FROM   t1
> WHERE  t1a = 'val1c',struct<>,java.lang.NullPointerException
> null)
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at org.scalatest.Assertions.assertResult(Assertions.scala:1003)
>   at org.scalatest.Assertions.assertResult$(Assertions.scala:998)
>   at org.scalatest.FunSuite.assertResult(FunSuite.scala:1560)
>   at 
> org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runQueries$11(SQLQueryTestSuite.scala:382)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runQueries$9(SQLQueryTestSuite.scala:377)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.Assertions.withClue(Assertions.scala:1221)
>   at org.scalatest.Assertions.withClue$(Assertions.scala:1208)
>   at org.scalatest.FunSuite.withClue(FunSuite.scala:1560)
>   at 
> org.apache.spark.sql.SQLQueryTestSuite.runQueries(SQLQueryTestSuite.scala:353)
>   at 
> org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$15(SQLQueryTestSuite.scala:276)
>   at 
> org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$15$adapted(SQLQueryTestSuite.scala:274)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at 
> org.apache.spark.sql.SQLQueryTestSuite.runTest(SQLQueryTestSuite.scala:274)
>   at 
> org.apache.spark.sql.SQLQueryTestSuite.$anonfun$createScalaTestCase$5(SQLQueryTestSuite.scala:223)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at 

[jira] [Created] (SPARK-29277) DataSourceV2: Add early filter and projection pushdown

2019-09-27 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-29277:
-

 Summary: DataSourceV2: Add early filter and projection pushdown
 Key: SPARK-29277
 URL: https://issues.apache.org/jira/browse/SPARK-29277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


Spark uses optimizer rules that need stats before conversion to physical plan. 
DataSourceV2 should support early pushdown for those rules.






[jira] [Commented] (SPARK-29055) Memory leak in Spark

2019-09-27 Thread Jim Kleckner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939787#comment-16939787
 ] 

Jim Kleckner commented on SPARK-29055:
--

[~Geopap] How quickly does the memory grow?

> Memory leak in Spark
> 
>
> Key: SPARK-29055
> URL: https://issues.apache.org/jira/browse/SPARK-29055
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.3
>Reporter: George Papa
>Priority: Major
> Attachments: test_csvs.zip
>
>
> I used Spark 2.1.1 and upgraded to newer versions. After Spark version 2.3.3,
> I observed from the Spark UI that the driver memory is increasing continuously.
> In more detail, the driver memory and executor memory show the same used
> storage memory, and after each iteration the storage memory increases. You can
> reproduce this behavior by running the following snippet. The example is very
> simple, without any dataframe persistence, but the memory consumption is not
> stable as it was in former Spark versions (specifically, until Spark 2.3.2).
> Also, I tested with the Spark Streaming and Structured Streaming APIs and saw
> the same behavior. I tested with an existing application which reads from a
> Kafka source, does some aggregations, persists dataframes and then unpersists
> them. The persist and unpersist work correctly: I see the dataframes in the
> Storage tab of the Spark UI, and after the unpersist all dataframes are removed.
> But after the unpersist the executor memory is not zero; it has the same value
> as the driver memory. This behavior also affects application performance,
> because executor memory grows along with driver memory, and after a while the
> persisted dataframes no longer fit in executor memory and I get spill to disk.
> Another error I hit after a long run was java.lang.OutOfMemoryError: GC
> overhead limit exceeded, but I don't know whether it is related to the above
> behavior.
>  
> *HOW TO REPRODUCE THIS BEHAVIOR:*
> Create a very simple application (streaming count_file.py) in order to
> reproduce this behavior. This application reads CSV files from a directory,
> counts the rows and then removes the processed files.
> {code:java}
> import time
> import os
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from pyspark.sql import types as T
> target_dir = "..."
> spark=SparkSession.builder.appName("DataframeCount").getOrCreate()
> while True:
> for f in os.listdir(target_dir):
> df = spark.read.load(target_dir + f, format="csv")
> print("Number of records: {0}".format(df.count()))
> time.sleep(15){code}
> Submit code:
> {code:java}
> spark-submit 
> --master spark://xxx.xxx.xx.xxx
> --deploy-mode client
> --executor-memory 4g
> --executor-cores 3
> streaming count_file.py
> {code}
>  
> *TESTED CASES WITH THE SAME BEHAVIOUR:*
>  * I tested with default settings (spark-defaults.conf)
>  * Add spark.cleaner.periodicGC.interval 1min (or less)
>  * {{Turn spark.cleaner.referenceTracking.blocking}}=false
>  * Run the application in cluster mode
>  * Increase/decrease the resources of the executors and driver
>  * I tested with extraJavaOptions in driver and executor -XX:+UseG1GC 
> -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
>   
> *DEPENDENCIES*
>  * Operation system: Ubuntu 16.04.3 LTS
>  * Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
>  * Python: Python 2.7.12
>  
> *NOTE:* In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was 
> extremely low and after the run of ContextCleaner and BlockManager the memory 
> was decreasing.






[jira] [Updated] (SPARK-29265) Window+collect_list causing single-task operation

2019-09-27 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
collect_list($"number").over(myWindow))
{code}
  

In the test I can see how all elements of my DF are being collected in a single 
task.


Unbounded+unordered Window + collect_list seems to be collecting ALL the 
dataframe in a single executor/task.

groupBy + collect_list seems to do it as expected (collect_list for each group
independently).

In the end it was a data distribution problem caused by one table which was
supposed to produce a 1-1 join but actually produced a 1-N join (!!)
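
For comparison, a minimal Scala sketch of the two shapes, using the same names as the snippet above (with {{filtrador}} as the input dataframe):
{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list
import spark.implicits._   // for the $"..." syntax (spark is the SparkSession)

// Window form: one output row per input row; the list is built over the whole
// partition for that word, so a dominant word value means one huge task.
val myWindow = Window.partitionBy($"word").orderBy("word")
val byWindow = filtrador.withColumn("avg_Time",
  collect_list($"number").over(myWindow))

// groupBy form: one output row per word; each group is aggregated
// independently, which distributes the work as expected.
val byGroup = filtrador.groupBy($"word")
  .agg(collect_list($"number").as("numbers"))
{code}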

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
collect_list($"number").over(myWindow))
{code}
  

In the test I can see how all elements of my DF are being collected in a single 
task.


Unbounded+unordered Window + collect_list seems to be collecting ALL the 
dataframe in a single executor/task.

groupBy + collect_list seems to do it as expect (collect_list for each group 
independently).
 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.


> Window+collect_list causing single-task operation
> -
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>   
> In the test I can see how all elements of my DF are being collected in a 
> single task.
> Unbounded+unordered Window + collect_list seems to be collecting ALL the 
> dataframe in a single executor/task.
> groupBy + collect_list seems to do it as expected (collect_list for each group
> independently).
>  
> In the end it was a data distribution problem caused by one table which was
> supposed to produce a 1-1 join but actually produced a 1-N join (!!)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-29265) Window+collect_list causing single-task operation

2019-09-27 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz closed SPARK-29265.


> Window+collect_list causing single-task operation
> -
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>   
> In the test I can see how all elements of my DF are being collected in a 
> single task.
> Unbounded+unordered Window + collect_list seems to be collecting ALL the 
> dataframe in a single executor/task.
> groupBy + collect_list seems to do it as expected (collect_list for each group 
> independently).
>  
> At the end it was a data distribution problem caused by one table which was 
> supposed to do a 1-1 join and did a 1-N join (!!)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27254) Cleanup complete but becoming invalid output files in ManifestFileCommitProtocol if job is aborted

2019-09-27 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-27254.
--
Fix Version/s: 3.0.0
 Assignee: Jungtaek Lim
   Resolution: Fixed

> Cleanup complete but becoming invalid output files in 
> ManifestFileCommitProtocol if job is aborted
> --
>
> Key: SPARK-27254
> URL: https://issues.apache.org/jira/browse/SPARK-27254
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.0.0
>
>
> ManifestFileCommitProtocol doesn't clean up completed (but now invalid) 
> output files when a job is aborted.
> ManifestFileCommitProtocol does nothing to clean up when a job is aborted; it 
> only maintains the metadata listing which completed output files were written. 
> SPARK-27210 addressed task-level cleanup, but job-level cleanup is still 
> missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29232) RandomForestRegressionModel does not update the parameter maps of the DecisionTreeRegressionModels underneath

2019-09-27 Thread Jiaqi Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939712#comment-16939712
 ] 

Jiaqi Guo commented on SPARK-29232:
---

[~aman_omer], here is [an example from the Spark 
documentation|https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression].
{code:java}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{RandomForestRegressionModel, 
RandomForestRegressor}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Set maxCategories so features with > 4 distinct values are treated as 
continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestRegressor()
  .setNumTrees(5)
  .setMaxDepth(10)
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")

// Chain indexer and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, rf))

// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")
val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]
println(s"Learned regression forest model:\n ${rfModel.toDebugString}")
{code}
This gives you a random forest model called rfModel. I modified the max depth 
to 10 for the trees. 
{code:java}
rfModel.extractParamMap()
// Printout
res23: org.apache.spark.ml.param.ParamMap = { rfr_8197914ca605-cacheNodeIds: 
false, rfr_8197914ca605-checkpointInterval: 10, 
rfr_8197914ca605-featureSubsetStrategy: auto, rfr_8197914ca605-featuresCol: 
indexedFeatures, rfr_8197914ca605-impurity: variance, 
rfr_8197914ca605-labelCol: label, rfr_8197914ca605-maxBins: 32, 
rfr_8197914ca605-maxDepth: 10, rfr_8197914ca605-maxMemoryInMB: 256, 
rfr_8197914ca605-minInfoGain: 0.0, rfr_8197914ca605-minInstancesPerNode: 1, 
rfr_8197914ca605-numTrees: 5, rfr_8197914ca605-predictionCol: prediction, 
rfr_8197914ca605-seed: 235498149, rfr_8197914ca605-subsamplingRate: 1.0 }
{code}
As you can see, the maxDepth here is correct. However, this is what we get if we 
check the parameter map of one of the trees:
{code:java}
rfModel.trees(0).extractParamMap()
// Printout
res22: org.apache.spark.ml.param.ParamMap = { dtr_bfcfc13f1334-cacheNodeIds: 
false, dtr_bfcfc13f1334-checkpointInterval: 10, dtr_bfcfc13f1334-featuresCol: 
features, dtr_bfcfc13f1334-impurity: variance, dtr_bfcfc13f1334-labelCol: 
label, dtr_bfcfc13f1334-maxBins: 32, dtr_bfcfc13f1334-maxDepth: 5, 
dtr_bfcfc13f1334-maxMemoryInMB: 256, dtr_bfcfc13f1334-minInfoGain: 0.0, 
dtr_bfcfc13f1334-minInstancesPerNode: 1, dtr_bfcfc13f1334-predictionCol: 
prediction, dtr_bfcfc13f1334-seed: 1366634793 }
{code}
The max depth stays at the default value of 5. In fact, the parameter maps of the 
individual trees only report the default decision tree values.
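
A minimal sketch of the mismatch, reusing rfModel from the example above (hedged: the fitted tree's depth may be anything up to 10, but the stale param map still reports the default):
{code:scala}
import org.apache.spark.ml.regression.DecisionTreeRegressionModel

val firstTree: DecisionTreeRegressionModel = rfModel.trees(0)
// Structural depth of the fitted tree, bounded by the forest's maxDepth of 10.
println(firstTree.depth)
// Param value carried by the tree model, which still reports the default of 5.
println(firstTree.getMaxDepth)
{code}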

> RandomForestRegressionModel does not update the parameter maps of the 
> DecisionTreeRegressionModels underneath
> -
>
> Key: SPARK-29232
> URL: https://issues.apache.org/jira/browse/SPARK-29232
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Jiaqi Guo
>Priority: Major
>
> We trained a RandomForestRegressionModel, and tried to access the trees. Even 
> though the DecisionTreeRegressionModel is correctly built with the proper 
> parameters from random forest, the parameter map is not updated, and still 
> contains only the default value. 
> For example, if a RandomForestRegressor was trained with maxDepth of 12, then 
> accessing the tree information, extractParamMap still returns the default 
> values, with maxDepth=5. Calling the depth itself of 
> DecisionTreeRegressionModel returns the correct value of 12 though.
> This creates issues when we want to access each individual tree and build the 
> trees back up for the random forest estimator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (SPARK-26431) Update availableSlots by availableCpus for barrier taskset

2019-09-27 Thread Xingbo Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939667#comment-16939667
 ] 

Xingbo Jiang edited comment on SPARK-26431 at 9/27/19 6:39 PM:
---

Resolved by https://github.com/apache/spark/pull/25946


was (Author: jiangxb1987):
Resolved by https://github.com/apache/spark/pull/23375

> Update availableSlots by availableCpus for barrier taskset
> --
>
> Key: SPARK-26431
> URL: https://issues.apache.org/jira/browse/SPARK-26431
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> availableCpus decreases as tasks are allocated, so we should update 
> availableSlots from availableCpus for a barrier taskset to avoid an unnecessary 
> resourceOffer process.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29263) availableSlots in scheduler can change before being checked by barrier taskset

2019-09-27 Thread Xingbo Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939670#comment-16939670
 ] 

Xingbo Jiang edited comment on SPARK-29263 at 9/27/19 6:38 PM:
---

Resolved by https://github.com/apache/spark/pull/25946


was (Author: jiangxb1987):
Resolved by https://github.com/apache/spark/pull/23375

> availableSlots in scheduler can change before being checked by barrier taskset
> --
>
> Key: SPARK-29263
> URL: https://issues.apache.org/jira/browse/SPARK-29263
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> availableSlots are computed before the loop in resourceOffer, but they change 
> in every iteration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29263) availableSlots in scheduler can change before being checked by barrier taskset

2019-09-27 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang updated SPARK-29263:
-
Target Version/s:   (was: 2.4.0)

> availableSlots in scheduler can change before being checked by barrier taskset
> --
>
> Key: SPARK-29263
> URL: https://issues.apache.org/jira/browse/SPARK-29263
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> availableSlots are computed before the loop in resourceOffer, but they change 
> in every iteration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29263) availableSlots in scheduler can change before being checked by barrier taskset

2019-09-27 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang updated SPARK-29263:
-
Affects Version/s: (was: 3.0.0)
   2.4.0

> availableSlots in scheduler can change before being checked by barrier taskset
> --
>
> Key: SPARK-29263
> URL: https://issues.apache.org/jira/browse/SPARK-29263
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> availableSlots are computed before the loop in resourceOffer, but they change 
> in every iteration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29263) availableSlots in scheduler can change before being checked by barrier taskset

2019-09-27 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-29263.
--
   Fix Version/s: 3.0.0
Target Version/s: 2.4.0
Assignee: Juliusz Sompolski
  Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/23375

> availableSlots in scheduler can change before being checked by barrier taskset
> --
>
> Key: SPARK-29263
> URL: https://issues.apache.org/jira/browse/SPARK-29263
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> availableSlots are computed before the loop in resourceOffer, but they change 
> in every iteration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26431) Update availableSlots by availableCpus for barrier taskset

2019-09-27 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-26431.
--
Fix Version/s: 3.0.0
 Assignee: Juliusz Sompolski
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/23375

> Update availableSlots by availableCpus for barrier taskset
> --
>
> Key: SPARK-26431
> URL: https://issues.apache.org/jira/browse/SPARK-26431
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 3.0.0
>
>
> availableCpus decreases as tasks are allocated, so we should update 
> availableSlots from availableCpus for a barrier taskset to avoid an unnecessary 
> resourceOffer process.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29070) Make SparkLauncher log full spark-submit command line

2019-09-27 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-29070.

Fix Version/s: 3.0.0
 Assignee: Jeff Evans
   Resolution: Fixed

> Make SparkLauncher log full spark-submit command line
> -
>
> Key: SPARK-29070
> URL: https://issues.apache.org/jira/browse/SPARK-29070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Assignee: Jeff Evans
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> {{org.apache.spark.launcher.SparkLauncher}} wraps a {{ProcessBuilder}}, and 
> builds up a full command line to {{spark-submit}} using a builder pattern.  
> When {{startApplication}} is finally called, a full command line is 
> materialized out of all the options, then invoked via the {{ProcessBuilder}}.
> In scenarios where another application is submitting to Spark, it would be 
> extremely useful from a support and debugging standpoint to be able to see 
> the full {{spark-submit}} command that is actually used (so that the same 
> submission can be tested standalone, arguments tweaked, etc.).  Currently, 
> the only way this gets captured is to {{stderr}} if the 
> {{SPARK_PRINT_LAUNCH_COMMAND}} environment variable is set.  This is 
> cumbersome in the context of an application that is wrapping Spark and 
> already using the APIs.
> I propose simply making {{SparkSubmit}} log the full command line it is about 
> to launch, so that clients can see it directly in their log files, rather 
> than having to capture and search through {{stderr}}.
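
For context, a hedged sketch of the workaround described above (the app resource, main class and master below are placeholders, not values from this ticket): the command line is only echoed to the child's stderr when SPARK_PRINT_LAUNCH_COMMAND is set in the launcher's environment.
{code:scala}
import java.util.{HashMap => JHashMap}
import org.apache.spark.launcher.SparkLauncher

// Environment handed to the spark-submit child process.
val env = new JHashMap[String, String]()
env.put("SPARK_PRINT_LAUNCH_COMMAND", "1")  // spark-submit prints its full command to stderr

val handle = new SparkLauncher(env)
  .setAppResource("/path/to/app.jar")  // placeholder
  .setMainClass("com.example.Main")    // placeholder
  .setMaster("yarn")
  .startApplication()
{code}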



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29240) PySpark 2.4 about sql function 'element_at' param 'extraction'

2019-09-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29240.
---
Fix Version/s: 3.0.0
   2.4.5
 Assignee: Hyukjin Kwon
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25950

> PySpark 2.4 about sql function 'element_at' param 'extraction'
> --
>
> Key: SPARK-29240
> URL: https://issues.apache.org/jira/browse/SPARK-29240
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Simon Reon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.4.5, 3.0.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> I was trying to translate {color:#FF}Scala{color} into 
> {color:#FF}Python{color} with {color:#FF}PySpark 2.4.0{color}. The code 
> below aims to extract the value of col '{color:#FF}list{color}' using col 
> '{color:#FF}num{color}' as the index.
>  
> {code:java}
> x = 
> spark.createDataFrame([((1,2,3),1),((4,5,6),2),((7,8,9),3)],['list','num'])
> x.show(){code}
>  
> ||list||num||
> |[1,2,3]|1|
> |[4,5,6]|2|
> |[7,8,9]|3|
> I intended to use the new function '{color:#FF}element_at{color}' in 2.4.0, but it 
> gives an error:
> {code:java}
> x.withColumn('aa',F.element_at('list',x.num.cast('int')))
> {code}
> _TypeError: Column is not iterable_
>  
> In the end, I had to use a {color:#FF}udf{color} to solve this problem.
> But in Scala, it is OK when the second param 
> '{color:#FF}extraction{color}' of func '{color:#FF}element_at{color}' 
> is a col name with int type: 
> {code:java}
> //Scala
> val y = x.withColumn("aa",element_at('list,'num.cast("int")))
> y.show(){code}
> ||list||num|| aa||
> |[1,2,3]|1| 1|
> |[4,5,6] |2 |5 |
> |[7,8,9] |3 |9 |
>  I hope it can be fixed in the latest version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29240) PySpark 2.4 about sql function 'element_at' param 'extraction'

2019-09-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29240:
--
Labels:   (was: beginner easyfix newbie starter)

> PySpark 2.4 about sql function 'element_at' param 'extraction'
> --
>
> Key: SPARK-29240
> URL: https://issues.apache.org/jira/browse/SPARK-29240
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Simon Reon
>Priority: Trivial
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> I was trying to translate {color:#FF}Scala{color} into 
> {color:#FF}python{color} with {color:#FF}PySpark 2.4.0{color} .Codes 
> below aims to extract col '{color:#FF}list{color}' value using col 
> '{color:#FF}num{color}' as index.
>  
> {code:java}
> x = 
> spark.createDataFrame([((1,2,3),1),((4,5,6),2),((7,8,9),3)],['list','num'])
> x.show(){code}
>  
> ||list||num||
> |[1,2,3]|1|
> |[4,5,6]|2|
> |[7,8,9]|3|
> I suppose to use new func '{color:#FF}element_at{color}' in 2.4.0 .But it 
> gives an error:
> {code:java}
> x.withColumn('aa',F.element_at('list',x.num.cast('int')))
> {code}
> _TypeError: Column is not iterable_
>  
> Finally ,I have to use {color:#FF}udf{color} to solve this problem.
> But in Scala ,it is ok when the second param 
> '{color:#FF}extraction{color}' in func '{color:#FF}element_at{color}' 
> is a col name with int type: 
> {code:java}
> //Scala
> val y = x.withColumn("aa",element_at('list,'num.cast("int")))
> y.show(){code}
> ||list||num|| aa||
> |[1,2,3]|1| 1|
> |[4,5,6] |2 |5 |
> |[7,8,9] |3 |9 |
>  I hope it could be fixed in latest version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29240) PySpark 2.4 about sql function 'element_at' param 'extraction'

2019-09-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29240:
--
Issue Type: Bug  (was: Improvement)

> PySpark 2.4 about sql function 'element_at' param 'extraction'
> --
>
> Key: SPARK-29240
> URL: https://issues.apache.org/jira/browse/SPARK-29240
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Simon Reon
>Priority: Trivial
>  Labels: beginner, easyfix, newbie, starter
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> I was trying to translate {color:#FF}Scala{color} into 
> {color:#FF}python{color} with {color:#FF}PySpark 2.4.0{color} .Codes 
> below aims to extract col '{color:#FF}list{color}' value using col 
> '{color:#FF}num{color}' as index.
>  
> {code:java}
> x = 
> spark.createDataFrame([((1,2,3),1),((4,5,6),2),((7,8,9),3)],['list','num'])
> x.show(){code}
>  
> ||list||num||
> |[1,2,3]|1|
> |[4,5,6]|2|
> |[7,8,9]|3|
> I suppose to use new func '{color:#FF}element_at{color}' in 2.4.0 .But it 
> gives an error:
> {code:java}
> x.withColumn('aa',F.element_at('list',x.num.cast('int')))
> {code}
> _TypeError: Column is not iterable_
>  
> Finally ,I have to use {color:#FF}udf{color} to solve this problem.
> But in Scala ,it is ok when the second param 
> '{color:#FF}extraction{color}' in func '{color:#FF}element_at{color}' 
> is a col name with int type: 
> {code:java}
> //Scala
> val y = x.withColumn("aa",element_at('list,'num.cast("int")))
> y.show(){code}
> ||list||num|| aa||
> |[1,2,3]|1| 1|
> |[4,5,6] |2 |5 |
> |[7,8,9] |3 |9 |
>  I hope it could be fixed in latest version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient

2019-09-27 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939583#comment-16939583
 ] 

Dongjoon Hyun commented on SPARK-29245:
---

Just like we missed this last time, there might be more code paths which are 
unchecked. Since this one is the first step to connect to a remote HMS (which is 
also the main use case), it seems that no one has tried to use this 
feature (SPARK-28684) until now.

> CCE during creating HiveMetaStoreClient 
> 
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: CDH 6.3
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> From `master` branch build, when I try to connect to an external HMS, I hit 
> the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException 
> class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; 
> ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 
> 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:200)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("show databases").show
> ++
> |databaseName|
> ++
> |  .  |
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29015) Can't use 'add jar' jar's class as create table serde class on JDK 11

2019-09-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29015.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25775
[https://github.com/apache/spark/pull/25775]

> Can't use 'add jar' jar's class as create table serde class on JDK 11
> -
>
> Key: SPARK-29015
> URL: https://issues.apache.org/jira/browse/SPARK-29015
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.0
>
>
> How to reproduce:
> {code:bash}
> export JAVA_HOME=/usr/lib/jdk-11.0.3
> export PATH=$JAVA_HOME/bin:$PATH
> build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
> export SPARK_PREPEND_CLASSES=true
> sbin/start-thriftserver.sh
> bin/beeline -u jdbc:hive2://localhost:1
> {code}
> {noformat}
> 0: jdbc:hive2://localhost:1> add jar 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
> INFO  : Added 
> [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
>  to class path
> INFO  : Added resources: 
> [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
> +-+
> | result  |
> +-+
> +-+
> No rows selected (0.381 seconds)
> 0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT 
> SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
> +-+
> | Result  |
> +-+
> +-+
> No rows selected (0.613 seconds)
> 0: jdbc:hive2://localhost:1> select * from addJar;
> Error: Error running query: java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe 
> (state=,code=0)
> {noformat}
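
For reference, a hedged sketch of the same reproduction through the Scala API instead of beeline (assuming a Hive-enabled SparkSession named spark on the same JDK 11 build, and that the jar path from the ticket exists locally):
{code:scala}
// Same steps as the beeline session above, driven through spark.sql.
spark.sql("ADD JAR /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar")
spark.sql("CREATE TABLE addJar(key string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'")
spark.sql("SELECT * FROM addJar").show()  // expected to hit the ClassNotFoundException reported above
{code}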



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29015) Can't use 'add jar' jar's class as create table serde class on JDK 11

2019-09-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29015:


Assignee: angerszhu

> Can't use 'add jar' jar's class as create table serde class on JDK 11
> -
>
> Key: SPARK-29015
> URL: https://issues.apache.org/jira/browse/SPARK-29015
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: angerszhu
>Priority: Major
>
> How to reproduce:
> {code:bash}
> export JAVA_HOME=/usr/lib/jdk-11.0.3
> export PATH=$JAVA_HOME/bin:$PATH
> build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver
> export SPARK_PREPEND_CLASSES=true
> sbin/start-thriftserver.sh
> bin/beeline -u jdbc:hive2://localhost:1
> {code}
> {noformat}
> 0: jdbc:hive2://localhost:1> add jar 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
> INFO  : Added 
> [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
>  to class path
> INFO  : Added resources: 
> [/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar]
> +-+
> | result  |
> +-+
> +-+
> No rows selected (0.381 seconds)
> 0: jdbc:hive2://localhost:1> CREATE TABLE addJar(key string) ROW FORMAT 
> SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
> +-+
> | Result  |
> +-+
> +-+
> No rows selected (0.613 seconds)
> 0: jdbc:hive2://localhost:1> select * from addJar;
> Error: Error running query: java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe 
> (state=,code=0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29276) Spark job fails because of timeout to Driver

2019-09-27 Thread Jochen Hebbrecht (Jira)
Jochen Hebbrecht created SPARK-29276:


 Summary: Spark job fails because of timeout to Driver
 Key: SPARK-29276
 URL: https://issues.apache.org/jira/browse/SPARK-29276
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 2.4.2
Reporter: Jochen Hebbrecht


Hi,

I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark job to 
the cluster. The job gets accepted, but the YARN application fails with:

{code}
19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [10 
milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 
13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures 
timed out after [10 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
{code}

It actually goes wrong at this line: 
https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468

Now, I'm 100% sure Spark is OK and there's no bug, but there must be something 
wrong with my setup. I don't understand the code of the ApplicationMaster, so 
could somebody explain to me what it is trying to reach? Where exactly does the 
connection time out? Then at least I can debug it further, because I don't have a 
clue what it is doing :-)

Thanks for any help!
Jochen



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29273) Spark peakExecutionMemory metrics is zero

2019-09-27 Thread huangweiyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939466#comment-16939466
 ] 

huangweiyi commented on SPARK-29273:


I do the same thing as the SHS for replaying the Spark event log. When parsing 
SparkListenerTaskEnd, I print out some metrics values; here is the code snippet:

case taskEnd: SparkListenerTaskEnd => {

   info(s"peakExecutionMemory: ${taskEnd.taskMetrics.peakExecutionMemory}")
   info(s"executorRunTime: ${taskEnd.taskMetrics.executorRunTime}")
   info(s"executorCpuTime: ${taskEnd.taskMetrics.executorCpuTime}")

   ...

}

Here is the output:

19/09/27 21:31:40 INFO SparkFSProcessor: peakExecutionMemory: 0
19/09/27 21:31:40 INFO SparkFSProcessor: executorRunTime: 1253
19/09/27 21:31:40 INFO SparkFSProcessor: executorCpuTime: 924518630

I have added a PR for this issue, please help review, many thanks!

> Spark peakExecutionMemory metrics is zero
> -
>
> Key: SPARK-29273
> URL: https://issues.apache.org/jira/browse/SPARK-29273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> With Spark 2.4.3 in our production environment, I want to get the 
> peakExecutionMemory which is exposed by TaskMetrics, but I always get a 
> zero value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29265) Window+collect_list causing single-task operation

2019-09-27 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939423#comment-16939423
 ] 

Florentino Sainz commented on SPARK-29265:
--

OK, I just realized what's happening: we did have one element with MANY rows 
inside the same group, and when using collect_list over the window we are 
duplicating the list for each row (this was not expected, though...).

I changed it to groupBy and now we only have one row with all the values :).

> Window+collect_list causing single-task operation
> -
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>   
> In the test I can see how all elements of my DF are being collected in a 
> single task.
> Unbounded+unordered Window + collect_list seems to be collecting ALL the 
> dataframe in a single executor/task.
> groupBy + collect_list seems to do it as expected (collect_list for each group 
> independently).
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29265) Window+collect_list causing single-task operation

2019-09-27 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Comment: was deleted

(was: Ok, after quite a bit of research, I had to change my Window to groupBy + 
collect_list

Unbounded+unordered Window + collect_list seems to be collecting ALL the 
dataframe in a single executor/task.

groupBy + collect_list seems to do it as expected (collect_list for each group 
independently).

Is this behaviour expected or it's a bug?)

> Window+collect_list causing single-task operation
> -
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>   
> In the test I can see how all elements of my DF are being collected in a 
> single task.
> Unbounded+unordered Window + collect_list seems to be collecting ALL the 
> dataframe in a single executor/task.
> groupBy + collect_list seems to do it as expected (collect_list for each group 
> independently).
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939414#comment-16939414
 ] 

Yuming Wang commented on SPARK-29274:
-

PostgreSQL need to add explicit type casts:
{code:sql}
postgres=# create table t1 (incdata_id decimal(21,0), v text);
CREATE TABLE
postgres=# create table t2 (incdata_id text, v text);
CREATE TABLE
postgres=# explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
ERROR:  operator does not exist: numeric = text
LINE 1: ...xplain select * from t1 join t2 on (t1.incdata_id = t2.incda...
 ^
HINT:  No operator matches the given name and argument types. You might need to 
add explicit type casts.
{code}
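
For comparison, a hedged workaround sketch on the Spark side (an illustration, not the fix tracked by this ticket): casting the string key to the decimal type keeps the comparison in decimal and avoids the lossy cast to double.
{code:scala}
// Explicit cast on the string side so the join key stays decimal(21,0)
// instead of both sides being coerced to double.
spark.sql("""
  SELECT *
  FROM t1 JOIN t2
    ON t1.incdata_id = CAST(t2.incdata_id AS DECIMAL(21, 0))
""").explain()
{code}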

> Can not coerce decimal type to double type when it's join key
> -
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Pengfei Chang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
>  as double))), 200), true, [id=#104]
> : +- *(1) Filter isnotnull(incdata_id#31)
> :+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double))) ASC NULLS FIRST], false, 0
>+- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
>  as double))), 200), true, [id=#112]
>   +- *(3) Filter isnotnull(incdata_id#33)
>  +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
> {code}
> {code:sql}
> select cast(v1 as double) as v3, cast(v2 as double) as v4,
>   cast(v1 as double) = cast(v2 as double), v1 = v2 
> from (select cast('1001636981212' as decimal(21, 0)) as v1,
>   cast('1001636981213' as decimal(21, 0)) as v2) t;
> 1.00163697E20 1.00163697E20   truefalse
> {code}
>  
> It's a real case in our production:
> !image-2019-09-27-20-20-24-238.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29267) rdd.countApprox should stop when 'timeout'

2019-09-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939393#comment-16939393
 ] 

Hyukjin Kwon commented on SPARK-29267:
--

Do you mean countApprox should fail by its timeout? {{timeout}} is the timeout 
for approximation.

> rdd.countApprox should stop when 'timeout'
> --
>
> Key: SPARK-29267
> URL: https://issues.apache.org/jira/browse/SPARK-29267
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Kangtian
>Priority: Minor
>
> {{The API for approximate counting: org.apache.spark.rdd.RDD#countApprox}}
> +countApprox(timeout: Long, confidence: Double = 0.95)+
>  
> But: 
> when the timeout comes, the job keeps running until it really finishes.
>  
> We want:
> *when the timeout comes, the job should finish{color:#FF} immediately{color}*, 
> without computing the final value.
>  
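
For reference, a minimal sketch of the behaviour described above, assuming an existing SparkContext {{sc}}: the approximate value becomes available once the timeout elapses, but the underlying job keeps running.
{code:scala}
val rdd = sc.parallelize(1 to 1000000, 100)

// Returns a PartialResult after at most 200 ms.
val approx = rdd.countApprox(timeout = 200, confidence = 0.95)
println(approx.initialValue)     // approximate count available at the timeout
println(approx.getFinalValue())  // blocks until the job really finishes
{code}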



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29275) Update the SQL migration guide regarding special date/timestamp values

2019-09-27 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29275:
--

 Summary: Update the SQL migration guide regarding special 
date/timestamp values
 Key: SPARK-29275
 URL: https://issues.apache.org/jira/browse/SPARK-29275
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Need to describe special date and timestamp values added by 
[https://github.com/apache/spark/pull/25716] and 
[https://github.com/apache/spark/pull/25708]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21914) Running examples as tests in SQL builtin function documentation

2019-09-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-21914:


Assignee: Maxim Gekk

> Running examples as tests in SQL builtin function documentation
> ---
>
> Key: SPARK-21914
> URL: https://issues.apache.org/jira/browse/SPARK-21914
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Maxim Gekk
>Priority: Major
>
> It looks we have added many examples in {{ExpressionDescription}} for builtin 
> functions.
> Actually, if I have seen correctly, we have fixed many examples so far in 
> some minor PRs and sometimes require to add the examples as tests sql and 
> golden files.
> As we have formatted examples in {{ExpressionDescription.examples}} - 
> https://github.com/apache/spark/blob/ba327ee54c32b11107793604895bd38559804858/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionDescription.java#L44-L50,
>  and we have `SQLQueryTestSuite`, I think we could run the examples as tests 
> like Python's doctests.
> Rough way I am thinking:
> 1. Loads the example in {{ExpressionDescription}}.
> 2. identify queries by {{>}}.
> 3. identify the rest of them as the results.
> 4. run the examples by reusing {{SQLQueryTestSuite}} if possible.
> 5. compare the output by reusing {{SQLQueryTestSuite}} if possible.
> Advantages of doing this I could think for now:
> - Reduce the number of PRs to fix the examples
> - De-duplicate the test cases that should be added into sql and golden files.
> - Correct documentation with correct examples.
> - Reduce reviewing costs for documentation fix PRs.
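
For illustration, a minimal sketch of steps 1-3 above (the function name and parsing rules are assumptions, not the implementation that was merged):
{code:scala}
import scala.collection.mutable.ArrayBuffer

// Splits an ExpressionDescription.examples string into (query, expectedOutput)
// pairs: lines starting with "> " are queries, the following lines are results.
def splitExamples(examples: String): Seq[(String, String)] = {
  val pairs = ArrayBuffer.empty[(String, StringBuilder)]
  examples.split("\n").map(_.trim).filter(_.nonEmpty).foreach { line =>
    if (line.startsWith("> ")) {
      pairs += ((line.stripPrefix("> "), new StringBuilder))
    } else if (pairs.nonEmpty) {
      pairs.last._2.append(line).append("\n")
    } // text before the first query (e.g. "Examples:") is ignored
  }
  pairs.map { case (query, result) => (query, result.toString.trim) }.toSeq
}
{code}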



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21914) Running examples as tests in SQL builtin function documentation

2019-09-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21914.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25942
[https://github.com/apache/spark/pull/25942]

> Running examples as tests in SQL builtin function documentation
> ---
>
> Key: SPARK-21914
> URL: https://issues.apache.org/jira/browse/SPARK-21914
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> It looks we have added many examples in {{ExpressionDescription}} for builtin 
> functions.
> Actually, if I have seen correctly, we have fixed many examples so far in 
> some minor PRs and sometimes require to add the examples as tests sql and 
> golden files.
> As we have formatted examples in {{ExpressionDescription.examples}} - 
> https://github.com/apache/spark/blob/ba327ee54c32b11107793604895bd38559804858/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionDescription.java#L44-L50,
>  and we have `SQLQueryTestSuite`, I think we could run the examples as tests 
> like Python's doctests.
> Rough way I am thinking:
> 1. Loads the example in {{ExpressionDescription}}.
> 2. identify queries by {{>}}.
> 3. identify the rest of them as the results.
> 4. run the examples by reusing {{SQLQueryTestSuite}} if possible.
> 5. compare the output by reusing {{SQLQueryTestSuite}} if possible.
> Advantages of doing this I could think for now:
> - Reduce the number of PRs to fix the examples
> - De-duplicate the test cases that should be added into sql and golden files.
> - Correct documentation with correct examples.
> - Reduce reviewing costs for documentation fix PRs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-29274:
---

Assignee: Pengfei Chang  (was: Yuming Wang)

> Can not coerce decimal type to double type when it's join key
> -
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Pengfei Chang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
>  as double))), 200), true, [id=#104]
> : +- *(1) Filter isnotnull(incdata_id#31)
> :+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double))) ASC NULLS FIRST], false, 0
>+- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
>  as double))), 200), true, [id=#112]
>   +- *(3) Filter isnotnull(incdata_id#33)
>  +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
> {code}
> {code:sql}
> select cast(v1 as double) as v3, cast(v2 as double) as v4,
>   cast(v1 as double) = cast(v2 as double), v1 = v2 
> from (select cast('1001636981212' as decimal(21, 0)) as v1,
>   cast('1001636981213' as decimal(21, 0)) as v2) t;
> 1.00163697E20 1.00163697E20   truefalse
> {code}
>  
> It's a real case in our production:
> !image-2019-09-27-20-20-24-238.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29274:

Description: 
How to reproduce this issue:
{code:sql}
create table t1 (incdata_id decimal(21,0), v string) using parquet;
create table t2 (incdata_id string, v string) using parquet;

explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
== Physical Plan ==
*(5) SortMergeJoin 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double)))], 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double)))], Inner
:- *(2) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double))) ASC NULLS FIRST], false, 0
:  +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
 as double))), 200), true, [id=#104]
: +- *(1) Filter isnotnull(incdata_id#31)
:+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
+- *(4) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double))) ASC NULLS FIRST], false, 0
   +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
 as double))), 200), true, [id=#112]
  +- *(3) Filter isnotnull(incdata_id#33)
 +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
{code}
{code:sql}
select cast(v1 as double) as v3, cast(v2 as double) as v4,
  cast(v1 as double) = cast(v2 as double), v1 = v2 
from (select cast('1001636981212' as decimal(21, 0)) as v1,
  cast('1001636981213' as decimal(21, 0)) as v2) t;

1.00163697E20   1.00163697E20   truefalse
{code}
 

It's a real case in our production:

!image-2019-09-27-20-20-24-238.png|width=100%!

  was:
How to reproduce this issue:
{code:sql}
create table t1 (incdata_id decimal(21,0), v string) using parquet;
create table t2 (incdata_id string, v string) using parquet;

explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
== Physical Plan ==
*(5) SortMergeJoin 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double)))], 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double)))], Inner
:- *(2) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double))) ASC NULLS FIRST], false, 0
:  +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
 as double))), 200), true, [id=#104]
: +- *(1) Filter isnotnull(incdata_id#31)
:+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
+- *(4) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double))) ASC NULLS FIRST], false, 0
   +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
 as double))), 200), true, [id=#112]
  +- *(3) Filter isnotnull(incdata_id#33)
 +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
{code}
{code:sql}
select cast(v1 as double) as v3, cast(v2 as double) as v4,
  cast(v1 as double) = cast(v2 as double), v1 = v2 
from (select cast('1001636981212' as decimal(21, 0)) as v1,
  cast('1001636981213' as decimal(21, 0)) as v2) t;

1.00163697E20   1.00163697E20   truefalse
{code}
 

It's a real case in our production:

!image-2019-09-27-20-20-24-238.png!


> Can not coerce decimal type to double type when it's join key
> -
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- 

[jira] [Assigned] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-29274:
---

Assignee: Yuming Wang

> Can not coerce decimal type to double type when it's join key
> -
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
>  as double))), 200), true, [id=#104]
> : +- *(1) Filter isnotnull(incdata_id#31)
> :+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double))) ASC NULLS FIRST], false, 0
>+- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
>  as double))), 200), true, [id=#112]
>   +- *(3) Filter isnotnull(incdata_id#33)
>  +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
> {code}
> {code:sql}
> select cast(v1 as double) as v3, cast(v2 as double) as v4, cast(v1 as double) 
> = cast(v2 as double), v1 = v2 from (select cast('1001636981212' as 
> decimal(21, 0)) as v1, cast('1001636981213' as decimal(21, 0)) as v2) 
> t;
> 1.00163697E20 1.00163697E20   true    false
> {code}
>  
> It's a real case in our production environment:
> !image-2019-09-27-20-20-24-238.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29274:

Description: 
How to reproduce this issue:
{code:sql}
create table t1 (incdata_id decimal(21,0), v string) using parquet;
create table t2 (incdata_id string, v string) using parquet;

explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
== Physical Plan ==
*(5) SortMergeJoin 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double)))], 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double)))], Inner
:- *(2) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double))) ASC NULLS FIRST], false, 0
:  +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
 as double))), 200), true, [id=#104]
: +- *(1) Filter isnotnull(incdata_id#31)
:+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
+- *(4) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double))) ASC NULLS FIRST], false, 0
   +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
 as double))), 200), true, [id=#112]
  +- *(3) Filter isnotnull(incdata_id#33)
 +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
{code}
{code:sql}
select cast(v1 as double) as v3, cast(v2 as double) as v4,
  cast(v1 as double) = cast(v2 as double), v1 = v2 
from (select cast('1001636981212' as decimal(21, 0)) as v1,
  cast('1001636981213' as decimal(21, 0)) as v2) t;

1.00163697E20   1.00163697E20   true    false
{code}
 

It's a real case in our production environment:

!image-2019-09-27-20-20-24-238.png!

  was:
How to reproduce this issue:
{code:sql}
create table t1 (incdata_id decimal(21,0), v string) using parquet;
create table t2 (incdata_id string, v string) using parquet;

explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
== Physical Plan ==
*(5) SortMergeJoin 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double)))], 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double)))], Inner
:- *(2) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double))) ASC NULLS FIRST], false, 0
:  +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
 as double))), 200), true, [id=#104]
: +- *(1) Filter isnotnull(incdata_id#31)
:+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
+- *(4) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double))) ASC NULLS FIRST], false, 0
   +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
 as double))), 200), true, [id=#112]
  +- *(3) Filter isnotnull(incdata_id#33)
 +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
{code}
{code:sql}
select cast(v1 as double) as v3, cast(v2 as double) as v4, cast(v1 as double) = 
cast(v2 as double), v1 = v2 from (select cast('1001636981212' as 
decimal(21, 0)) as v1, cast('1001636981213' as decimal(21, 0)) as v2) t;

1.00163697E20   1.00163697E20   true    false
{code}
 

It's a real case in our production environment:

!image-2019-09-27-20-20-24-238.png!


> Can not coerce decimal type to double type when it's join key
> -
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> 

[jira] [Commented] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939384#comment-16939384
 ] 

Yuming Wang commented on SPARK-29274:
-

I'll assign this ticket to [~pfchang], who found this issue.

> Can not coerce decimal type to double type when it's join key
> -
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
>  as double))), 200), true, [id=#104]
> : +- *(1) Filter isnotnull(incdata_id#31)
> :+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double))) ASC NULLS FIRST], false, 0
>+- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
>  as double))), 200), true, [id=#112]
>   +- *(3) Filter isnotnull(incdata_id#33)
>  +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
> {code}
> {code:sql}
> select cast(v1 as double) as v3, cast(v2 as double) as v4, cast(v1 as double) 
> = cast(v2 as double), v1 = v2 from (select cast('1001636981212' as 
> decimal(21, 0)) as v1, cast('1001636981213' as decimal(21, 0)) as v2) 
> t;
> 1.00163697E20 1.00163697E20   true    false
> {code}
>  
> It's a real case in our production environment:
> !image-2019-09-27-20-20-24-238.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29274:

Description: 
How to reproduce this issue:
{code:sql}
create table t1 (incdata_id decimal(21,0), v string) using parquet;
create table t2 (incdata_id string, v string) using parquet;

explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
== Physical Plan ==
*(5) SortMergeJoin 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double)))], 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double)))], Inner
:- *(2) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double))) ASC NULLS FIRST], false, 0
:  +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
 as double))), 200), true, [id=#104]
: +- *(1) Filter isnotnull(incdata_id#31)
:+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
+- *(4) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double))) ASC NULLS FIRST], false, 0
   +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
 as double))), 200), true, [id=#112]
  +- *(3) Filter isnotnull(incdata_id#33)
 +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
{code}
{code:sql}
select cast(v1 as double) as v3, cast(v2 as double) as v4, cast(v1 as double) = 
cast(v2 as double), v1 = v2 from (select cast('1001636981212' as 
decimal(21, 0)) as v1, cast('1001636981213' as decimal(21, 0)) as v2) t;

1.00163697E20   1.00163697E20   true    false
{code}
 

It's a real case in our production environment:

!image-2019-09-27-20-20-24-238.png!

  was:
How to reproduce this issue:
{code:sql}
create table t1 (incdata_id decimal(21,0), v string) using parquet;
create table t2 (incdata_id string, v string) using parquet;

explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
== Physical Plan ==
*(5) SortMergeJoin 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double)))], 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double)))], Inner
:- *(2) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double))) ASC NULLS FIRST], false, 0
:  +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
 as double))), 200), true, [id=#104]
: +- *(1) Filter isnotnull(incdata_id#31)
:+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
+- *(4) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double))) ASC NULLS FIRST], false, 0
   +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
 as double))), 200), true, [id=#112]
  +- *(3) Filter isnotnull(incdata_id#33)
 +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
{code}


{code:sql}
select cast(v1 as double) as v3, cast(v2 as double) as v4, cast(v1 as double) = 
cast(v2 as double), v1 = v2 from (select cast('1001636981212' as 
decimal(21, 0)) as v1, cast('1001636981213' as decimal(21, 0)) as v2) t;
1.00163697E20   1.00163697E20   true    false
{code}




> Can not coerce decimal type to double type when it's join key
> -
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- 

[jira] [Updated] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29274:

Attachment: image-2019-09-27-20-20-24-238.png

> Can not coerce decimal type to double type when it's join key
> -
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
>  as double))), 200), true, [id=#104]
> : +- *(1) Filter isnotnull(incdata_id#31)
> :+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double))) ASC NULLS FIRST], false, 0
>+- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
>  as double))), 200), true, [id=#112]
>   +- *(3) Filter isnotnull(incdata_id#33)
>  +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
> {code}
> {code:sql}
> select cast(v1 as double) as v3, cast(v2 as double) as v4, cast(v1 as double) 
> = cast(v2 as double), v1 = v2 from (select cast('1001636981212' as 
> decimal(21, 0)) as v1, cast('1001636981213' as decimal(21, 0)) as v2) 
> t;
> 1.00163697E20 1.00163697E20   true    false
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window+collect_list causing single-task operation

2019-09-27 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Summary: Window+collect_list causing single-task operation  (was: Window 
orderBy causing full-DF orderBy )

> Window+collect_list causing single-task operation
> -
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program became very slow or crashed / did not finish 
> in time (because, as it is a global orderBy, all the data goes to a single 
> executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window+collect_list causing single-task operation

2019-09-27 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
collect_list($"number").over(myWindow))
{code}
  


 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
collect_list($"number").over(myWindow))
{code}
 

As a user, I would expect either:

1- Error/warning (because trying to sort on one of the columns of the window 
partitionBy)

2- A mostly-useless operation which just orders the rows (which have the same 
value on word) inside each Window but doesn't affect performance too much.

 

Currently what I see:

*When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
performing a global orderBy on the whole DataFrame. Similar to 
dataframe.orderBy("word").*

*In my real environment, my program became very slow or crashed / did not finish in 
time (because, as it is a global orderBy, all the data goes to a single 
executor).*

 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.


> Window+collect_list causing single-task operation
> -
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>   
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29265) Window+collect_list causing single-task operation

2019-09-27 Thread Florentino Sainz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-29265:
-
Description: 
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
collect_list($"number").over(myWindow))
{code}
  

In the test I can see how all elements of my DF are being collected in a single 
task.


Unbounded+unordered Window + collect_list seems to be collecting ALL the 
dataframe in a single executor/task.

groupBy + collect_list seems to do it as expected (collect_list for each group 
independently).
 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.

  was:
Hi,

 

I had this problem in "real" environments and also made a self-contained test ( 
[^Test.scala] attached).

Having this Window definition:
{code:scala}
val myWindow = Window.partitionBy($"word").orderBy("word") 
val filt2 = filtrador.withColumn("avg_Time", 
collect_list($"number").over(myWindow))
{code}
  


 

In the test I can see how all elements of my DF are in a single partition 
(side-effect of the global orderBy) 

 

Full Code showing the error (see how the mapPartitions shows 99 rows in one 
partition) attached in Test.scala, sbt project (src and build.sbt) attached too 
in TestSpark.zip.


> Window+collect_list causing single-task operation
> -
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>   
> In the test I can see how all elements of my DF are being collected in a 
> single task.
> Unbounded+unordered Window + collect_list seems to be collecting ALL the 
> dataframe in a single executor/task.
> groupBy + collect_list seems to do it as expected (collect_list for each group 
> independently).
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29265) Window orderBy causing full-DF orderBy

2019-09-27 Thread Florentino Sainz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939381#comment-16939381
 ] 

Florentino Sainz commented on SPARK-29265:
--

Ok, after quite a bit of research, I had to change my Window to groupBy + 
collect_list

Unbounded+unordered Window + collect_list seems to be collecting ALL the 
dataframe in a single executor/task.

groupBy + collect_list seems to do it as expected (collect_list for each group 
independently).

Is this behaviour expected, or is it a bug?
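A minimal sketch of the two variants described above (assuming a DataFrame df with the word and number columns from the attached test; names are illustrative):
{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list
import spark.implicits._

// Window variant from the report: partitioned and ordered by the same column.
// In the observed behaviour this collects the whole DataFrame into a single task.
val myWindow = Window.partitionBy($"word").orderBy("word")
val viaWindow = df.withColumn("numbers", collect_list($"number").over(myWindow))

// groupBy variant: collect_list is computed per group and the work stays
// distributed across partitions as expected.
val viaGroupBy = df.groupBy($"word").agg(collect_list($"number").as("numbers"))
{code}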

> Window orderBy causing full-DF orderBy 
> ---
>
> Key: SPARK-29265
> URL: https://issues.apache.org/jira/browse/SPARK-29265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Any
>Reporter: Florentino Sainz
>Priority: Minor
>
> Hi,
>  
> I had this problem in "real" environments and also made a self-contained test 
> ( [^Test.scala] attached).
> Having this Window definition:
> {code:scala}
> val myWindow = Window.partitionBy($"word").orderBy("word") 
> val filt2 = filtrador.withColumn("avg_Time", 
> collect_list($"number").over(myWindow))
> {code}
>  
> As a user, I would expect either:
> 1- Error/warning (because trying to sort on one of the columns of the window 
> partitionBy)
> 2- A mostly-useless operation which just orders the rows (which have the same 
> value on word) inside each Window but doesn't affect performance too much.
>  
> Currently what I see:
> *When I use "myWindow" in any DataFrame, somehow that Window.orderBy is 
> performing a global orderBy on the whole DataFrame. Similar to 
> dataframe.orderBy("word").*
> *In my real environment, my program became very slow or crashed / did not finish 
> in time (because, as it is a global orderBy, all the data goes to a single 
> executor).*
>  
> In the test I can see how all elements of my DF are in a single partition 
> (side-effect of the global orderBy) 
>  
> Full Code showing the error (see how the mapPartitions shows 99 rows in one 
> partition) attached in Test.scala, sbt project (src and build.sbt) attached 
> too in TestSpark.zip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29274) Can not coerce decimal type to double type when it's join key

2019-09-27 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-29274:
---

 Summary: Can not coerce decimal type to double type when it's join 
key
 Key: SPARK-29274
 URL: https://issues.apache.org/jira/browse/SPARK-29274
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 2.3.4, 3.0.0
Reporter: Yuming Wang


How to reproduce this issue:
{code:sql}
create table t1 (incdata_id decimal(21,0), v string) using parquet;
create table t2 (incdata_id string, v string) using parquet;

explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
== Physical Plan ==
*(5) SortMergeJoin 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double)))], 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double)))], Inner
:- *(2) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
double))) ASC NULLS FIRST], false, 0
:  +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
 as double))), 200), true, [id=#104]
: +- *(1) Filter isnotnull(incdata_id#31)
:+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
+- *(4) Sort 
[knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
double))) ASC NULLS FIRST], false, 0
   +- Exchange 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
 as double))), 200), true, [id=#112]
  +- *(3) Filter isnotnull(incdata_id#33)
 +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
[incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
{code}


{code:sql}
select cast(v1 as double) as v3, cast(v2 as double) as v4, cast(v1 as double) = 
cast(v2 as double), v1 = v2 from (select cast('1001636981212' as 
decimal(21, 0)) as v1, cast('1001636981213' as decimal(21, 0)) as v2) t;
1.00163697E20   1.00163697E20   true    false
{code}





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29232) RandomForestRegressionModel does not update the parameter maps of the DecisionTreeRegressionModels underneath

2019-09-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29232:
-
Priority: Major  (was: Critical)

> RandomForestRegressionModel does not update the parameter maps of the 
> DecisionTreeRegressionModels underneath
> -
>
> Key: SPARK-29232
> URL: https://issues.apache.org/jira/browse/SPARK-29232
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Jiaqi Guo
>Priority: Major
>
> We trained a RandomForestRegressionModel, and tried to access the trees. Even 
> though the DecisionTreeRegressionModel is correctly built with the proper 
> parameters from random forest, the parameter map is not updated, and still 
> contains only the default value. 
> For example, if a RandomForestRegressor was trained with maxDepth of 12, then 
> accessing the tree information, extractParamMap still returns the default 
> values, with maxDepth=5. Calling the depth itself of 
> DecisionTreeRegressionModel returns the correct value of 12 though.
> This creates issues when we want to access each individual tree and build the 
> trees back up for the random forest estimator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29238) Add newColumn using withColumn to an empty Dataframe

2019-09-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29238.
--
Resolution: Not A Problem

> Add newColumn using withColumn to an empty Dataframe
> 
>
> Key: SPARK-29238
> URL: https://issues.apache.org/jira/browse/SPARK-29238
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: ARUN KINDRA
>Priority: Major
>
> Hi
> I'm trying to add a new column to an empty DF, but I don't see the new column 
> getting added.
> Dataset newDF = sparkSession.emptyDataFrame();
>  Dataset newDf_DateConverted = newDF.withColumn("year", lit("2019"));
>  newDf_DateConverted.show();
> *Output:*
> +---+
> |year|
> +---+
>  +---+
>  
> Basically, I am reading an HBase table, and if there is no data in the table I 
> get an empty JavaPairRDD. I convert that JavaPairRDD to a JavaRDD and then, 
> using a schema, convert it into a DF. Later I need to insert that DF's values 
> into a Hive external partitioned table. But when there is no data in HBase, I am 
> not seeing the partition getting created in HDFS.
> So I tried the above 2 lines of code, where I have an empty DF and try to add the 
> partition column to it; that also doesn't work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29238) Add newColumn using withColumn to an empty Dataframe

2019-09-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939379#comment-16939379
 ] 

Hyukjin Kwon commented on SPARK-29238:
--

Since there are no records, there is nothing for the value to be added to. There 
are 0 records, so there is no record to hold the literal.
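A small sketch of what this means in practice (Scala, illustrative only): withColumn on an empty DataFrame adds the column to the schema, but with zero rows there is no row to carry the literal, while a DataFrame with at least one row does show the value.
{code:scala}
import org.apache.spark.sql.functions.lit

// Empty DataFrame: "year" shows up in the schema, but show() prints no rows.
val emptyWithYear = spark.emptyDataFrame.withColumn("year", lit("2019"))
emptyWithYear.printSchema()   // root |-- year: string
emptyWithYear.show()          // header only, zero rows

// A DataFrame with one row does carry the literal value.
spark.range(1).withColumn("year", lit("2019")).show()
{code}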

> Add newColumn using withColumn to an empty Dataframe
> 
>
> Key: SPARK-29238
> URL: https://issues.apache.org/jira/browse/SPARK-29238
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: ARUN KINDRA
>Priority: Major
>
> Hi
> I'm trying to add a new column to an empty DF, but I don't see the new column 
> getting added.
> Dataset newDF = sparkSession.emptyDataFrame();
>  Dataset newDf_DateConverted = newDF.withColumn("year", lit("2019"));
>  newDf_DateConverted.show();
> *Output:*
> +---+
> |year|
> +---+
>  +---+
>  
> Basically, I am reading an HBase table, and if there is no data in the table I 
> get an empty JavaPairRDD. I convert that JavaPairRDD to a JavaRDD and then, 
> using a schema, convert it into a DF. Later I need to insert that DF's values 
> into a Hive external partitioned table. But when there is no data in HBase, I am 
> not seeing the partition getting created in HDFS.
> So I tried the above 2 lines of code, where I have an empty DF and try to add the 
> partition column to it; that also doesn't work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29262) DataFrameWriter insertIntoPartition function

2019-09-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939371#comment-16939371
 ] 

Hyukjin Kwon commented on SPARK-29262:
--

[~hzfeiwang] please be clear about what this JIRA means. What's 
insertIntoPartition, and why do we need it?
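For context, a sketch of the closest path that exists today, going through Spark SQL rather than DataFrameWriter (the table and column names are hypothetical); the ticket presumably asks for a DataFrameWriter-level equivalent:
{code:scala}
// Hypothetical example: insert into one static partition of a partitioned table.
// `target_table` is assumed to be partitioned by `dt`; the SELECT list supplies
// only the non-partition columns.
df.createOrReplaceTempView("staging")
spark.sql(
  """
    |INSERT OVERWRITE TABLE target_table PARTITION (dt = '2019-09-27')
    |SELECT id, value FROM staging
  """.stripMargin)
{code}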

> DataFrameWriter insertIntoPartition function
> 
>
> Key: SPARK-29262
> URL: https://issues.apache.org/jira/browse/SPARK-29262
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Minor
>
> Do we have a plan to support an insertIntoPartition function for DataFrameWriter?
> [~cloud_fan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29272) dataframe.write.format("libsvm").save() take too much time

2019-09-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29272.
--
Resolution: Invalid

> dataframe.write.format("libsvm").save() take too much time
> --
>
> Key: SPARK-29272
> URL: https://issues.apache.org/jira/browse/SPARK-29272
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: 张焕明
>Priority: Major
>
> I have a PySpark dataframe with about 10 thousand records. Using the PySpark API 
> to dump the whole dataset takes 10 seconds. When I use the filter API to select 
> 10 records and dump the temp_df again, it takes 8 seconds. Why does it take so 
> much time? How can I improve it? Thank you!
> MLUtils.convertVectorColumnsToML(dataframe).write.format("libsvm").save('path', 
>  mode='overwrite')
> temp_df = dataframe.filter(train_df['__index'].between(0, 10))



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29272) dataframe.write.format("libsvm").save() take too much time

2019-09-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939367#comment-16939367
 ] 

Hyukjin Kwon commented on SPARK-29272:
--

Questions should go to the mailing list or Stack Overflow; you will be able to get 
a better answer there. 

> dataframe.write.format("libsvm").save() take too much time
> --
>
> Key: SPARK-29272
> URL: https://issues.apache.org/jira/browse/SPARK-29272
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: 张焕明
>Priority: Major
>
> I have a PySpark dataframe with about 10 thousand records. Using the PySpark API 
> to dump the whole dataset takes 10 seconds. When I use the filter API to select 
> 10 records and dump the temp_df again, it takes 8 seconds. Why does it take so 
> much time? How can I improve it? Thank you!
> MLUtils.convertVectorColumnsToML(dataframe).write.format("libsvm").save('path', 
>  mode='overwrite')
> temp_df = dataframe.filter(train_df['__index'].between(0, 10))



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29273) Spark peakExecutionMemory metrics is zero

2019-09-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939365#comment-16939365
 ] 

Hyukjin Kwon commented on SPARK-29273:
--

Can you show a reproducer with the input and the current / expected output?

> Spark peakExecutionMemory metrics is zero
> -
>
> Key: SPARK-29273
> URL: https://issues.apache.org/jira/browse/SPARK-29273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
>
> With Spark 2.4.3 in our production environment, I want to get the 
> peakExecutionMemory which is exposed by TaskMetrics, but I always get a zero 
> value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29273) Spark peakExecutionMemory metrics is zero

2019-09-27 Thread huangweiyi (Jira)
huangweiyi created SPARK-29273:
--

 Summary: Spark peakExecutionMemory metrics is zero
 Key: SPARK-29273
 URL: https://issues.apache.org/jira/browse/SPARK-29273
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.3
 Environment: hadoop 2.7.3

spark 2.4.3

jdk 1.8.0_60
Reporter: huangweiyi


With Spark 2.4.3 in our production environment, I want to get the 
peakExecutionMemory which is exposed by TaskMetrics, but I always get a zero 
value.
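One way to read the metric, as a sketch (assuming a SparkListener is acceptable); the report says the peakExecutionMemory exposed by TaskMetrics comes back as zero:
{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Print peakExecutionMemory from TaskMetrics as each task finishes.
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      println(s"stage ${taskEnd.stageId}: peakExecutionMemory=${metrics.peakExecutionMemory}")
    }
  }
})
{code}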



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29242) Check results of expression examples automatically

2019-09-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29242.
--
Resolution: Duplicate

> Check results of expression examples automatically
> --
>
> Key: SPARK-29242
> URL: https://issues.apache.org/jira/browse/SPARK-29242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Expression examples demonstrate how to use the associated functions and show 
> expected results. We need to write a test which executes the examples and 
> compares actual and expected results. For example: 
> https://github.com/apache/spark/blob/051e691029c456fc2db5f229485d3fb8f5d0e84c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2038-L2043



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29253) Add agg(Spark, Spark*) to SQL Dataset

2019-09-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29253.
--
Resolution: Won't Fix

> Add agg(Spark, Spark*) to SQL Dataset
> -
>
> Key: SPARK-29253
> URL: https://issues.apache.org/jira/browse/SPARK-29253
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
>
> agg() can already be used with String arguments.
> Add agg(Spark, Spark*) and its test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29055) Memory leak in Spark

2019-09-27 Thread George Papa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Papa updated SPARK-29055:

Description: 
I used Spark 2.1.1 and upgraded to newer versions. After Spark version 2.3.3, 
I observed from the Spark UI that the driver memory is{color:#ff} increasing 
continuously.{color}

In more detail, the driver memory and executors' memory show the same used 
storage memory, and after each iteration the storage memory increases. You 
can reproduce this behavior by running the following snippet. The 
following example is very simple, without any dataframe persistence, but the 
memory consumption is not stable as it was in earlier Spark versions 
(specifically, up to Spark 2.3.2).

Also, I tested with the Spark Streaming and Structured Streaming APIs and saw the 
same behavior. I tested with an existing application which reads from a Kafka 
source, does some aggregations, persists dataframes and then unpersists them. 
Persist and unpersist work correctly: I see the dataframes in the Storage 
tab in the Spark UI, and after the unpersist all dataframes are removed. But after 
the unpersist the executors' memory is not zero; it has the same value as the 
driver memory. This behavior also affects application performance, because the 
executors' memory increases as the driver's does, and after a while the persisted 
dataframes no longer fit in the executors' memory and I get spill to disk.

Another error which I had after a long run was 
{color:#ff}java.lang.OutOfMemoryError: GC overhead limit exceeded, but I 
don't know whether it is related to the above behavior or not.{color}

 

*HOW TO REPRODUCE THIS BEHAVIOR:*

Create a very simple application (streaming count_file.py) to reproduce 
this behavior. This application reads CSV files from a directory, counts the 
rows and then removes the processed files.
{code:java}
import time
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

target_dir = "..."

spark=SparkSession.builder.appName("DataframeCount").getOrCreate()

while True:
for f in os.listdir(target_dir):
df = spark.read.load(target_dir + f, format="csv")
print("Number of records: {0}".format(df.count()))
time.sleep(15){code}
Submit code:
{code:java}
spark-submit 
--master spark://xxx.xxx.xx.xxx
--deploy-mode client
--executor-memory 4g
--executor-cores 3
streaming count_file.py
{code}
 

*TESTED CASES WITH THE SAME BEHAVIOUR:*
 * I tested with default settings (spark-defaults.conf)
 * Add spark.cleaner.periodicGC.interval 1min (or less)
 * {{Turn spark.cleaner.referenceTracking.blocking}}=false
 * Run the application in cluster mode
 * Increase/decrease the resources of the executors and driver
 * I tested with extraJavaOptions in driver and executor -XX:+UseG1GC 
-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
  

*DEPENDENCIES*
 * Operation system: Ubuntu 16.04.3 LTS
 * Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
 * Python: Python 2.7.12

 

*NOTE:* In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was 
extremely low and after the run of ContextCleaner and BlockManager the memory 
was decreasing.

  was:
I used Spark 2.1.1 and upgraded to the latest version 2.4.4. I observed 
from the Spark UI that the driver memory is{color:#ff} increasing 
continuously{color} and after a long run I had the following error: 
{color:#ff}java.lang.OutOfMemoryError: GC overhead limit exceeded{color}

In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was extremely 
low and after the run of ContextCleaner and BlockManager the memory was 
decreasing.

Also, I tested the Spark versions 2.3.3, 2.4.3 and I had the same behavior.

 

*HOW TO REPRODUCE THIS BEHAVIOR:*

Create a very simple application (streaming count_file.py) to reproduce 
this behavior. This application reads CSV files from a directory, counts the 
rows and then removes the processed files.
{code:java}
import time
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

target_dir = "..."

spark=SparkSession.builder.appName("DataframeCount").getOrCreate()

while True:
for f in os.listdir(target_dir):
df = spark.read.load(target_dir + f, format="csv")
print("Number of records: {0}".format(df.count()))
time.sleep(15){code}
Submit code:
{code:java}
spark-submit 
--master spark://xxx.xxx.xx.xxx
--deploy-mode client
--executor-memory 4g
--executor-cores 3
streaming count_file.py
{code}
 

*TESTED CASES WITH THE SAME BEHAVIOUR:*
 * I tested with default settings (spark-defaults.conf)
 * Add spark.cleaner.periodicGC.interval 1min (or less)
 * {{Turn spark.cleaner.referenceTracking.blocking}}=false
 * Run the application in cluster mode
 * Increase/decrease the resources 

[jira] [Commented] (SPARK-29232) RandomForestRegressionModel does not update the parameter maps of the DecisionTreeRegressionModels underneath

2019-09-27 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939293#comment-16939293
 ] 

Aman Omer commented on SPARK-29232:
---

I used some examples for RF regression but can't use model.extractParamMap(). I 
am stuck here.

[~jiaqig] [~shahid], any lead on reproducing this bug would be helpful. 

Thanks

> RandomForestRegressionModel does not update the parameter maps of the 
> DecisionTreeRegressionModels underneath
> -
>
> Key: SPARK-29232
> URL: https://issues.apache.org/jira/browse/SPARK-29232
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Jiaqi Guo
>Priority: Critical
>
> We trained a RandomForestRegressionModel, and tried to access the trees. Even 
> though the DecisionTreeRegressionModel is correctly built with the proper 
> parameters from random forest, the parameter map is not updated, and still 
> contains only the default value. 
> For example, if a RandomForestRegressor was trained with maxDepth of 12, then 
> accessing the tree information, extractParamMap still returns the default 
> values, with maxDepth=5. Calling the depth itself of 
> DecisionTreeRegressionModel returns the correct value of 12 though.
> This creates issues when we want to access each individual tree and build the 
> trees back up for the random forest estimator.
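A hypothetical repro sketch along the lines of the description above (trainingData, its feature/label columns, and the tree count are assumptions):
{code:scala}
import org.apache.spark.ml.regression.RandomForestRegressor

// Train with a non-default maxDepth, then compare an individual tree's param map
// with the tree's actual depth.
val rf = new RandomForestRegressor()
  .setMaxDepth(12)
  .setNumTrees(5)
val model = rf.fit(trainingData)  // trainingData: DataFrame with "features"/"label"

val firstTree = model.trees.head
println(firstTree.extractParamMap())  // per the report: still shows the default maxDepth = 5
println(firstTree.depth)              // per the report: reflects the real trained depth
{code}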



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29152) Spark Executor Plugin API shutdown is not proper when dynamic allocation enabled[SPARK-24918]

2019-09-27 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-29152:
---
Description: 
*Issue Description*

Spark Executor Plugin API *shutdown handling is not proper* when dynamic 
allocation is enabled. The plugin's shutdown method is not invoked when dynamic 
allocation is enabled and *executors become dead* after the inactive timeout.

*Test Precondition*
1. Create a plugin and make a jar named SparkExecutorplugin.jar

{code:java}
import org.apache.spark.ExecutorPlugin;

public class ExecutoTest1 implements ExecutorPlugin {
    public void init() {
        System.out.println("Executor Plugin Initialised.");
    }

    public void shutdown() {
        System.out.println("Executor plugin closed successfully.");
    }
}
{code}

2. Create the jar with this class and put it in the folder /spark/examples/jars

*Test Steps*

1. launch bin/spark-sql with dynamic allocation enabled

./spark-sql --master yarn --conf spark.executor.plugins=ExecutoTest1  --jars 
/opt/HA/C10/install/spark/spark/examples/jars/SparkExecutorPlugin.jar --conf 
spark.dynamicAllocation.enabled=true --conf 
spark.dynamicAllocation.initialExecutors=2 --conf 
spark.dynamicAllocation.minExecutors=1

2. Create a table, insert the data and run select * from tablename.
3. Check the Spark UI Jobs tab / SQL tab.
4. Check each executor's application log file (the Executors tab lists all 
executors) for the executor plugin initialization and shutdown messages or 
operations.
Example: 
/yarn/logdir/application_1567156749079_0025/container_e02_1567156749079_0025_01_05/
 stdout

5. Wait for the executor to become dead after the inactive time and check the same 
container log.
6. Kill the spark-sql session and check the container log for the executor plugin shutdown.

*Expect Output*

1. The job should succeed; the create table, insert and select queries should 
succeed.

2. While the query is running, every executor's log should contain the executor 
plugin init message:
"Executor Plugin Initialised."

3. Once the executors are dead, the shutdown message should be in the log file:
"Executor plugin closed successfully."

4. Once the SQL application is closed, the shutdown message should be in the log:
"Executor plugin closed successfully."


*Actual Output*

The shutdown message is not logged when an executor becomes dead after the 
inactive time.

*Observation*
Without dynamic allocation the executor plugin works fine. But after enabling 
dynamic allocation, executor shutdown is not processed.




  was:
*Issue Description*

Spark Executor Plugin API *shutdown handling is not proper*, when dynamic 
allocation enabled .Plugin shutdown method is not processed-while dynamic 
allocation is enabled and *executors become dead* after inactive time.

*Test Precondition*
1.Prepared 4 spark applications with executor plugin interface.
First application-SparkExecutorplugin.jar

{code:java}
import org.apache.spark.ExecutorPlugin;

public class ExecutoTest1 implements ExecutorPlugin {
    public void init() {
        System.out.println("Executor Plugin Initialised.");
    }

    public void shutdown() {
        System.out.println("Executor plugin closed successfully.");
    }
}
{code}



2. Create the  jars with the same and put it in folder /spark/examples/jars

*Test Steps*

1. launch bin/spark-sql with dynamic allocation enabled

./spark-sql --master yarn --conf spark.executor.plugins=ExecutoTest1  
--conf="spark.executor.extraClassPath=/opt/HA/C10/install/spark/spark/examples/jars/*"
 --conf spark.dynamicAllocation.enabled=true --conf 
spark.dynamicAllocation.initialExecutors=2 --conf 
spark.dynamicAllocation.minExecutors=0

2 create a table , insert the data and select * from tablename
3.Check the spark UI Jobs tab/SQL tab
4. Check all Executors(executor tab will give all executors details) 
application log file for Executor plugin Initialization and Shutdown messages 
or operations.
Example 
/yarn/logdir/application_1567156749079_0025/container_e02_1567156749079_0025_01_05/
 stdout

5. Wait for the executor to be dead after the inactive time and check the same 
container log 
6. Kill the spark sql and check the container log  for executor plugin shutdown.

*Expect Output*

1. Job should be success. Create table ,insert and select query should be 
success.

2.While running query All Executors  log should contain the executor plugin 
Init and shutdown messages or operations.
"Executor Plugin Initialised.

3.Once the executors are dead ,executor  shutdown should call shutdown message 
should be there in log file.
“ Executor plugin closed successfully.

4.Once the sql application closed ,executor  shutdown should call shutdown 
message should be there in log.
“ Executor plugin closed successfully". 


*Actual Output*

Shutdown message is not called when executor is dead or after closing the 
application

*Observation*
Without dynamic allocation Executor plugin is working fine.But after enabling 
dynamic allocation,Executor shutdown 

[jira] [Updated] (SPARK-29055) Memory leak in Spark

2019-09-27 Thread George Papa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Papa updated SPARK-29055:

Affects Version/s: (was: 2.4.4)
   (was: 2.4.3)

> Memory leak in Spark
> 
>
> Key: SPARK-29055
> URL: https://issues.apache.org/jira/browse/SPARK-29055
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.3
>Reporter: George Papa
>Priority: Major
> Attachments: test_csvs.zip
>
>
> I used Spark 2.1.1 and upgraded to the latest version 2.4.4. I observed 
> from the Spark UI that the driver memory is{color:#ff} increasing 
> continuously{color} and after a long run I had the following error: 
> {color:#ff}java.lang.OutOfMemoryError: GC overhead limit exceeded{color}
> In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was 
> extremely low and after the run of ContextCleaner and BlockManager the memory 
> was decreasing.
> Also, I tested the Spark versions 2.3.3, 2.4.3 and I had the same behavior.
>  
> *HOW TO REPRODUCE THIS BEHAVIOR:*
> Create a very simple application(streaming count_file.py) in order to 
> reproduce this behavior. This application reads CSV files from a directory, 
> count the rows and then remove the processed files.
> {code:java}
> import time
> import os
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from pyspark.sql import types as T
> target_dir = "..."
> spark=SparkSession.builder.appName("DataframeCount").getOrCreate()
> while True:
> for f in os.listdir(target_dir):
> df = spark.read.load(target_dir + f, format="csv")
> print("Number of records: {0}".format(df.count()))
> time.sleep(15){code}
> Submit code:
> {code:java}
> spark-submit 
> --master spark://xxx.xxx.xx.xxx
> --deploy-mode client
> --executor-memory 4g
> --executor-cores 3
> streaming count_file.py
> {code}
>  
> *TESTED CASES WITH THE SAME BEHAVIOUR:*
>  * I tested with default settings (spark-defaults.conf)
>  * Add spark.cleaner.periodicGC.interval 1min (or less)
>  * {{Turn spark.cleaner.referenceTracking.blocking}}=false
>  * Run the application in cluster mode
>  * Increase/decrease the resources of the executors and driver
>  * I tested with extraJavaOptions in driver and executor -XX:+UseG1GC 
> -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
>   
> *DEPENDENCIES*
>  * Operation system: Ubuntu 16.04.3 LTS
>  * Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
>  * Python: Python 2.7.12



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29055) Memory leak in Spark

2019-09-27 Thread George Papa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Papa updated SPARK-29055:

Summary: Memory leak in Spark  (was: Memory leak in Spark Driver)

> Memory leak in Spark
> 
>
> Key: SPARK-29055
> URL: https://issues.apache.org/jira/browse/SPARK-29055
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.3, 2.4.3, 2.4.4
>Reporter: George Papa
>Priority: Major
> Attachments: test_csvs.zip
>
>
> I used Spark 2.1.1 and upgraded to the latest version, 2.4.4. I observed from 
> the Spark UI that the driver memory keeps increasing continuously, and after a 
> long run I got the following error: java.lang.OutOfMemoryError: GC overhead 
> limit exceeded
> In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was 
> extremely low, and after the ContextCleaner and BlockManager ran, the memory 
> decreased.
> I also tested Spark versions 2.3.3 and 2.4.3 and saw the same behavior.
>  
> *HOW TO REPRODUCE THIS BEHAVIOR:*
> Create a very simple application (streaming count_file.py) to reproduce this 
> behavior. The application reads CSV files from a directory, counts the rows, 
> and then removes the processed files.
> {code:java}
> import time
> import os
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from pyspark.sql import types as T
> target_dir = "..."
> spark=SparkSession.builder.appName("DataframeCount").getOrCreate()
> while True:
>     for f in os.listdir(target_dir):
>         df = spark.read.load(target_dir + f, format="csv")
>         print("Number of records: {0}".format(df.count()))
>     time.sleep(15){code}
> Submit code:
> {code:java}
> spark-submit \
> --master spark://xxx.xxx.xx.xxx \
> --deploy-mode client \
> --executor-memory 4g \
> --executor-cores 3 \
> streaming count_file.py
> {code}
>  
> *TESTED CASES WITH THE SAME BEHAVIOUR:*
>  * I tested with default settings (spark-defaults.conf)
>  * Add spark.cleaner.periodicGC.interval 1min (or less)
>  * Set {{spark.cleaner.referenceTracking.blocking}}=false
>  * Run the application in cluster mode
>  * Increase/decrease the resources of the executors and driver
>  * I tested with extraJavaOptions in driver and executor -XX:+UseG1GC 
> -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
>   
> *DEPENDENCIES*
>  * Operation system: Ubuntu 16.04.3 LTS
>  * Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
>  * Python: Python 2.7.12



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29272) dataframe.write.format("libsvm").save() take too much time

2019-09-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-29272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

张焕明 updated SPARK-29272:

Description: 
I have a pyspark dataframe with about 10 thousand records. Dumping the whole 
dataset with the pyspark API takes 10 seconds, and when I use the filter API to 
select only 10 records and dump temp_df again, it still takes 8 seconds. Why 
does it take so much time? How can I improve it? Thank you!

MLUtils.convertVectorColumnsToML(dataframe).write.format("libsvm").save('path', mode='overwrite')

temp_df = dataframe.filter(train_df['__index'].between(0, 10))

  was:
I have a pyspark dataframe; this is a small-data test of only 3 MB. Calling

MLUtils.convertVectorColumnsToML(dataframe).write.format("libsvm").save('path', mode='overwrite')

takes 10 seconds, and when I use the filter function to select only 10 rows and 
then save

temp_df = dataframe.filter(train_df['__index'].between(0, 10))

it still takes 8 seconds. Why is it so slow, and how should I improve it?


> dataframe.write.format("libsvm").save() take too much time
> --
>
> Key: SPARK-29272
> URL: https://issues.apache.org/jira/browse/SPARK-29272
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: 张焕明
>Priority: Major
>
> I have a pyspark dataframe with about 10 thousand records. Dumping the whole 
> dataset with the pyspark API takes 10 seconds, and when I use the filter API 
> to select only 10 records and dump temp_df again, it still takes 8 seconds. 
> Why does it take so much time? How can I improve it? Thank you!
> MLUtils.convertVectorColumnsToML(dataframe).write.format("libsvm").save('path', mode='overwrite')
> temp_df = dataframe.filter(train_df['__index'].between(0, 10))



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29272) dataframe.write.format("libsvm").save() take too much time

2019-09-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-29272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

张焕明 updated SPARK-29272:

Summary: dataframe.write.format("libsvm").save() take too much time  (was: 
dataframe.write.format("libsvm").save() takes too long to save)

> dataframe.write.format("libsvm").save() take too much time
> --
>
> Key: SPARK-29272
> URL: https://issues.apache.org/jira/browse/SPARK-29272
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: 张焕明
>Priority: Major
>
> I have a pyspark dataframe; this is a small-data test of only 3 MB. Calling
> MLUtils.convertVectorColumnsToML(dataframe).write.format("libsvm").save('path', mode='overwrite')
> takes 10 seconds, and when I use the filter function to select only 10 rows 
> and then save
> temp_df = dataframe.filter(train_df['__index'].between(0, 10))
> it still takes 8 seconds. Why is it so slow, and how should I improve it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29272) dataframe.write.format("libsvm").save() takes too long to save

2019-09-27 Thread Jira
张焕明 created SPARK-29272:
---

 Summary: dataframe.write.format("libsvm").save() takes too long to save
 Key: SPARK-29272
 URL: https://issues.apache.org/jira/browse/SPARK-29272
 Project: Spark
  Issue Type: Question
  Components: ML
Affects Versions: 2.2.0
Reporter: 张焕明


I have a pyspark dataframe; this is a small-data test of only 3 MB. Calling

MLUtils.convertVectorColumnsToML(dataframe).write.format("libsvm").save('path', mode='overwrite')

takes 10 seconds, and when I use the filter function to select only 10 rows and 
then save

temp_df = dataframe.filter(train_df['__index'].between(0, 10))

it still takes 8 seconds. Why is it so slow, and how should I improve it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29213) Make it consistent when get notnull output and generate null checks in FilterExec

2019-09-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29213:
---

Assignee: Wang Shuo

> Make it consistent when get notnull output and generate null checks in 
> FilterExec
> -
>
> Key: SPARK-29213
> URL: https://issues.apache.org/jira/browse/SPARK-29213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Wang Shuo
>Assignee: Wang Shuo
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Currently FilterExec computes its not-null output and generates null checks in 
> two different ways, so a nullable attribute can be treated as not nullable by 
> mistake.
> In FilterExec.output, an attribute is marked as nullable or not by looking up 
> its `exprId` in notNullAttributes:
> {code:java}
> a.nullable && notNullAttributes.contains(a.exprId)
> {code}
> But in FilterExec.doConsume, whether a `nullCheck` is generated for an 
> attribute is decided by whether there is a semantically equal not-null 
> predicate:
> {code:java}
> val nullChecks = c.references.map { r =>
>   val idx = notNullPreds.indexWhere { n =>
>     n.asInstanceOf[IsNotNull].child.semanticEquals(r)
>   }
>   if (idx != -1 && !generatedIsNotNullChecks(idx)) {
>     generatedIsNotNullChecks(idx) = true
>     // Use the child's output. The nullability is what the child produced.
>     genPredicate(notNullPreds(idx), input, child.output)
>   } else {
>     ""
>   }
> }.mkString("\n").trim
> {code}
> When the IsNotNull predicate's child is an expression over the attribute (for 
> example a Cast of it) rather than the attribute itself, the first check still 
> marks the attribute as not nullable while the second generates no null check, 
> hence the mismatch.
>  
> An NPE happens when running the SQL below:
> {code:java}
> sql("create table table1(x string)")
> sql("create table table2(x bigint)")
> sql("create table table3(x string)")
> sql("insert into table2 select null as x")
> sql(
>   """
> |select t1.x
> |from (
> |select x from table1) t1
> |left join (
> |select x from (
> |select x from table2
> |union all
> |select substr(x,5) x from table3
> |) a
> |where length(x)>0
> |) t3
> |on t1.x=t3.x
>   """.stripMargin).collect()
> {code}
>  
> NPE Exception:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(generated.java:40)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:135)
> at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> at org.apache.spark.scheduler.Task.run(Task.scala:127)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:449)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:452)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> the generated code:
> {code:java}
> == Subtree 4 / 5 ==
> *(2) Project [cast(x#7L as string) AS x#9]
> +- *(2) Filter ((length(cast(x#7L as string)) > 0) AND isnotnull(cast(x#7L as 
> string)))
>    +- Scan hive default.table2 [x#7L], HiveTableRelation `default`.`table2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [x#7L]
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage2(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ // codegenStageId=2
> /* 006 */ final class GeneratedIteratorForCodegenStage2 extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 007 */   private Object[] references;
> /* 008 */   private scala.collection.Iterator[] inputs;
> /* 009 */   private scala.collection.Iterator inputadapter_input_0;
> /* 010 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] 
> filter_mutableStateArray_0 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2];
> /* 011 */
> /* 012 */   public GeneratedIteratorForCodegenStage2(Object[] references) {
> /* 013 */ this.references = references;
> /* 014 */   }
> /* 015 */
> /* 016 */   public void init(int index, scala.collection.Iterator[] inputs) {
> /* 017 */ 

[jira] [Resolved] (SPARK-29213) Make it consistent when get notnull output and generate null checks in FilterExec

2019-09-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29213.
-
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 25902
[https://github.com/apache/spark/pull/25902]

> Make it consistent when get notnull output and generate null checks in 
> FilterExec
> -
>
> Key: SPARK-29213
> URL: https://issues.apache.org/jira/browse/SPARK-29213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Wang Shuo
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Currently FilterExec computes its not-null output and generates null checks in 
> two different ways, so a nullable attribute can be treated as not nullable by 
> mistake.
> In FilterExec.output, an attribute is marked as nullable or not by looking up 
> its `exprId` in notNullAttributes:
> {code:java}
> a.nullable && notNullAttributes.contains(a.exprId)
> {code}
> But in FilterExec.doConsume, whether a `nullCheck` is generated for an 
> attribute is decided by whether there is a semantically equal not-null 
> predicate:
> {code:java}
> val nullChecks = c.references.map { r =>
>   val idx = notNullPreds.indexWhere { n =>
>     n.asInstanceOf[IsNotNull].child.semanticEquals(r)
>   }
>   if (idx != -1 && !generatedIsNotNullChecks(idx)) {
>     generatedIsNotNullChecks(idx) = true
>     // Use the child's output. The nullability is what the child produced.
>     genPredicate(notNullPreds(idx), input, child.output)
>   } else {
>     ""
>   }
> }.mkString("\n").trim
> {code}
> When the IsNotNull predicate's child is an expression over the attribute (for 
> example a Cast of it) rather than the attribute itself, the first check still 
> marks the attribute as not nullable while the second generates no null check, 
> hence the mismatch.
>  
> An NPE happens when running the SQL below:
> {code:java}
> sql("create table table1(x string)")
> sql("create table table2(x bigint)")
> sql("create table table3(x string)")
> sql("insert into table2 select null as x")
> sql(
>   """
> |select t1.x
> |from (
> |select x from table1) t1
> |left join (
> |select x from (
> |select x from table2
> |union all
> |select substr(x,5) x from table3
> |) a
> |where length(x)>0
> |) t3
> |on t1.x=t3.x
>   """.stripMargin).collect()
> {code}
>  
> NPE Exception:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(generated.java:40)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:135)
> at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> at org.apache.spark.scheduler.Task.run(Task.scala:127)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:449)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:452)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> the generated code:
> {code:java}
> == Subtree 4 / 5 ==
> *(2) Project [cast(x#7L as string) AS x#9]
> +- *(2) Filter ((length(cast(x#7L as string)) > 0) AND isnotnull(cast(x#7L as 
> string)))
>    +- Scan hive default.table2 [x#7L], HiveTableRelation `default`.`table2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [x#7L]
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage2(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ // codegenStageId=2
> /* 006 */ final class GeneratedIteratorForCodegenStage2 extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 007 */   private Object[] references;
> /* 008 */   private scala.collection.Iterator[] inputs;
> /* 009 */   private scala.collection.Iterator inputadapter_input_0;
> /* 010 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] 
> filter_mutableStateArray_0 = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2];
> /* 011 */
> /* 012 */   public GeneratedIteratorForCodegenStage2(Object[] references) {
> /* 013 */ this.references = references;
> /* 014 */   }
> /* 015 */

[jira] [Resolved] (SPARK-29270) Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then timeout

2019-09-27 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-29270.
---
Resolution: Duplicate

> Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then 
> timeout
> ---
>
> Key: SPARK-29270
> URL: https://issues.apache.org/jira/browse/SPARK-29270
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Reproduce in master, run this UT
> !image-2019-09-27-14-26-31-226.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29257) All Task attempts scheduled to the same executor inevitably access the same bad disk

2019-09-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29257.
-
Resolution: Not A Problem

> All Task attempts scheduled to the same executor inevitably access the same 
> bad disk
> 
>
> Key: SPARK-29257
> URL: https://issues.apache.org/jira/browse/SPARK-29257
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Kent Yao
>Priority: Major
> Attachments: image-2019-09-26-16-44-48-554.png
>
>
> We have an HDFS/YARN cluster with about 2k~3k nodes; each node has 8T * 12 
> local disks for storage and shuffle. Sometimes one or more disks go bad 
> during computation; sometimes this causes a job-level failure, sometimes it 
> does not.
> The following picture shows one failed job in which 4 task attempts were all 
> delivered to the same node and failed with almost the same exception while 
> writing the index temporary file to the same bad disk.
>  
> This is caused by two things:
>  # As we can see in the figure, the data and the node have the best data 
> locality for reading from HDFS. With the default spark.locality.wait (3s) in 
> effect, there is a high probability that those attempts will be scheduled to 
> this node.
>  # The index or data file name for a particular shuffle map task is fixed: it 
> is formed from the shuffle id, the map id, and the noop reduce id, which is 
> always 0. The root local dir is picked as the fixed file name's non-negative 
> hash code % the number of disks, so this choice is also fixed. Even with 12 
> disks in total and only one of them broken, once the broken one is picked, 
> all the following attempts of this task will inevitably pick the broken one 
> (see the sketch below).
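> A minimal sketch (not Spark's actual DiskBlockManager code) of why the picked 
> disk is deterministic; the file name and dir layout below are illustrative:
> {code:java}
> // The shuffle index/data file name is fixed per (shuffleId, mapId, reduceId=0),
> // so its non-negative hash always selects the same local dir, retry after retry.
> def pickLocalDir(fileName: String, localDirs: Seq[String]): String = {
>   val hash = fileName.hashCode & Int.MaxValue // non-negative hash code
>   localDirs(hash % localDirs.length)
> }
>
> val dirs = (0 until 12).map(i => s"/data$i/spark")
> // e.g. "shuffle_3_41_0.index" maps to the same dir on every attempt,
> // even if that disk happens to be the broken one.
> println(pickLocalDir("shuffle_3_41_0.index", dirs))
> {code}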
>  
>  
> !image-2019-09-26-16-44-48-554.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29271) Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then timeout

2019-09-27 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29271:
--
Description: 
Reproduce in master, run this UT

!WeChat33ff9dee6d78fe7be280d5f3b974e6ac.png|width=552,height=261!

  was:
Reproduce in master, run this UT

!WeChat33ff9dee6d78fe7be280d5f3b974e6ac.png!


> Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then 
> timeout
> ---
>
> Key: SPARK-29271
> URL: https://issues.apache.org/jira/browse/SPARK-29271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: WeChat33ff9dee6d78fe7be280d5f3b974e6ac.png
>
>
> Reproduce in master, run this UT
> !WeChat33ff9dee6d78fe7be280d5f3b974e6ac.png|width=552,height=261!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29271) Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then timeout

2019-09-27 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29271:
--
Description: 
Reproduce in master, run this UT

!WeChat33ff9dee6d78fe7be280d5f3b974e6ac.png!

  was:Reproduce in master, run this UT


> Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then 
> timeout
> ---
>
> Key: SPARK-29271
> URL: https://issues.apache.org/jira/browse/SPARK-29271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: WeChat33ff9dee6d78fe7be280d5f3b974e6ac.png
>
>
> Reproduce in master, run this UT
> !WeChat33ff9dee6d78fe7be280d5f3b974e6ac.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29271) Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then timeout

2019-09-27 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29271:
--
Description: Reproduce in master, run this UT  (was: Reproduce in master, 
run this UT
{code:java}
{code})

> Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then 
> timeout
> ---
>
> Key: SPARK-29271
> URL: https://issues.apache.org/jira/browse/SPARK-29271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: WeChat33ff9dee6d78fe7be280d5f3b974e6ac.png
>
>
> Reproduce in master, run this UT



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29271) Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then timeout

2019-09-27 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29271:
--
Attachment: WeChat33ff9dee6d78fe7be280d5f3b974e6ac.png

> Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then 
> timeout
> ---
>
> Key: SPARK-29271
> URL: https://issues.apache.org/jira/browse/SPARK-29271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: WeChat33ff9dee6d78fe7be280d5f3b974e6ac.png
>
>
> Reproduce in master, run this UT



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29271) Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then timeout

2019-09-27 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29271:
--
Description: 
Reproduce in master, run this UT
{code:java}
{code}

  was:
Reproduce in master, run this UT

!image-2019-09-27-14-26-31-226.png!


> Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then 
> timeout
> ---
>
> Key: SPARK-29271
> URL: https://issues.apache.org/jira/browse/SPARK-29271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Reproduce in master, run this UT
> {code:java}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29271) Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then timeout

2019-09-27 Thread angerszhu (Jira)
angerszhu created SPARK-29271:
-

 Summary: Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 
won't stop then timeout
 Key: SPARK-29271
 URL: https://issues.apache.org/jira/browse/SPARK-29271
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: angerszhu


Reproduce in master, run this UT

!image-2019-09-27-14-26-31-226.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29270) Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 won't stop then timeout

2019-09-27 Thread angerszhu (Jira)
angerszhu created SPARK-29270:
-

 Summary: Run HiveSparkSubmitSuite with hive jars use maven 3.1.2 
won't stop then timeout
 Key: SPARK-29270
 URL: https://issues.apache.org/jira/browse/SPARK-29270
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: angerszhu


Reproduce in master, run this UT

!image-2019-09-27-14-26-31-226.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29203) Reduce shuffle partitions in SQLQueryTestSuite

2019-09-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29203:

Parent: SPARK-25604
Issue Type: Sub-task  (was: Improvement)

> Reduce shuffle partitions in SQLQueryTestSuite
> --
>
> Key: SPARK-29203
> URL: https://issues.apache.org/jira/browse/SPARK-29203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> spark.sql.shuffle.partitions=200(default):
> {noformat}
> [info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds)
> [info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds)
> [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, 
> 763 milliseconds)
> {noformat}
> spark.sql.shuffle.partitions=5:
> {noformat}
> [info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds)
> [info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds)
> [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, 
> 360 milliseconds)
> {noformat}
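> A hedged sketch of the equivalent setting for a local session (the actual 
> suite change may differ); {{spark}} is assumed to be the active SparkSession:
> {code:java}
> // Far fewer shuffle partitions means far fewer tiny tasks per test query.
> spark.conf.set("spark.sql.shuffle.partitions", "5")
> {code}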



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29195) Can't config orc.compress.size option for native ORC writer

2019-09-27 Thread Eric Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939155#comment-16939155
 ] 

Eric Sun commented on SPARK-29195:
--

The cause is very likely in Spark - see 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala]

where only *MAPRED_OUTPUT_SCHEMA* and *COMPRESS* are set.
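
For context, a hedged sketch of the write-side options being discussed; the 
codec option is the one the native writer maps to COMPRESS, while the 
buffer-size option is the one this issue reports as ignored (the DataFrame, 
path, and values are illustrative):

{code:java}
// Per-write ORC options on the DataFrameWriter (df is an existing DataFrame).
df.write
  .format("orc")
  .option("compression", "zlib")         // honored: becomes the ORC COMPRESS codec
  .option("orc.compress.size", "524288") // reportedly not honored by the native writer
  .save("/tmp/orc-out")
{code}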

> Can't config orc.compress.size option for native ORC writer
> ---
>
> Key: SPARK-29195
> URL: https://issues.apache.org/jira/browse/SPARK-29195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: Linux
> Java 1.8.0
>Reporter: Eric Sun
>Priority: Minor
>  Labels: ORC
>
>  Only the codec can be effectively configured via code; "orc.compress.size" 
> and "orc.row.index.stride" cannot.
>  
> {code:java}
> // try
>   val spark = SparkSession
> .builder()
> .appName(appName)
> .enableHiveSupport()
> .config("spark.sql.orc.impl", "native")
> .config("orc.compress.size", 512 * 1024)
> .config("spark.sql.orc.compress.size", 512 * 1024)
> .config("hive.exec.orc.default.buffer.size", 512 * 1024)
> .config("spark.hadoop.io.file.buffer.size", 512 * 1024)
> .getOrCreate()
> {code}
> orcfiledump still shows:
>  
> {code:java}
> File Version: 0.12 with FUTURE
> Compression: ZLIB
> Compression size: 65536
> {code}
>  
> Executor Log:
> {code}
> impl.WriterImpl: ORC writer created for path: 
> hdfs://name_node_host:9000/foo/bar/_temporary/0/_temporary/attempt_20190920222359_0001_m_000127_0/part-00127-2a9a9287-54bf-441c-b3cf-718b122d9c2f_00127.c000.zlib.orc
>  with stripeSize: 67108864 blockSize: 268435456 compression: ZLIB bufferSize: 
> 65536
> File Output Committer Algorithm version is 2
> {code}
> According to [SPARK-23342], the other ORC options should be configurable. Is 
> there anything missing here?
> Is there any other way to affect "orc.compress.size"?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org