[jira] [Created] (SPARK-27552) The configuration `hive.exec.stagingdir` is invalid on Windows OS

2019-04-23 Thread liuxian (JIRA)
liuxian created SPARK-27552:
---

 Summary: The configuration `hive.exec.stagingdir` is invalid on 
Windows OS
 Key: SPARK-27552
 URL: https://issues.apache.org/jira/browse/SPARK-27552
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: liuxian


If we set _{{hive.exec.stagingdir=.test-staging\tmp}}_,
the staging directory is still _{{.hive-staging}}_ on Windows OS.
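A minimal reproduction sketch (not from the ticket; it assumes Hive support is enabled and uses a throwaway table name):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: set the Hive staging dir on the session, then run an insert that goes
// through a staging directory. On Windows the directory reportedly remains
// ".hive-staging" despite this setting.
val spark = SparkSession.builder()
  .appName("stagingdir-repro")
  .config("hive.exec.stagingdir", ".test-staging\\tmp")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS t_staging_demo (id INT) STORED AS PARQUET")
spark.sql("INSERT INTO t_staging_demo VALUES (1)")
spark.stop()
{code}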






[jira] [Resolved] (SPARK-27173) For hive parquet table, codecs (lz4, brotli, zstd) are not available

2019-04-15 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-27173.
-
Resolution: Won't Fix

> For hive parquet table, codecs (lz4, brotli, zstd) are not available
> ---
>
> Key: SPARK-27173
> URL: https://issues.apache.org/jira/browse/SPARK-27173
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> From _parquet.hadoop.metadata.CompressionCodecName_ (parquet-hadoop-bundle-1.6.0.jar), 
> we can see that for a hive parquet table only *snappy*, *gzip* 
> and *lzo* are supported.
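A quick way to confirm this, assuming parquet-hadoop-bundle-1.6.0.jar is on the classpath (that version still uses the old {{parquet.*}} package):

{code:scala}
// Lists the codec names known to the bundled Parquet 1.6.0:
// UNCOMPRESSED, SNAPPY, GZIP, LZO -- no lz4, brotli or zstd.
import parquet.hadoop.metadata.CompressionCodecName

CompressionCodecName.values().foreach(c => println(c.name()))
{code}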






[jira] [Updated] (SPARK-27256) If the configuration is used to set the number of bytes, we'd better use `bytesConf`.

2019-03-24 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-27256:

Description: 
Currently, if we want to configure `spark.sql.files.maxPartitionBytes` to 
256 megabytes, we must set `spark.sql.files.maxPartitionBytes=268435456`, 
which is unfriendly to users.

And if we set it like this: `spark.sql.files.maxPartitionBytes=256M`, we 
encounter this exception:

_Exception in thread "main" java.lang.IllegalArgumentException: 
spark.sql.files.maxPartitionBytes should be long, but was 256M_
     _at 
org.apache.spark.internal.config.ConfigHelpers$.toNumber(ConfigBuilder.scala:34)_

  was:
Currently, if we want to configure `spark.sql.files.maxPartitionBytes` to 
256 megabytes, we must set `spark.sql.files.maxPartitionBytes=268435456`, 
which is unfriendly to users.

And if we set it like this: `spark.sql.files.maxPartitionBytes=256M`, we 
encounter this exception:

_Exception in thread "main" java.lang.IllegalArgumentException: 
spark.sql.files.maxPartitionBytes should be long, but was 128M_
    _at 
org.apache.spark.internal.config.ConfigHelpers$.toNumber(ConfigBuilder.scala:34)_


> If the configuration is used to set the number of bytes, we'd better use 
> `bytesConf`.
> --
>
> Key: SPARK-27256
> URL: https://issues.apache.org/jira/browse/SPARK-27256
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> Currently, if we want to configure `spark.sql.files.maxPartitionBytes` to 
> 256 megabytes, we must set `spark.sql.files.maxPartitionBytes=268435456`, 
> which is unfriendly to users.
> And if we set it like this: `spark.sql.files.maxPartitionBytes=256M`, we 
> encounter this exception:
> _Exception in thread "main" java.lang.IllegalArgumentException: 
> spark.sql.files.maxPartitionBytes should be long, but was 256M_
>      _at 
> org.apache.spark.internal.config.ConfigHelpers$.toNumber(ConfigBuilder.scala:34)_
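For reference, a sketch of what the proposal amounts to, using Spark's internal {{ConfigBuilder}} API (the builder location, doc text and default below are illustrative, not the actual SQLConf entry):

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit

// Declaring the entry with bytesConf lets users write "256m" or "1g"
// instead of a raw byte count such as 268435456.
val FILES_MAX_PARTITION_BYTES = ConfigBuilder("spark.sql.files.maxPartitionBytes")
  .doc("The maximum number of bytes to pack into a single partition when reading files.")
  .bytesConf(ByteUnit.BYTE)
  .createWithDefaultString("128m")
{code}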






[jira] [Created] (SPARK-27256) If the configuration is used to set the number of bytes, we'd better use `bytesConf`.

2019-03-23 Thread liuxian (JIRA)
liuxian created SPARK-27256:
---

 Summary: If the configuration is used to set the number of bytes, 
we'd better use `bytesConf`.
 Key: SPARK-27256
 URL: https://issues.apache.org/jira/browse/SPARK-27256
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: liuxian


Currently, if we want to configure `spark.sql.files.maxPartitionBytes` to 
256 megabytes, we must set `spark.sql.files.maxPartitionBytes=268435456`, 
which is unfriendly to users.

And if we set it like this: `spark.sql.files.maxPartitionBytes=256M`, we 
encounter this exception:

_Exception in thread "main" java.lang.IllegalArgumentException: 
spark.sql.files.maxPartitionBytes should be long, but was 128M_
    _at 
org.apache.spark.internal.config.ConfigHelpers$.toNumber(ConfigBuilder.scala:34)_






[jira] [Updated] (SPARK-27238) In the same APP, maybe some hive Parquet(ORC) tables can't use the built-in Parquet(ORC) reader and writer

2019-03-22 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-27238:

Description: 
In the same APP, TableA and TableB are both hive Parquet tables, but TableA 
can't use the built-in Parquet reader and writer.

In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control 
this well, so I think we can add a fine-grained configuration to handle this 
case.

  was:
In the same APP, TableA and TableB are both hive parquet tables, but TableA 
can't use the built-in Parquet reader and writer.

In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control 
this well, so I think we can add a fine-grained configuration to handle this 
case.


> In the same APP, maybe some hive Parquet(ORC) tables can't use the built-in 
> Parquet(ORC) reader and writer
> --
>
> Key: SPARK-27238
> URL: https://issues.apache.org/jira/browse/SPARK-27238
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> In the same APP, TableA and TableB are both hive Parquet tables, but TableA 
> can't use the built-in Parquet reader and writer.
> In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control 
> this well, so I think we can add a fine-grained configuration to handle this 
> case.
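A sketch of the difference between today's session-wide switch and the fine-grained control being proposed (the table property name below is purely hypothetical):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Today: one switch that affects every Hive Parquet table in the session.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// Idea in this ticket: opt out per table, e.g. via a table property
// (hypothetical name/syntax, not an existing feature).
spark.sql("ALTER TABLE tableA SET TBLPROPERTIES ('spark.sql.hive.convertMetastore' = 'false')")
{code}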






[jira] [Updated] (SPARK-27238) In the same APP, maybe some hive Parquet(ORC) tables can't use the built-in Parquet(ORC) reader and writer

2019-03-22 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-27238:

Summary: In the same APP, maybe some hive Parquet(ORC) tables can't use the 
built-in Parquet(ORC) reader and writer  (was: In the same APP, maybe some hive 
parquet tables can't use the built-in Parquet reader and writer)

> In the same APP, maybe some hive Parquet(ORC) tables can't use the built-in 
> Parquet(ORC) reader and writer
> --
>
> Key: SPARK-27238
> URL: https://issues.apache.org/jira/browse/SPARK-27238
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> In the same APP, TableA and TableB are both hive parquet tables, but TableA 
> can't use the built-in Parquet reader and writer.
> In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control 
> this well, so I think we can add a fine-grained configuration to handle this 
> case.






[jira] [Updated] (SPARK-27238) In the same APP, maybe some hive parquet tables can't use the built-in Parquet reader and writer

2019-03-21 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-27238:

Description: 
In the same APP, TableA and TableB are both hive parquet tables, but TableA 
can't use the built-in Parquet reader and writer.

In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control 
this well, so I think we can add a fine-grained configuration to handle this 
case.

  was:
In the same APP, TableA and TableB are both hive parquet tables, but TableA 
can't use the built-in Parquet reader and writer.

In this situation, spark.sql.hive.convertMetastoreParquet 
can't control this well, so I think we can add a fine-grained 
configuration to handle this case.


> In the same APP, maybe some hive parquet tables can't use the built-in 
> Parquet reader and writer
> 
>
> Key: SPARK-27238
> URL: https://issues.apache.org/jira/browse/SPARK-27238
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> In the same APP, TableA and TableB are both hive parquet tables, but TableA 
> can't use the built-in Parquet reader and writer.
> In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control 
> this well, so I think we can add a fine-grained configuration to handle this 
> case.






[jira] [Created] (SPARK-27238) In the same APP, maybe some hive parquet tables can't use the built-in Parquet reader and writer

2019-03-21 Thread liuxian (JIRA)
liuxian created SPARK-27238:
---

 Summary: In the same APP, maybe some hive parquet tables can't use 
the built-in Parquet reader and writer
 Key: SPARK-27238
 URL: https://issues.apache.org/jira/browse/SPARK-27238
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: liuxian


In the same APP, TableA and TableB are both hive parquet tables, but TableA 
can't use the built-in Parquet reader and writer.

In this situation, spark.sql.hive.convertMetastoreParquet 
can't control this well, so I think we can add a fine-grained 
configuration to handle this case.






[jira] [Created] (SPARK-27173) For hive parquet table,codes()

2019-03-15 Thread liuxian (JIRA)
liuxian created SPARK-27173:
---

 Summary: For hive parquet table,codes()
 Key: SPARK-27173
 URL: https://issues.apache.org/jira/browse/SPARK-27173
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: liuxian









[jira] [Updated] (SPARK-27173) For hive parquet table, codecs (lz4, brotli, zstd) are not available

2019-03-15 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-27173:

Description: From _parquet.hadoop.metadata.CompressionCodecName_ (parquet-hadoop-bundle-1.6.0.jar), 
we can see that for a hive parquet table only *snappy*, *gzip* and 
*lzo* are supported.  (was: We can parquet-hadoop-bundle-1.6.0.jar
parquet.hadoop.metadata.CompressionCodecName)

> For hive parquet table, codecs (lz4, brotli, zstd) are not available
> ---
>
> Key: SPARK-27173
> URL: https://issues.apache.org/jira/browse/SPARK-27173
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> From _parquet.hadoop.metadata.CompressionCodecName_ (parquet-hadoop-bundle-1.6.0.jar), 
> we can see that for a hive parquet table only *snappy*, *gzip* 
> and *lzo* are supported.






[jira] [Updated] (SPARK-27173) For hive parquet table, codecs (lz4, brotli, zstd) are not available

2019-03-15 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-27173:

Description: We can parquet-hadoop-bundle-1.6.0.jar
parquet.hadoop.metadata.CompressionCodecName

> For hive parquet table, codecs (lz4, brotli, zstd) are not available
> ---
>
> Key: SPARK-27173
> URL: https://issues.apache.org/jira/browse/SPARK-27173
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> We can parquet-hadoop-bundle-1.6.0.jar
> parquet.hadoop.metadata.CompressionCodecName






[jira] [Updated] (SPARK-27173) For hive parquet table, codecs (lz4, brotli, zstd) are not available

2019-03-15 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-27173:

Summary: For hive parquet table, codecs (lz4, brotli, zstd) are not available  
(was: For hive parquet table,codes())

> For hive parquet table, codecs (lz4, brotli, zstd) are not available
> ---
>
> Key: SPARK-27173
> URL: https://issues.apache.org/jira/browse/SPARK-27173
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>







[jira] [Created] (SPARK-27083) Add a config to control subqueryReuse

2019-03-07 Thread liuxian (JIRA)
liuxian created SPARK-27083:
---

 Summary: Add a config to control subqueryReuse
 Key: SPARK-27083
 URL: https://issues.apache.org/jira/browse/SPARK-27083
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: liuxian


Subquery Reuse and Exchange Reuse are not the same feature. If we don't want to 
reuse subqueries but we do want to reuse exchanges, that cannot be expressed with 
only one configuration.
So I think we should add a new configuration to control subquery reuse.
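A sketch of the situation (the exchange-reuse key is an existing internal Spark SQL conf; the subquery-reuse key below is the proposed, hypothetical knob):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Keep reusing exchanges...
spark.conf.set("spark.sql.exchange.reuse", "true")
// ...but be able to turn off subquery reuse independently (proposed config).
spark.conf.set("spark.sql.execution.reuseSubquery", "false")
{code}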






[jira] [Updated] (SPARK-27056) Remove `start-shuffle-service.sh`

2019-03-05 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-27056:

Description: 
_start-shuffle-service.sh_ was only used by Mesos before 
_start-mesos-shuffle-service.sh_ was introduced.
Obviously, _start-mesos-shuffle-service.sh_ solves some problems and is better 
than _start-shuffle-service.sh_.
So now we should delete _start-shuffle-service.sh_ so that users do not keep using it.

  was:
_start-shuffle-service.sh_ was only used by Mesos before 
_start-mesos-shuffle-service.sh_ was introduced.
Obviously, _start-mesos-shuffle-service.sh_ solves some problems and is better 
than start-shuffle-service.sh.
So now we should delete _start-shuffle-service.sh_ so that users do not keep using it.


> Remove  `start-shuffle-service.sh`
> --
>
> Key: SPARK-27056
> URL: https://issues.apache.org/jira/browse/SPARK-27056
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> _start-shuffle-service.sh_ was only used by Mesos before 
> _start-mesos-shuffle-service.sh_ was introduced.
> Obviously, _start-mesos-shuffle-service.sh_ solves some problems and is 
> better than _start-shuffle-service.sh_.
> So now we should delete _start-shuffle-service.sh_ so that users do not keep using it.






[jira] [Created] (SPARK-27056) Remove `start-shuffle-service.sh`

2019-03-05 Thread liuxian (JIRA)
liuxian created SPARK-27056:
---

 Summary: Remove  `start-shuffle-service.sh`
 Key: SPARK-27056
 URL: https://issues.apache.org/jira/browse/SPARK-27056
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 3.0.0
Reporter: liuxian


_start-shuffle-service.sh_ was only used by Mesos before 
_start-mesos-shuffle-service.sh_ was introduced.
Obviously, _start-mesos-shuffle-service.sh_ solves some problems and is better 
than start-shuffle-service.sh.
So now we should delete _start-shuffle-service.sh_ so that users do not keep using it.






[jira] [Resolved] (SPARK-25574) Add an option `keepQuotes` for parsing csv file

2019-02-22 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-25574.
-
Resolution: Invalid

> Add an option `keepQuotes` for parsing csv file
> 
>
> Key: SPARK-25574
> URL: https://issues.apache.org/jira/browse/SPARK-25574
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> In our project, when we read a CSV file, we hope to keep the quotes.
> For example:
> We have a record like this in the CSV file:
> *ab,cc,,"c,ddd"*
> We hope it displays like this:
> |_c0|_c1|_c2|    _c3|
> |  ab|cc  |null|*"c,ddd"*|
>  
> Not like this:
> |_c0|_c1|_c2|  _c3|
> |  ab|cc  |null|c,ddd|
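A sketch of the behaviour being described, assuming a local file {{/tmp/demo.csv}} that contains the single line {{ab,cc,,"c,ddd"}}:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Default parsing: the quotes are interpreted and stripped, so the last
// column comes back as c,ddd (the second table in the description).
spark.read.csv("/tmp/demo.csv").show()

// The quote character is configurable; setting it to an unused character makes
// the parser treat the double quotes as ordinary data, but the quoted field is
// then split on its comma rather than kept whole as "c,ddd".
spark.read.option("quote", "\u0000").csv("/tmp/demo.csv").show()
{code}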






[jira] [Updated] (SPARK-26353) Add typed aggregate functions(max/min) to the example module

2019-02-15 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-26353:

Summary: Add typed aggregate functions(max/min) to the example module  
(was: Add typed aggregate functions: max&min)

> Add typed aggregate functions(max/min) to the example module
> 
>
> Key: SPARK-26353
> URL: https://issues.apache.org/jira/browse/SPARK-26353
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> For the Dataset API, the aggregate functions max&min are not implemented in a 
> type-safe way at the moment.
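For context, a standalone sketch of what a type-safe max could look like with the public {{Aggregator}} API (illustrative only; it is not the code that was added to the example module):

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// A typed max over Dataset[Long].
object TypedMax extends Aggregator[Long, Long, Long] {
  def zero: Long = Long.MinValue
  def reduce(buffer: Long, value: Long): Long = math.max(buffer, value)
  def merge(b1: Long, b2: Long): Long = math.max(b1, b2)
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val maxValue = Seq(3L, 7L, 5L).toDS().select(TypedMax.toColumn).head()  // 7
{code}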






[jira] [Updated] (SPARK-26353) Add typed aggregate functions(max/min) to the example module

2019-02-15 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-26353:

Description: Add typed aggregate functions(max/min) to the example module.  
(was: For the Dataset API, the aggregate functions max&min are not implemented in a 
type-safe way at the moment.)

> Add typed aggregate functions(max/min) to the example module
> 
>
> Key: SPARK-26353
> URL: https://issues.apache.org/jira/browse/SPARK-26353
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> Add typed aggregate functions(max/min) to the example module.






[jira] [Resolved] (SPARK-26793) Remove spark.shuffle.manager

2019-01-31 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-26793.
-
Resolution: Invalid

> Remove spark.shuffle.manager
> 
>
> Key: SPARK-26793
> URL: https://issues.apache.org/jira/browse/SPARK-26793
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> Currently, `ShuffleManager` always uses `SortShuffleManager`, so I think this 
> configuration can be removed.






[jira] [Created] (SPARK-26793) Remove spark.shuffle.manager

2019-01-30 Thread liuxian (JIRA)
liuxian created SPARK-26793:
---

 Summary: Remove spark.shuffle.manager
 Key: SPARK-26793
 URL: https://issues.apache.org/jira/browse/SPARK-26793
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: liuxian


Currently, `ShuffleManager` always uses `SortShuffleManager`, so I think this 
configuration can be removed.






[jira] [Created] (SPARK-26780) Improve shuffle read using ReadAheadInputStream

2019-01-29 Thread liuxian (JIRA)
liuxian created SPARK-26780:
---

 Summary: Improve  shuffle read using ReadAheadInputStream 
 Key: SPARK-26780
 URL: https://issues.apache.org/jira/browse/SPARK-26780
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: liuxian


Using _ReadAheadInputStream_ to improve shuffle read performance.
 _ReadAheadInputStream_ can reduce CPU utilization with almost no performance 
regression.
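A sketch of the idea (assuming {{ReadAheadInputStream}}'s two-argument constructor, {{(InputStream, bufferSizeInBytes)}}; the file path is hypothetical):

{code:scala}
import java.io.{BufferedInputStream, FileInputStream}
import org.apache.spark.io.ReadAheadInputStream

// Reads are served from a buffer that a background thread keeps filled, as the
// spill reader already does; this ticket proposes the same for the shuffle read path.
val raw = new FileInputStream("/tmp/shuffle-block.data")
val in = new ReadAheadInputStream(new BufferedInputStream(raw), 1024 * 1024)
try {
  val buf = new Array[Byte](8192)
  while (in.read(buf) != -1) {}  // consume the stream
} finally {
  in.close()
}
{code}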






[jira] [Updated] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory

2019-01-28 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23516:

Description: 

Now the `StaticMemoryManager` mode has been removed.
And for `UnifiedMemoryManager`, unroll memory is also storage memory, so I think 
it is unnecessary to actually release unroll memory and then acquire storage 
memory again.

  was:In fact, unroll memory is also storage memory, so I think it is 
unnecessary to actually release unroll memory and then acquire storage memory 
again.


> I think it is unnecessary to transfer unroll memory to storage memory 
> --
>
> Key: SPARK-23516
> URL: https://issues.apache.org/jira/browse/SPARK-23516
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> Now the `StaticMemoryManager` mode has been removed.
> And for `UnifiedMemoryManager`, unroll memory is also storage memory, so I 
> think it is unnecessary to actually release unroll memory and then acquire 
> storage memory again.






[jira] [Reopened] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory

2019-01-28 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian reopened SPARK-23516:
-

> I think it is unnecessary to transfer unroll memory to storage memory 
> --
>
> Key: SPARK-23516
> URL: https://issues.apache.org/jira/browse/SPARK-23516
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> Now the _StaticMemoryManager_ mode has been removed.
>  And for _UnifiedMemoryManager_, unroll memory is also storage memory, so I 
> think it is unnecessary to actually release unroll memory and then acquire 
> storage memory again.






[jira] [Updated] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory

2019-01-28 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23516:

Description: 
Now the _StaticMemoryManager_ mode has been removed.
 And for _UnifiedMemoryManager_, unroll memory is also storage memory, so I 
think it is unnecessary to actually release unroll memory and then acquire 
storage memory again.

  was:

Now the `StaticMemoryManager` mode has been removed.
And for `UnifiedMemoryManager`, unroll memory is also storage memory, so I think 
it is unnecessary to actually release unroll memory and then acquire storage 
memory again.


> I think it is unnecessary to transfer unroll memory to storage memory 
> --
>
> Key: SPARK-23516
> URL: https://issues.apache.org/jira/browse/SPARK-23516
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> Now the _StaticMemoryManager_ mode has been removed.
>  And for _UnifiedMemoryManager_, unroll memory is also storage memory, so I 
> think it is unnecessary to actually release unroll memory and then acquire 
> storage memory again.






[jira] [Updated] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory

2019-01-28 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23516:

Affects Version/s: (was: 2.3.0)
   3.0.0

> I think it is unnecessary to transfer unroll memory to storage memory 
> --
>
> Key: SPARK-23516
> URL: https://issues.apache.org/jira/browse/SPARK-23516
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> In fact, unroll memory is also storage memory, so I think it is unnecessary to 
> actually release unroll memory and then acquire storage memory again.






[jira] [Created] (SPARK-26621) Use ConfigEntry for hardcoded configs for shuffle categories.

2019-01-15 Thread liuxian (JIRA)
liuxian created SPARK-26621:
---

 Summary: Use ConfigEntry for hardcoded configs for shuffle 
categories.
 Key: SPARK-26621
 URL: https://issues.apache.org/jira/browse/SPARK-26621
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: liuxian


Make the following hardcoded configs use ConfigEntry.

{{spark.shuffle}}






[jira] [Created] (SPARK-26353) Add typed aggregate functions: max&min

2018-12-12 Thread liuxian (JIRA)
liuxian created SPARK-26353:
---

 Summary: Add typed aggregate functions: max&min
 Key: SPARK-26353
 URL: https://issues.apache.org/jira/browse/SPARK-26353
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: liuxian


For the Dataset API, the aggregate functions max&min are not implemented in a 
type-safe way at the moment.






[jira] [Created] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`

2018-12-06 Thread liuxian (JIRA)
liuxian created SPARK-26300:
---

 Summary: The `checkForStreaming` method may be called twice in 
`createQuery`
 Key: SPARK-26300
 URL: https://issues.apache.org/jira/browse/SPARK-26300
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: liuxian


If {{checkForContinuous}} is called ({{checkForStreaming}} is called in 
{{checkForContinuous}}), the {{checkForStreaming}} method will be called twice 
in {{createQuery}}. This is not necessary, and the {{checkForStreaming}} 
method has a lot of statements, so it's better to remove one of the calls.






[jira] [Closed] (SPARK-26264) It is better to add @transient to field 'locs' for class `ResultTask`.

2018-12-04 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian closed SPARK-26264.
---

> It is better to add @transient to field 'locs'  for class `ResultTask`.
> ---
>
> Key: SPARK-26264
> URL: https://issues.apache.org/jira/browse/SPARK-26264
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> The field 'locs' is only used on the driver side for class `ResultTask`, so it 
> does not need to be serialized when sending the `ResultTask` to an executor.
> Although it's not very big, it is sent very frequently, so we can add 
> `@transient` to it like `ShuffleMapTask`.
>  
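A minimal sketch (not Spark's actual {{ResultTask}}) of what the change amounts to: marking a driver-side-only field {{@transient}} so it is skipped during serialization:

{code:scala}
// Hypothetical stand-in class; names are illustrative.
class DemoTask(
    val partitionId: Int,
    @transient private val locs: Seq[String])  // preferred locations, only needed on the driver
  extends Serializable {

  // After deserialization on an executor, `locs` is null, so the task logic
  // must not depend on it there.
  def preferredLocations: Seq[String] = Option(locs).getOrElse(Nil)
}
{code}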






[jira] [Resolved] (SPARK-26264) It is better to add @transient to field 'locs' for class `ResultTask`.

2018-12-04 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-26264.
-
Resolution: Not A Problem

> It is better to add @transient to field 'locs'  for class `ResultTask`.
> ---
>
> Key: SPARK-26264
> URL: https://issues.apache.org/jira/browse/SPARK-26264
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> The field 'locs' is only used on the driver side for class `ResultTask`, so it 
> does not need to be serialized when sending the `ResultTask` to an executor.
> Although it's not very big, it is sent very frequently, so we can add 
> `@transient` to it like `ShuffleMapTask`.
>  






[jira] [Created] (SPARK-26264) It is better to add @transient to field 'locs' for class `ResultTask`.

2018-12-04 Thread liuxian (JIRA)
liuxian created SPARK-26264:
---

 Summary: It is better to add @transient to field 'locs'  for class 
`ResultTask`.
 Key: SPARK-26264
 URL: https://issues.apache.org/jira/browse/SPARK-26264
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: liuxian


The field 'locs' is only used on the driver side for class `ResultTask`, so it 
does not need to be serialized when sending the `ResultTask` to an executor.

Although it's not very big, it is sent very frequently, so we can add 
`@transient` to it like `ShuffleMapTask`.

 






[jira] [Updated] (SPARK-25729) It is better to replace `minPartitions` with `defaultParallelism` , when `minPartitions` is less than `defaultParallelism`

2018-11-04 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25729:

Description: 
For `WholeTextFileInputFormat`, when `minPartitions` is less than 
`defaultParallelism`, it is better to replace `minPartitions` with 
`defaultParallelism`, because this makes better use of resources and improves 
parallelism.

  was:
In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`, 
it is better to replace `minPartitions` with `defaultParallelism`, because 
this makes better use of resources and improves parallelism.


> It is better to replace `minPartitions` with `defaultParallelism` , when 
> `minPartitions` is less than `defaultParallelism`
> --
>
> Key: SPARK-25729
> URL: https://issues.apache.org/jira/browse/SPARK-25729
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> For `WholeTextFileInputFormat`, when `minPartitions` is less than 
> `defaultParallelism`, it is better to replace `minPartitions` with 
> `defaultParallelism`, because this makes better use of resources and improves 
> parallelism.
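A sketch of the call path in question (the input path is hypothetical):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

// wholeTextFiles takes a minPartitions hint; this ticket argues the effective
// value should not fall below sc.defaultParallelism when reading many small files.
val rdd = sc.wholeTextFiles("/tmp/small-files/*", minPartitions = 2)
println(s"defaultParallelism = ${sc.defaultParallelism}, partitions = ${rdd.getNumPartitions}")
{code}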






[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat and OrcFileFormat

2018-10-28 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25806:

Summary:  The instanceof FileSplit is redundant for ParquetFileFormat and 
OrcFileFormat  (was:  The instanceof FileSplit is redundant for 
ParquetFileFormat)

>  The instanceof FileSplit is redundant for ParquetFileFormat and OrcFileFormat
> --
>
> Key: SPARK-25806
> URL: https://issues.apache.org/jira/browse/SPARK-25806
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Trivial
>
> The instance of FileSplit is redundant in the 
> ParquetFileFormat and 
> {{hive\orc\OrcFileFormat}} classes.






[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat

2018-10-28 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25806:

Description: The instance of FileSplit is redundant in the 
ParquetFileFormat and {{hive\orc\OrcFileFormat}} classes.  (was: The instance of FileSplit is 
redundant for buildReaderWithPartitionValues in the ParquetFileFormat class.)

>  The instanceof FileSplit is redundant for ParquetFileFormat
> 
>
> Key: SPARK-25806
> URL: https://issues.apache.org/jira/browse/SPARK-25806
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Trivial
>
> The instance of FileSplit is redundant in the 
> ParquetFileFormat and 
> {{hive\orc\OrcFileFormat}} classes.






[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat

2018-10-23 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25806:

Description: 
The instance of FileSplit is redundant for 
buildReaderWithPartitionValues in the ParquetFileFormat class.

  was:
The instanceof FileSplit is redundant for 
buildReaderWithPartitionValues in the ParquetFileFormat class.


>  The instanceof FileSplit is redundant for ParquetFileFormat
> 
>
> Key: SPARK-25806
> URL: https://issues.apache.org/jira/browse/SPARK-25806
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Trivial
>
> The instance of FileSplit is redundant for 
> buildReaderWithPartitionValues in the ParquetFileFormat class.






[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat

2018-10-23 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25806:

Description: 
The instanceof FileSplit is redundant for 
buildReaderWithPartitionValues in the ParquetFileFormat class.

  was:
The instanceof FileSplit is redundant for 
buildReaderWithPartitionValues in the ParquetFileFormat class.


>  The instanceof FileSplit is redundant for ParquetFileFormat
> 
>
> Key: SPARK-25806
> URL: https://issues.apache.org/jira/browse/SPARK-25806
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Trivial
>
> The instanceof FileSplit is redundant for 
> buildReaderWithPartitionValues in the ParquetFileFormat class.






[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat

2018-10-23 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25806:

Description: 
The instanceof FileSplit is redundant for 
buildReaderWithPartitionValues in the ParquetFileFormat class.

  was:The instanceof FileSplit is redundant for 
buildReaderWithPartitionValues in the ParquetFileFormat class.


>  The instanceof FileSplit is redundant for ParquetFileFormat
> 
>
> Key: SPARK-25806
> URL: https://issues.apache.org/jira/browse/SPARK-25806
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Trivial
>
> The instanceof FileSplit is redundant for 
> buildReaderWithPartitionValues in the ParquetFileFormat class.






[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat

2018-10-23 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25806:

Description: The instanceof FileSplit is redundant for 
buildReaderWithPartitionValues in the ParquetFileFormat class.  (was: The 
instanceof FileSplit is redundant for 
buildReaderWithPartitionValues in the ParquetFileFormat class.)

>  The instanceof FileSplit is redundant for ParquetFileFormat
> 
>
> Key: SPARK-25806
> URL: https://issues.apache.org/jira/browse/SPARK-25806
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Trivial
>
> The instanceof FileSplit is redundant for 
> buildReaderWithPartitionValues in the ParquetFileFormat class.






[jira] [Created] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat

2018-10-23 Thread liuxian (JIRA)
liuxian created SPARK-25806:
---

 Summary:  The instanceof FileSplit is redundant for 
ParquetFileFormat
 Key: SPARK-25806
 URL: https://issues.apache.org/jira/browse/SPARK-25806
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 3.0.0
Reporter: liuxian


The instanceof FileSplit is redundant for 
buildReaderWithPartitionValues in the ParquetFileFormat class.






[jira] [Updated] (SPARK-25786) If the ByteBuffer.hasArray is false , it will throw UnsupportedOperationException for Kryo

2018-10-19 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25786:

Environment: (was: In Kryo's `deserialize`, the 
type of the input parameter is ByteBuffer; if it is not backed by an accessible 
byte array, it will throw UnsupportedOperationException

Exception Info:

java.lang.UnsupportedOperationException was thrown.
java.lang.UnsupportedOperationException
    at java.nio.ByteBuffer.array(ByteBuffer.java:994)
    at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)

 )
Description: 
In Kryo's `deserialize`, the type of the input parameter is 
ByteBuffer; if it is not backed by an accessible byte array, it will throw 
UnsupportedOperationException

Exception Info:

java.lang.UnsupportedOperationException was thrown.
 java.lang.UnsupportedOperationException
     at java.nio.ByteBuffer.array(ByteBuffer.java:994)
     at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)

 

> If the ByteBuffer.hasArray is false , it will throw 
> UnsupportedOperationException for Kryo
> --
>
> Key: SPARK-25786
> URL: https://issues.apache.org/jira/browse/SPARK-25786
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Major
>
> In Kryo's `deserialize`, the type of the input parameter is 
> ByteBuffer; if it is not backed by an accessible byte array, it will throw 
> UnsupportedOperationException
> Exception Info:
> java.lang.UnsupportedOperationException was thrown.
>  java.lang.UnsupportedOperationException
>      at java.nio.ByteBuffer.array(ByteBuffer.java:994)
>      at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
>  
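A sketch of the failure mode and the usual workaround ({{ByteBuffer.array()}} is only legal when {{hasArray}} is true; a direct buffer needs a defensive copy):

{code:scala}
import java.nio.ByteBuffer

def toBytes(buf: ByteBuffer): Array[Byte] =
  if (buf.hasArray) {
    buf.array()                        // heap buffer: backing array is accessible
  } else {
    val copy = new Array[Byte](buf.remaining())
    buf.duplicate().get(copy)          // direct buffer: copy the bytes out instead
    copy
  }

val direct = ByteBuffer.allocateDirect(16)
println(toBytes(direct).length)        // works; direct.array() alone would throw
{code}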






[jira] [Created] (SPARK-25786) If the ByteBuffer.hasArray is false , it will throw UnsupportedOperationException for Kryo

2018-10-19 Thread liuxian (JIRA)
liuxian created SPARK-25786:
---

 Summary: If the ByteBuffer.hasArray is false , it will throw 
UnsupportedOperationException for Kryo
 Key: SPARK-25786
 URL: https://issues.apache.org/jira/browse/SPARK-25786
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
 Environment: In Kryo's `deserialize`, the type 
of the input parameter is ByteBuffer; if it is not backed by an accessible byte 
array, it will throw UnsupportedOperationException

Exception Info:

java.lang.UnsupportedOperationException was thrown.
java.lang.UnsupportedOperationException
    at java.nio.ByteBuffer.array(ByteBuffer.java:994)
    at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)

 
Reporter: liuxian









[jira] [Created] (SPARK-25780) Scheduling the tasks which have no higher level locality first

2018-10-19 Thread liuxian (JIRA)
liuxian created SPARK-25780:
---

 Summary: Scheduling the tasks which have no higher level locality 
first
 Key: SPARK-25780
 URL: https://issues.apache.org/jira/browse/SPARK-25780
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 3.0.0
Reporter: liuxian


For example:
An application has two executors: (exec1, host1), (exec2, host2).
And 3 tasks with locality: \{task0, Seq(TaskLocation("host1", "exec1"))}, 
\{task1, Seq(TaskLocation("host1", "exec1"), TaskLocation("host2"))},  \{task2, 
Seq(TaskLocation("host2"))}
If task0 is running in exec1, when `allowedLocality` is NODE_LOCAL for exec2, it 
is better to schedule task2 first, not task1, because task1 may be scheduled to 
exec1 later.
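A simplified sketch of the preference being argued for (not Spark's {{TaskSetManager}}; names follow the example above):

{code:scala}
case class Task(name: String, prefHosts: Seq[String], prefExecs: Seq[String])

val task1 = Task("task1", prefHosts = Seq("host1", "host2"), prefExecs = Seq("exec1"))
val task2 = Task("task2", prefHosts = Seq("host2"), prefExecs = Nil)

// Offer from exec2 on host2 at NODE_LOCAL: both tasks match the host, but task1
// still has an unsatisfied PROCESS_LOCAL preference (exec1), so pick task2 first
// and leave task1 for a possible exec1 offer later.
def pick(offerHost: String, candidates: Seq[Task]): Option[Task] = {
  val nodeLocal = candidates.filter(_.prefHosts.contains(offerHost))
  nodeLocal.sortBy(_.prefExecs.nonEmpty).headOption  // tasks with no executor preference sort first
}

println(pick("host2", Seq(task1, task2)))  // Some(Task(task2,...))
{code}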






[jira] [Updated] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-10-18 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25776:

Description: 
In {{UnsafeSorterSpillWriter.java}}, when we write a 
record to a spill file with {{void write(Object baseObject, 
long baseOffset, int recordLength, long keyPrefix)}}, 
{{recordLength}} and {{keyPrefix}} 
will be written to the disk write buffer first, and these take 12 
bytes, so the disk write buffer size must be greater than 12.

If {{diskWriteBufferSize}} is 10, it will print this exception info:

_java.lang.ArrayIndexOutOfBoundsException: 10_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
 (UnsafeSorterSpillWriter.java:91)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
 _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_

  was:
In {{UnsafeSorterSpillWriter.java}}, when we write a record to a spill file 
with {{void write(Object baseObject, long baseOffset, int recordLength, long 
keyPrefix)}}, {{recordLength}} and {{keyPrefix}} will be written to the disk write 
buffer first, and these take 12 bytes, so the disk write buffer size must 
be greater than 12.

If {{diskWriteBufferSize}} is 10, it will print this exception info:

_java.lang.ArrayIndexOutOfBoundsException: 10_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
 (UnsafeSorterSpillWriter.java:91)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
 _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_


> The disk write buffer size must be greater than 12.
> ---
>
> Key: SPARK-25776
> URL: https://issues.apache.org/jira/browse/SPARK-25776
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> In {{UnsafeSorterSpillWriter.java}}, when we write a 
> record to a spill file with {{void write(Object baseObject, 
> long baseOffset, int recordLength, long keyPrefix)}}, 
> {{recordLength}} and {{keyPrefix}} 
> will be written to the disk write buffer first, and these take 12 
> bytes, so the disk write buffer size must be greater than 12.
> If {{diskWriteBufferSize}} is 10, it will print this exception info:
> _java.lang.ArrayIndexOutOfBoundsException: 10_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
>  (UnsafeSorterSpillWriter.java:91)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
>  _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_
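A standalone sketch of why 12 is the lower bound: a 4-byte int plus an 8-byte long are staged before any record bytes (this is an illustration, not the {{UnsafeSorterSpillWriter}} code itself):

{code:scala}
import java.nio.ByteBuffer

val diskWriteBufferSize = 12                  // anything smaller cannot hold the header
val writeBuffer = ByteBuffer.allocate(diskWriteBufferSize)

writeBuffer.putInt(42)                        // recordLength: 4 bytes
writeBuffer.putLong(7L)                       // keyPrefix:    8 bytes
println(writeBuffer.position())               // 12 -- a 10-byte buffer would overflow here
{code}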






[jira] [Updated] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-10-18 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25776:

Description: 
In {{UnsafeSorterSpillWriter.java}}, when we write a 
record to a spill file with {{void write(Object baseObject, 
long baseOffset, int recordLength, long keyPrefix)}}, 
{{recordLength}} and {{keyPrefix}} 
will be written to the disk write buffer first, and these take 12 
bytes, so the disk write buffer size must be greater than 12.

If {{diskWriteBufferSize}} is 10, it will print this 
exception info:

_java.lang.ArrayIndexOutOfBoundsException: 10_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
 (UnsafeSorterSpillWriter.java:91)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
 _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_

  was:
In {{UnsafeSorterSpillWriter.java}}, when we write a 
record to a spill file with {{void write(Object baseObject, 
long baseOffset, int recordLength, long keyPrefix)}}, 
{{recordLength}} and {{keyPrefix}} 
will be written to the disk write buffer first, and these take 12 
bytes, so the disk write buffer size must be greater than 12.

If {{diskWriteBufferSize}} is 10, it will print this exception info:

_java.lang.ArrayIndexOutOfBoundsException: 10_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
 (UnsafeSorterSpillWriter.java:91)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
 _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_


> The disk write buffer size must be greater than 12.
> ---
>
> Key: SPARK-25776
> URL: https://issues.apache.org/jira/browse/SPARK-25776
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> In {{UnsafeSorterSpillWriter.java}}, when we write a 
> record to a spill file with {{void write(Object baseObject, 
> long baseOffset, int recordLength, long keyPrefix)}}, 
> {{recordLength}} and {{keyPrefix}} 
> will be written to the disk write buffer first, and these take 12 
> bytes, so the disk write buffer size must be greater than 12.
> If {{diskWriteBufferSize}} is 10, it will print this 
> exception info:
> _java.lang.ArrayIndexOutOfBoundsException: 10_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
>  (UnsafeSorterSpillWriter.java:91)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
>  _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_






[jira] [Created] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-10-18 Thread liuxian (JIRA)
liuxian created SPARK-25776:
---

 Summary: The disk write buffer size must be greater than 12.
 Key: SPARK-25776
 URL: https://issues.apache.org/jira/browse/SPARK-25776
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: liuxian


In {{UnsafeSorterSpillWriter.java}}, when we write a record to a spill file 
with {{void write(Object baseObject, long baseOffset, int recordLength, long 
keyPrefix)}}, {{recordLength}} and {{keyPrefix}} will be written to the disk write 
buffer first, and these take 12 bytes, so the disk write buffer size must 
be greater than 12.

If {{diskWriteBufferSize}} is 10, it will print this exception info:

_java.lang.ArrayIndexOutOfBoundsException: 10_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
 (UnsafeSorterSpillWriter.java:91)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
 _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_
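A minimal illustration in Scala (not Spark source code; the literal values are made 
up) of why the header alone needs 12 bytes: the record length is an int (4 bytes) and 
the key prefix is a long (8 bytes), and both are staged in the disk write buffer 
before any record payload:

import java.nio.ByteBuffer

// recordLength (int, 4 bytes) + keyPrefix (long, 8 bytes) = 12 bytes of header
val buf = ByteBuffer.allocate(16)
buf.putInt(42)    // stands in for recordLength
buf.putLong(7L)   // stands in for keyPrefix
assert(buf.position() == 12)  // a diskWriteBufferSize <= 12 cannot even hold this header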



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25753) binaryFiles broken for small files

2018-10-16 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25753:

Description: 
{{StreamFileInputFormat}} and {{WholeTextFileInputFormat}} 
(https://issues.apache.org/jira/browse/SPARK-24610) have the same problem: for 
small files, the maxSplitSize computed by {{StreamFileInputFormat}} is far 
smaller than the default or commonly used split size of 64/128 MB, and Spark 
throws an exception while trying to read them.

Exception info:

Minimum split size pernode 5123456 cannot be larger than maximum split size 
4194304 java.io.IOException: Minimum split size pernode 5123456 cannot be 
larger than maximum split size 4194304 at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201) at 
org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at 
scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)

  was:
{{StreamFileInputFormat}} and 
{{WholeTextFileInputFormat(https://issues.apache.org/jira/browse/SPARK-24610)}} 
have the same problem: for small sized files, the computed maxSplitSize by 
`{{StreamFileInputFormat}} `  is way smaller than the default or commonly used 
split size of 64/128M and spark throws an exception while trying to read them.

{{Exception info:Minimum split size pernode 5123456 cannot be larger than 
maximum split size 4194304 java.io.IOException: Minimum split size pernode 
5123456 cannot be larger than maximum split size 4194304 at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:
 201) at 
org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at 
scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}


> binaryFiles broken for small files
> --
>
> Key: SPARK-25753
> URL: https://issues.apache.org/jira/browse/SPARK-25753
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> {{StreamFileInputFormat}} and {{WholeTextFileInputFormat}} 
> (https://issues.apache.org/jira/browse/SPARK-24610) have the same problem: for 
> small files, the maxSplitSize computed by {{StreamFileInputFormat}} is far 
> smaller than the default or commonly used split size of 64/128 MB, and Spark 
> throws an exception while trying to read them.
> Exception info:
> Minimum split size pernode 5123456 cannot be larger than maximum split 
> size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot 
> be larger than maximum split size 4194304 at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201) at 
> org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at 
> scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25753) binaryFiles broken for small files

2018-10-16 Thread liuxian (JIRA)
liuxian created SPARK-25753:
---

 Summary: binaryFiles broken for small files
 Key: SPARK-25753
 URL: https://issues.apache.org/jira/browse/SPARK-25753
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.0.0
Reporter: liuxian


{{StreamFileInputFormat}} and {{WholeTextFileInputFormat}} 
(https://issues.apache.org/jira/browse/SPARK-24610) have the same problem: for 
small files, the maxSplitSize computed by {{StreamFileInputFormat}} is far 
smaller than the default or commonly used split size of 64/128 MB, and Spark 
throws an exception while trying to read them.

Exception info: Minimum split size pernode 5123456 cannot be larger than 
maximum split size 4194304 java.io.IOException: Minimum split size pernode 
5123456 cannot be larger than maximum split size 4194304 at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201) at 
org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at 
scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
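A hedged reproduction sketch (the directory path and the per-node minimum split size 
are illustrative assumptions; the Hadoop property name is the standard one read by 
CombineFileInputFormat). With many KB-sized files, the computed maxSplitSize is tiny, 
so a configured per-node minimum easily exceeds it:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("binaryFiles-small-files")
  // assumed setting for illustration; matches the value in the exception above
  .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize.per.node", "5123456")
val sc = new SparkContext(conf)

// "/tmp/small-files" is assumed to contain only files of a few KB each
sc.binaryFiles("/tmp/small-files").count()
// may fail with: java.io.IOException: Minimum split size pernode 5123456 cannot be
// larger than maximum split size ...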



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25729) It is better to replace `minPartitions` with `defaultParallelism` , when `minPartitions` is less than `defaultParallelism`

2018-10-16 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25729:

Description: 
In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`,

it is better to replace `minPartitions` with `defaultParallelism`, because 
this makes better use of resources and improves parallelism.

  was:
In ‘WholeTextFileRDD’,when `minPartitions` is less than `defaultParallelism`,

it is better to replace `minPartitions` with `defaultParallelism` , because 
this can make better use of resources and improve concurrency.


> It is better to replace `minPartitions` with `defaultParallelism` , when 
> `minPartitions` is less than `defaultParallelism`
> --
>
> Key: SPARK-25729
> URL: https://issues.apache.org/jira/browse/SPARK-25729
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> In ‘WholeTextFileRDD’,when `minPartitions` is less than `defaultParallelism`,
> it is better to replace `minPartitions` with `defaultParallelism` , because 
> this can make better use of resources and improve parallelism.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25729) It is better to replace `minPartitions` with `defaultParallelism` , when `minPartitions` is less than `defaultParallelism`

2018-10-15 Thread liuxian (JIRA)
liuxian created SPARK-25729:
---

 Summary: It is better to replace `minPartitions` with 
`defaultParallelism` , when `minPartitions` is less than `defaultParallelism`
 Key: SPARK-25729
 URL: https://issues.apache.org/jira/browse/SPARK-25729
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: liuxian


In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`,

it is better to replace `minPartitions` with `defaultParallelism`, because 
this makes better use of resources and improves concurrency.
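A minimal sketch of the proposed choice (illustrative only, not the actual patch): 
take the larger of the caller-supplied minimum and the scheduler's default 
parallelism when deciding the target number of partitions.

// sc is the active SparkContext; minPartitions comes from the caller
def targetPartitions(sc: org.apache.spark.SparkContext, minPartitions: Int): Int =
  math.max(minPartitions, sc.defaultParallelism)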



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25674) If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated

2018-10-08 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25674:

Priority: Minor  (was: Trivial)

> If the records are incremented by more than 1 at a time,the number of bytes 
> might rarely ever get updated
> -
>
> Key: SPARK-25674
> URL: https://issues.apache.org/jira/browse/SPARK-25674
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> If the records are incremented by more than 1 at a time, the number of bytes 
> might rarely ever get updated in `FileScanRDD.scala`, because the check might 
> skip over a count that is an exact multiple of 
> UPDATE_INPUT_METRICS_INTERVAL_RECORDS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25674) If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated

2018-10-07 Thread liuxian (JIRA)
liuxian created SPARK-25674:
---

 Summary: If the records are incremented by more than 1 at a 
time,the number of bytes might rarely ever get updated
 Key: SPARK-25674
 URL: https://issues.apache.org/jira/browse/SPARK-25674
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: liuxian


If the records are incremented by more than 1 at a time, the number of bytes 
might rarely ever get updated in `FileScanRDD.scala`, because the check might skip over 
a count that is an exact multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
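A hedged sketch of the failure mode (names and the interval value are illustrative, 
not the actual FileScanRDD code): an equality check only fires when the counter lands 
exactly on a multiple of the interval, so bumping the counter by more than 1 can jump 
over it, while a crossing check does not miss the update.

val UPDATE_INTERVAL = 1000L

// Equality check: misses the update when recordsRead jumps from 998 to 1002
def shouldUpdateExact(recordsRead: Long): Boolean =
  recordsRead % UPDATE_INTERVAL == 0

// Crossing check: fires whenever a multiple of the interval has been passed
def shouldUpdateCrossing(prevRecordsRead: Long, recordsRead: Long): Boolean =
  prevRecordsRead / UPDATE_INTERVAL != recordsRead / UPDATE_INTERVAL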



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25574) Add an option `keepQuotes` for parsing csv file

2018-09-29 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25574:

Description: 
In our project, when we read a CSV file, we want to keep the quotes.

For example, we have a record like this in the CSV file:

*ab,cc,,"c,ddd"*

We hope it displays like this:
|_c0|_c1|_c2|    _c3|
|  ab|cc   |null|*"c,ddd"*|

 

Not like this:
|_c0|_c1|_c2|  _c3|
|  ab|cc   |null|c,ddd|

  was:
In our project, when we read the CSV file, we hope to keep quotes.

For example:

We have such a record in the CSV file.:

*ab,cc,,"c,ddd"*

We hope it displays like this:

++---++---+
| _c0|_c1| _c2|    _c3|
++---++---+
|  ab| cc|null|*"c,ddd"*|

 

not like this:

++---++-+
| _c0|_c1| _c2|  _c3|
++---++-+
|  ab| cc|null|c,ddd|
++---++-+


> Add an option `keepQuotes` for parsing csv  file
> 
>
> Key: SPARK-25574
> URL: https://issues.apache.org/jira/browse/SPARK-25574
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> In our project, when we read the CSV file, we hope to keep quotes.
> For example:
> We have such a record in the CSV file.:
> *ab,cc,,"c,ddd"*
> We hope it displays like this:
> |_c0|_c1|_c2|    _c3|
> |  ab|cc   |null|*"c,ddd"*|
>  
> Not like this:
> |_c0|_c1|_c2|  _c3|
> |  ab|cc   |null |c,ddd|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25574) Add an option `keepQuotes` for parsing csv file

2018-09-29 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25574:

Description: 
In our project, when we read the CSV file, we hope to keep quotes.

For example:

We have such a record in the CSV file.:

*ab,cc,,"c,ddd"*

We hope it displays like this:
|_c0|_c1|_c2|    _c3|
|  ab|cc  |null|*"c,ddd"*|

 

Not like this:
|_c0|_c1|_c2|  _c3|
|  ab|cc  |null|c,ddd|

  was:
In our project, when we read the CSV file, we hope to keep quotes.

For example:

We have such a record in the CSV file.:

*ab,cc,,"c,ddd"*

We hope it displays like this:
|_c0|_c1|_c2|    _c3|
|  ab|cc   |null|*"c,ddd"*|

 

Not like this:
|_c0|_c1|_c2|  _c3|
|  ab|cc   |null |c,ddd|

+-+--++-+


> Add an option `keepQuotes` for parsing csv  file
> 
>
> Key: SPARK-25574
> URL: https://issues.apache.org/jira/browse/SPARK-25574
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> In our project, when we read the CSV file, we hope to keep quotes.
> For example:
> We have such a record in the CSV file.:
> *ab,cc,,"c,ddd"*
> We hope it displays like this:
> |_c0|_c1|_c2|    _c3|
> |  ab|cc  |null|*"c,ddd"*|
>  
> Not like this:
> |_c0|_c1|_c2|  _c3|
> |  ab|cc  |null|c,ddd|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25574) Add an option `keepQuotes` for parsing csv file

2018-09-29 Thread liuxian (JIRA)
liuxian created SPARK-25574:
---

 Summary: Add an option `keepQuotes` for parsing csv  file
 Key: SPARK-25574
 URL: https://issues.apache.org/jira/browse/SPARK-25574
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: liuxian


In our project, when we read a CSV file, we want to keep the quotes.

For example, we have a record like this in the CSV file:

*ab,cc,,"c,ddd"*

We hope it displays like this:

+----+---+----+---------+
| _c0|_c1| _c2|      _c3|
+----+---+----+---------+
|  ab| cc|null|*"c,ddd"*|
+----+---+----+---------+

 

not like this:

+----+---+----+-----+
| _c0|_c1| _c2|  _c3|
+----+---+----+-----+
|  ab| cc|null|c,ddd|
+----+---+----+-----+
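A short sketch of the behaviour described above (standard CSV reading; `spark` is an 
active SparkSession and data.csv is an assumed local file; `keepQuotes` is only the 
option proposed by this ticket and does not exist yet):

// data.csv contains the single record: ab,cc,,"c,ddd"
val df = spark.read.csv("data.csv")
df.show()
// +---+---+----+-----+
// |_c0|_c1| _c2|  _c3|
// +---+---+----+-----+
// | ab| cc|null|c,ddd|   <- the quotes around "c,ddd" are stripped today
// +---+---+----+-----+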



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25366) Zstd and brotli CompressionCodec are not supported for parquet files

2018-09-09 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25366:

Summary: Zstd and brotli CompressionCodec are  not supported for parquet 
files  (was: Zstd and brotil CompressionCodec are  not supported for parquet 
files)

> Zstd and brotli CompressionCodec are  not supported for parquet files
> -
>
> Key: SPARK-25366
> URL: https://issues.apache.org/jira/browse/SPARK-25366
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class 
> org.apache.hadoop.io.compress.*BrotliCodec* was not found
>     at 
> org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
>     at 
> org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.(CodecFactory.java:142)
>     at 
> org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
>     at 
> org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
>     at 
> org.apache.parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:153)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
>     
>     
>     
>     
>     
> Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class 
> org.apache.hadoop.io.compress.*ZStandardCodec* was not 
> found
>     at 
> org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
>     at 
> org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.(CodecFactory.java:142)
>     at 
> org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
>     at 
> org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
>     at 
> org.apache.parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:153)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
>     at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25366) Zstd and brotil CompressionCodec are not supported for parquet files

2018-09-07 Thread liuxian (JIRA)
liuxian created SPARK-25366:
---

 Summary: Zstd and brotil CompressionCodec are  not supported for 
parquet files
 Key: SPARK-25366
 URL: https://issues.apache.org/jira/browse/SPARK-25366
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: liuxian


Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class 
org.apache.hadoop.io.compress.*BrotliCodec* was not found
    at 
org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
    at 
org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.(CodecFactory.java:142)
    at 
org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
    at 
org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
    at 
org.apache.parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:153)
    at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
    at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
    at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37)
    at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
    
    
    
    
    
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class 
org.apache.hadoop.io.compress.*ZStandardCodec* was not 
found
    at 
org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
    at 
org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.(CodecFactory.java:142)
    at 
org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
    at 
org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
    at 
org.apache.parquet.hadoop.ParquetRecordWriter.(ParquetRecordWriter.java:153)
    at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
    at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
    at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37)
    at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
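A minimal sketch of how the failure is reached (the output path is illustrative and 
`df` is an existing DataFrame; whether the write succeeds depends on the codec's 
Hadoop class being on the classpath, which is exactly what this ticket is about):

// Requesting a codec whose Hadoop codec class is missing raises the
// BadConfigurationException shown above at write time
df.write
  .option("compression", "zstd")   // or "brotli"
  .parquet("/tmp/out-zstd")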



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25356) Add Parquet block size (row group size) option to SparkSQL configuration

2018-09-06 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-25356.
-
Resolution: Invalid

>  Add  Parquet block size (row group size)  option to SparkSQL configuration
> ---
>
> Key: SPARK-25356
> URL: https://issues.apache.org/jira/browse/SPARK-25356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> I think we should configure the Parquet buffer size when using Parquet format.
> Because for HDFS, `dfs.block.size` is configurable, sometimes we hope the 
> block size of parquet to be consistent with it.
> And  whether this parameter `spark.sql.files.maxPartitionBytes` is best 
> consistent with the Parquet  block size when using Parquet format?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25356) Add Parquet block size (row group size) option to SparkSQL configuration

2018-09-06 Thread liuxian (JIRA)
liuxian created SPARK-25356:
---

 Summary:  Add  Parquet block size (row group size)  option to 
SparkSQL configuration
 Key: SPARK-25356
 URL: https://issues.apache.org/jira/browse/SPARK-25356
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: liuxian


I think we should be able to configure the Parquet block size (row group size) when 
using the Parquet format.

Because `dfs.block.size` is configurable for HDFS, we sometimes want the Parquet 
block size to be consistent with it.

Also, wouldn't it be best for `spark.sql.files.maxPartitionBytes` to be consistent 
with the Parquet block size when using the Parquet format?
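A hedged sketch of the existing workaround (parquet.block.size is the standard Parquet 
row group size property; the value and path are illustrative): set the row group size 
on the Hadoop configuration before writing.

// Align the Parquet row group size with a 128 MB HDFS block (illustrative value)
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)
df.write.parquet("/tmp/out")   // df is an existing DataFrame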



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25300) Unified the configuration parameter `spark.shuffle.service.enabled`

2018-08-31 Thread liuxian (JIRA)
liuxian created SPARK-25300:
---

 Summary: Unified the configuration parameter 
`spark.shuffle.service.enabled`
 Key: SPARK-25300
 URL: https://issues.apache.org/jira/browse/SPARK-25300
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: liuxian


The configuration parameter "spark.shuffle.service.enabled"  has defined in 
`package.scala`,  and it  is also used in many place, so we can replace it with 
`SHUFFLE_SERVICE_ENABLED`
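A small sketch of the intended replacement (Spark-internal code only; it assumes the 
SHUFFLE_SERVICE_ENABLED entry defined in org.apache.spark.internal.config and a 
SparkConf named conf):

import org.apache.spark.internal.config

// Before: a raw string literal repeated at every call site
val enabledOld = conf.getBoolean("spark.shuffle.service.enabled", false)

// After: the typed config entry defined once in package.scala
val enabledNew = conf.get(config.SHUFFLE_SERVICE_ENABLED)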



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25249) Add a unit test for OpenHashMap

2018-08-27 Thread liuxian (JIRA)
liuxian created SPARK-25249:
---

 Summary: Add a unit test for OpenHashMap
 Key: SPARK-25249
 URL: https://issues.apache.org/jira/browse/SPARK-25249
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.4.0
Reporter: liuxian


Adding a unit test for OpenHashMap can help developers distinguish 
between the default values 0/0.0/0L and a non-existent key.
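A hedged sketch of such a test (it assumes org.apache.spark.util.collection.OpenHashMap's 
contains/apply/update methods; the keys are illustrative):

import org.apache.spark.util.collection.OpenHashMap

val map = new OpenHashMap[String, Int]()
map("present") = 0                // an explicitly stored zero
assert(map("present") == 0)
assert(map.contains("present"))
assert(!map.contains("absent"))   // apply() on an absent key would also return 0,
                                  // so contains() is what tells the two cases apart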



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25166) Reduce the number of write operations for shuffle write.

2018-08-20 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25166:

Description: Currently, only one record is written to a buffer each time, 
which increases the number of copies.  (was: Currently, each record will be 
write to a buffer , which increases the number of copies.)

> Reduce the number of write operations for shuffle write.
> 
>
> Key: SPARK-25166
> URL: https://issues.apache.org/jira/browse/SPARK-25166
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> Currently, only one record is written to a buffer each time, which increases 
> the number of copies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25166) Reduce the number of write operations for shuffle write.

2018-08-20 Thread liuxian (JIRA)
liuxian created SPARK-25166:
---

 Summary: Reduce the number of write operations for shuffle write.
 Key: SPARK-25166
 URL: https://issues.apache.org/jira/browse/SPARK-25166
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.4.0
Reporter: liuxian


Currently, each record is written to a buffer one at a time, which increases the number 
of copies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24994) When the data type of the field is converted to other types, it can also support pushdown to parquet

2018-08-01 Thread liuxian (JIRA)
liuxian created SPARK-24994:
---

 Summary: When the data type of the field is converted to other 
types, it can also support pushdown to parquet
 Key: SPARK-24994
 URL: https://issues.apache.org/jira/browse/SPARK-24994
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: liuxian


For this statement: select * from table1 where a = 100;
the data type of `a` is `smallint`, but because the default data type of the literal 
100 is `int`, the column 'a' is cast to `int`.
In this case, the filter cannot be pushed down to Parquet.

In our business SQL statements we generally do not cast 100 to `smallint`, so we hope 
push down to Parquet can also be supported in this situation.
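A hedged illustration (the table and column come from the description; the explicit 
cast is only the manual workaround, not the proposed fix): pushdown currently works 
when the literal already matches the column type.

// The int literal forces a cast on the smallint column, so the filter is not pushed down
spark.sql("select * from table1 where a = 100").explain()

// Casting the literal keeps the comparison on smallint, which may allow Parquet pushdown
spark.sql("select * from table1 where a = cast(100 as smallint)").explain()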



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442176#comment-16442176
 ] 

liuxian edited comment on SPARK-23989 at 4/18/18 9:21 AM:
--

test({color:#6a8759}"groupBy"{color}) {
 {color:#808080} spark.conf.set("spark.sql.shuffle.partitions", 16777217){color}

{color:#cc7832}val {color}df1 = 
{color:#9876aa}Seq{color}(({color:#6a8759}"a"{color}{color:#cc7832}, 
{color}{color:#6897bb}1{color}{color:#cc7832}, 
{color}{color:#6897bb}0{color}{color:#cc7832}, 
{color}{color:#6a8759}"b"{color}){color:#cc7832}, 
{color}({color:#6a8759}"b"{color}{color:#cc7832}, 
{color}{color:#6897bb}2{color}{color:#cc7832}, 
{color}{color:#6897bb}4{color}{color:#cc7832}, 
{color}{color:#6a8759}"c"{color}){color:#cc7832}, 
{color}({color:#6a8759}"a"{color}{color:#cc7832}, 
{color}{color:#6897bb}2{color}{color:#cc7832}, 
{color}{color:#6897bb}3{color}{color:#cc7832}, 
{color}{color:#6a8759}"d"{color}))
 .toDF({color:#6a8759}"key"{color}{color:#cc7832}, 
{color}{color:#6a8759}"value1"{color}{color:#cc7832}, 
{color}{color:#6a8759}"value2"{color}{color:#cc7832}, 
{color}{color:#6a8759}"rest"{color})

checkAnswer(
 
df1.groupBy({color:#6a8759}"key"{color}).min({color:#6a8759}"value2"{color}){color:#cc7832},{color}
 {color:#9876aa}Seq{color}(Row({color:#6a8759}"a"{color}{color:#cc7832}, 
{color}{color:#6897bb}0{color}){color:#cc7832}, 
{color}Row({color:#6a8759}"b"{color}{color:#cc7832}, 
{color}{color:#6897bb}4{color}))
 )
 }

Because the number of partitions is too large, it will run for a long time.

The number of partitions is so large that the purpose is to go 
`SortShuffleWriter`

 


was (Author: 10110346):
test({color:#6a8759}"groupBy"{color}) {
{color:#808080} spark.conf.set("spark.sql.shuffle.partitions", 
16777217){color}{color:#808080}
{color} {color:#cc7832}val {color}df1 = 
{color:#9876aa}Seq{color}(({color:#6a8759}"a"{color}{color:#cc7832}, 
{color}{color:#6897bb}1{color}{color:#cc7832}, 
{color}{color:#6897bb}0{color}{color:#cc7832}, 
{color}{color:#6a8759}"b"{color}){color:#cc7832}, 
{color}({color:#6a8759}"b"{color}{color:#cc7832}, 
{color}{color:#6897bb}2{color}{color:#cc7832}, 
{color}{color:#6897bb}4{color}{color:#cc7832}, 
{color}{color:#6a8759}"c"{color}){color:#cc7832}, 
{color}({color:#6a8759}"a"{color}{color:#cc7832}, 
{color}{color:#6897bb}2{color}{color:#cc7832}, 
{color}{color:#6897bb}3{color}{color:#cc7832}, 
{color}{color:#6a8759}"d"{color}))
 .toDF({color:#6a8759}"key"{color}{color:#cc7832}, 
{color}{color:#6a8759}"value1"{color}{color:#cc7832}, 
{color}{color:#6a8759}"value2"{color}{color:#cc7832}, 
{color}{color:#6a8759}"rest"{color})

 checkAnswer(
 
df1.groupBy({color:#6a8759}"key"{color}).min({color:#6a8759}"value2"{color}){color:#cc7832},
{color} {color:#9876aa}Seq{color}(Row({color:#6a8759}"a"{color}{color:#cc7832}, 
{color}{color:#6897bb}0{color}){color:#cc7832}, 
{color}Row({color:#6a8759}"b"{color}{color:#cc7832}, 
{color}{color:#6897bb}4{color}))
 )
 }

Because the number of partitions is too large, it will run for a long time.

The number of partitions is so large that the purpose is to go 
`SortShuffleWriter`

 

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442176#comment-16442176
 ] 

liuxian commented on SPARK-23989:
-

test({color:#6a8759}"groupBy"{color}) {
{color:#808080} spark.conf.set("spark.sql.shuffle.partitions", 
16777217){color}{color:#808080}
{color} {color:#cc7832}val {color}df1 = 
{color:#9876aa}Seq{color}(({color:#6a8759}"a"{color}{color:#cc7832}, 
{color}{color:#6897bb}1{color}{color:#cc7832}, 
{color}{color:#6897bb}0{color}{color:#cc7832}, 
{color}{color:#6a8759}"b"{color}){color:#cc7832}, 
{color}({color:#6a8759}"b"{color}{color:#cc7832}, 
{color}{color:#6897bb}2{color}{color:#cc7832}, 
{color}{color:#6897bb}4{color}{color:#cc7832}, 
{color}{color:#6a8759}"c"{color}){color:#cc7832}, 
{color}({color:#6a8759}"a"{color}{color:#cc7832}, 
{color}{color:#6897bb}2{color}{color:#cc7832}, 
{color}{color:#6897bb}3{color}{color:#cc7832}, 
{color}{color:#6a8759}"d"{color}))
 .toDF({color:#6a8759}"key"{color}{color:#cc7832}, 
{color}{color:#6a8759}"value1"{color}{color:#cc7832}, 
{color}{color:#6a8759}"value2"{color}{color:#cc7832}, 
{color}{color:#6a8759}"rest"{color})

 checkAnswer(
 
df1.groupBy({color:#6a8759}"key"{color}).min({color:#6a8759}"value2"{color}){color:#cc7832},
{color} {color:#9876aa}Seq{color}(Row({color:#6a8759}"a"{color}{color:#cc7832}, 
{color}{color:#6897bb}0{color}){color:#cc7832}, 
{color}Row({color:#6a8759}"b"{color}{color:#cc7832}, 
{color}{color:#6897bb}4{color}))
 )
 }

Because the number of partitions is too large, it will run for a long time.

The number of partitions is so large that the purpose is to go 
`SortShuffleWriter`

 

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23989:

Attachment: (was: 无标题2.png)

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441982#comment-16441982
 ] 

liuxian commented on SPARK-23989:
-

We assume that: numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
> Attachments: 无标题2.png
>
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441980#comment-16441980
 ] 

liuxian commented on SPARK-23989:
-

I think 'SortShuffleWriter' should adapt to any shuffle-write scenario.

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
> Attachments: 无标题2.png
>
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441952#comment-16441952
 ] 

liuxian edited comment on SPARK-23989 at 4/18/18 6:21 AM:
--

1. Make 'BypassMergeSortShuffleHandle' and 'SerializedShuffleHandle' disabled:

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency) && false) {
    // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
    // need map-side aggregation, then write numPartitions files directly and just concatenate
    // them at the end. This avoids doing serialization and deserialization twice to merge
    // together the spilled files, which would happen with the normal code path. The downside is
    // having multiple files open at a time and thus more memory allocated to buffers.
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency) && false) {
    // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}

2. Run this unit test in 'DataFrameAggregateSuite.scala':

test("SPARK-21580 ints in aggregation expressions are taken as group-by ordinal.")

3. I have been debugging in IDEA and grabbed this information:

buffer = {PartitionedPairBuffer@9817}
  capacity = 64
  curSize = 2
  data = {Object[128]@9832}
    0 = {Tuple2@9834} "(3,3)"
    1 = {UnsafeRow@9835} "[0,2,2]"
    2 = {Tuple2@9841} "(4,4)"
    3 = {UnsafeRow@9835} "[0,2,2]"

 

 


was (Author: 10110346):
1.  Make 'BypassMergeSortShuffleHandle' and 'SerializedShuffleHandle' disable

{color:#cc7832}override def 
{color}{color:#ffc66d}registerShuffle{color}[{color:#4e807d}K{color}{color:#cc7832},
 {color}{color:#4e807d}V{color}{color:#cc7832}, {color}{color:#4e807d}C{color}](
 shuffleId: {color:#cc7832}Int,
{color} numMaps: {color:#cc7832}Int,
{color} dependency: ShuffleDependency[{color:#4e807d}K{color}{color:#cc7832}, 
{color}{color:#4e807d}V{color}{color:#cc7832}, 
{color}{color:#4e807d}C{color}]): ShuffleHandle = {
 {color:#cc7832}if 
{color}(SortShuffleWriter.shouldBypassMergeSort(conf{color:#cc7832}, 
{color}dependency){color:#14892c} && false {color}) {
 {color:#808080}// If there are fewer than 
spark.shuffle.sort.bypassMergeThreshold partitions and we don't
{color}{color:#808080} // need map-side aggregation, then write numPartitions 
files directly and just concatenate
{color}{color:#808080} // them at the end. This avoids doing serialization and 
deserialization twice to merge
{color}{color:#808080} // together the spilled files, which would happen with 
the normal code path. The downside is
{color}{color:#808080} // having multiple files open at a time and thus more 
memory allocated to buffers.
{color} {color:#cc7832}new 
{color}BypassMergeSortShuffleHandle[{color:#4e807d}K{color}{color:#cc7832}, 
{color}{color:#4e807d}V{color}](
 shuffleId{color:#cc7832}, {color}numMaps{color:#cc7832}, 
{color}dependency.asInstanceOf[ShuffleDependency[{color:#4e807d}K{color}{color:#cc7832},
 {color}{color:#4e807d}V{color}{color:#cc7832}, 
{color}{color:#4e807d}V{color}]])
 } {color:#cc7832}else if 
{color}(SortShuffleManager.canUseSerializedShuffle(dependency) 
{color:#14892c}&& 

[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441952#comment-16441952
 ] 

liuxian commented on SPARK-23989:
-

1. Make 'BypassMergeSortShuffleHandle' and 'SerializedShuffleHandle' disabled:

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency) && false) {
    // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
    // need map-side aggregation, then write numPartitions files directly and just concatenate
    // them at the end. This avoids doing serialization and deserialization twice to merge
    // together the spilled files, which would happen with the normal code path. The downside is
    // having multiple files open at a time and thus more memory allocated to buffers.
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency) && false) {
    // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}

2. Run this unit test in 'DataFrameAggregateSuite.scala':

test("SPARK-21580 ints in aggregation expressions are taken as group-by ordinal.")

3. I have been debugging in IDEA and grabbed this information:

buffer = {PartitionedPairBuffer@9817}
  capacity = 64
  curSize = 2
  data = {Object[128]@9832}
    0 = {Tuple2@9834} "(3,3)"
    1 = {UnsafeRow@9835} "[0,2,2]"
    2 = {Tuple2@9841} "(4,4)"
    3 = {UnsafeRow@9835} "[0,2,2]"

 

 

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
> Attachments: 无标题2.png
>
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-18 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23989:

Attachment: 无标题2.png

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
> Attachments: 无标题2.png
>
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-16 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440250#comment-16440250
 ] 

liuxian commented on SPARK-23989:
-

If we disable 'BypassMergeSortShuffleHandle' and 'SerializedShuffleHandle', a lot of 
unit tests in 'DataFrameAggregateSuite.scala' will fail.

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23992) ShuffleDependency does not need to be deserialized every time

2018-04-16 Thread liuxian (JIRA)
liuxian created SPARK-23992:
---

 Summary: ShuffleDependency does not need to be deserialized every 
time
 Key: SPARK-23992
 URL: https://issues.apache.org/jira/browse/SPARK-23992
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian


Within the same stage, 'ShuffleDependency' does not need to be deserialized every 
time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-16 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439231#comment-16439231
 ] 

liuxian edited comment on SPARK-23989 at 4/16/18 10:18 AM:
---

For `SortShuffleWriter`, `records: Iterator[Product2[K, V]]` is a key-value pair 
iterator, but the value is of 'UnsafeRow' type.

For example, when we insert the first record into `PartitionedPairBuffer`, we only 
save the 'AnyRef'; but the 'AnyRef' of the next record (only the value, not the key) 
is the same as the first record's, so the first record is overwritten.


was (Author: 10110346):
For {color:#33}`SortShuffleWriter`{color},  `records: 
{color:#4e807d}Iterator{color}[Product2[{color:#4e807d}K{color}{color:#cc7832}, 
{color}{color:#4e807d}V{color}]]` is key-value pair, but the value is 
'UnsafeRow' type.

For example ,we insert the first record  {color:#33}into 
`PartitionedPairBuffer`, we only save the  '{color:#cc7832}AnyRef{color}',   
but the {color:#33} '{color:#cc7832}AnyRef{color}'{color}  of  next  
{color}record(only value, not key)  is same as the first record  , so the first 
record  is overwritten.
h1. overwritten

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-16 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439231#comment-16439231
 ] 

liuxian commented on SPARK-23989:
-

For `SortShuffleWriter`, `records: Iterator[Product2[K, V]]` is a key-value pair 
iterator, but the value is of 'UnsafeRow' type.

For example, when we insert the first record into `PartitionedPairBuffer`, we only 
save the 'AnyRef'; but the 'AnyRef' of the next record (only the value, not the key) 
is the same as the first record's, so the first record is overwritten.

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-16 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439148#comment-16439148
 ] 

liuxian edited comment on SPARK-23989 at 4/16/18 9:00 AM:
--

[~joshrosen] [~cloud_fan]


was (Author: 10110346):
[~joshrosen]

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-16 Thread liuxian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439148#comment-16439148
 ] 

liuxian commented on SPARK-23989:
-

[~joshrosen]

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the values in 'records' are `UnsafeRow`, so the values will be overwritten



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-16 Thread liuxian (JIRA)
liuxian created SPARK-23989:
---

 Summary: When using `SortShuffleWriter`, the data will be 
overwritten
 Key: SPARK-23989
 URL: https://issues.apache.org/jira/browse/SPARK-23989
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian


When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.

For this function:

override def write(records: Iterator[Product2[K, V]])

the values in 'records' are `UnsafeRow`, so the values will be overwritten



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-16 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23989:

Description: 
When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.

For this function:

override def write(records: Iterator[Product2[K, V]])

the values in 'records' are `UnsafeRow`, so the values will be overwritten

  was:
When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.

For this function:

override def write(records: Iterator[Product2[K, V]])

the values in 'records' are `UnsafeRow`, so the values will be overwritten


> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the values in 'records' are `UnsafeRow`, so the values will be overwritten



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23744) Memory leak in ReadableChannelFileRegion

2018-03-19 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23744:

Description: In the class _ReadableChannelFileRegion_, the _buffer_ is direct
memory, so we should modify _deallocate_ to free it  (was: In the class
_ReadableChannelFileRegion_, the _buffer_ is direct memory, we should modify
deallocate to free it)

> Memory leak in ReadableChannelFileRegion
> 
>
> Key: SPARK-23744
> URL: https://issues.apache.org/jira/browse/SPARK-23744
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> In the class _ReadableChannelFileRegion_, the _buffer_ is direct memory, so we
> should modify _deallocate_ to free it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23744) Memory leak in ReadableChannelFileRegion

2018-03-19 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23744:

Description: In the class _ReadableChannelFileRegion_, the _buffer_ is direct
memory, so we should modify deallocate to free it  (was: In the class
`_ReadableChannelFileRegion_`, the `buffer` is direct memory, we should modify
`_deallocate_` to free it)

> Memory leak in ReadableChannelFileRegion
> 
>
> Key: SPARK-23744
> URL: https://issues.apache.org/jira/browse/SPARK-23744
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> In the class _ReadableChannelFileRegion_, the _buffer_ is direct memory, so we
> should modify deallocate to free it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23744) Memory leak in ReadableChannelFileRegion

2018-03-19 Thread liuxian (JIRA)
liuxian created SPARK-23744:
---

 Summary: Memory leak in ReadableChannelFileRegion
 Key: SPARK-23744
 URL: https://issues.apache.org/jira/browse/SPARK-23744
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian


In the class `_ReadableChannelFileRegion_`, the `buffer` is direct memory, so we
should modify `_deallocate_` to free it
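
As a rough illustration of the fix direction (the class and helper names below are illustrative, not Spark's actual code), a file-region-like wrapper that owns a direct staging buffer has to release it explicitly in deallocate():

{code:scala}
import java.nio.ByteBuffer

// Illustrative sketch only: a wrapper that owns a direct staging buffer.
class ReadableChannelFileRegionSketch(bufferSize: Int) {
  private val buffer: ByteBuffer = ByteBuffer.allocateDirect(bufferSize)

  // transferTo(...) would stage bytes from the channel through `buffer` here.

  // Without an explicit release, the off-heap memory lives until the wrapper
  // itself is garbage-collected, which is the leak described above.
  def deallocate(): Unit = freeDirectBuffer(buffer)

  // Hypothetical best-effort helper: invoke the JDK's internal cleaner
  // reflectively; a real implementation would use the project's own utility
  // for disposing direct buffers.
  private def freeDirectBuffer(buf: ByteBuffer): Unit = {
    if (buf.isDirect) {
      try {
        val cleanerMethod = buf.getClass.getMethod("cleaner")
        cleanerMethod.setAccessible(true)
        val cleaner = cleanerMethod.invoke(buf)
        cleaner.getClass.getMethod("clean").invoke(cleaner)
      } catch {
        case _: Throwable => // fall back to letting GC reclaim it eventually
      }
    }
  }
}
{code}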



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23651) Add a check for host name

2018-03-15 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-23651.
-
Resolution: Fixed

> Add a  check for host name
> --
>
> Key: SPARK-23651
> URL: https://issues.apache.org/jira/browse/SPARK-23651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> I encountered an error like this:
> _org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@ci_164:42849_
>     _at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)_
>     _at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)_
>     _at 
> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)_
>     _at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)_
>     _at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)_
>     _at org.apache.spark.executor.Executor.(Executor.scala:155)_
>     _at 
> org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)_
>     _at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)_
>     _at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)_
>  
> I didn't know why this _URL_ (spark://HeartbeatReceiver@ci_164:42849) was
> invalid, so I think we should give a clearer message for this error.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23651) Add a check for host name

2018-03-12 Thread liuxian (JIRA)
liuxian created SPARK-23651:
---

 Summary: Add a  check for host name
 Key: SPARK-23651
 URL: https://issues.apache.org/jira/browse/SPARK-23651
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: liuxian


I encountered an error like this:

_org.apache.spark.SparkException: Invalid Spark URL: 
spark://HeartbeatReceiver@ci_164:42849_
    _at 
org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)_
    _at 
org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)_
    _at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)_
    _at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)_
    _at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)_
    _at org.apache.spark.executor.Executor.(Executor.scala:155)_
    _at 
org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)_
    _at 
org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)_
    _at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)_

 

I didn't know why this _URL_ (spark://HeartbeatReceiver@ci_164:42849) was
invalid, so I think we should give a clearer message for this error.
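
For illustration, a minimal sketch (plain JDK calls, not Spark's RpcEndpointAddress code) of why this URL is rejected and how an explicit host-name check could give a clearer message; java.net.URI returns a null host when the host name contains an underscore:

{code:scala}
import java.net.URI

object HostNameCheckDemo {
  def main(args: Array[String]): Unit = {
    val url = "spark://HeartbeatReceiver@ci_164:42849"
    val uri = new URI(url)

    // URI parses the authority, but '_' is not legal in a host name, so getHost is null.
    println(s"host = ${uri.getHost}")   // prints: host = null

    if (uri.getHost == null) {
      // A clearer reminder than the bare "Invalid Spark URL" message.
      sys.error(
        s"Invalid Spark URL '$url': the host part could not be parsed. " +
          "Host names containing '_' are not valid; consider setting " +
          "SPARK_LOCAL_HOSTNAME to a legal host name.")
    }
  }
}
{code}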

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory

2018-03-05 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-23516.
-
Resolution: Invalid

> I think it is unnecessary to transfer unroll memory to storage memory 
> --
>
> Key: SPARK-23516
> URL: https://issues.apache.org/jira/browse/SPARK-23516
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>
> In fact, unroll memory is also storage memory, so I think it is unnecessary to
> actually release the unroll memory and then acquire storage memory again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation

2018-02-27 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23532:

Description: 
Currently Spark on Yarn supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled, 
refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality, refer to
https://issues.apache.org/jira/browse/SPARK-16944

It would be better that Standalone can also support this feature.

  was:
Currently Spark on Yarn supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled, 
Refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality, refer to
https://issues.apache.org/jira/browse/SPARK-16944

It would be better that Standalone can also support this feature.


> [STANDALONE] Improve data locality when launching new executors for dynamic 
> allocation
> --
>
> Key: SPARK-23532
> URL: https://issues.apache.org/jira/browse/SPARK-23532
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> Currently Spark on Yarn supports better data locality by considering the 
> preferred locations of the pending tasks when dynamic allocation is enabled, 
> refer to https://issues.apache.org/jira/browse/SPARK-4352.
> Mesos also supports data locality, refer to
> https://issues.apache.org/jira/browse/SPARK-16944
> It would be better that Standalone can also support this feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation

2018-02-27 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23532:

Description: 
Currently Spark on Yarn supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled, 
Refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality, refer to
https://issues.apache.org/jira/browse/SPARK-16944

It would be better that Standalone can also support this feature.

  was:
Currently Spark on Yarn supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled, 
Refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality, refer to
https://issues.apache.org/jira/browse/SPARK-16944

It would be better that Standalone can also support this feature.


> [STANDALONE] Improve data locality when launching new executors for dynamic 
> allocation
> --
>
> Key: SPARK-23532
> URL: https://issues.apache.org/jira/browse/SPARK-23532
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> Currently Spark on Yarn supports better data locality by considering the 
> preferred locations of the pending tasks when dynamic allocation is enabled, 
> Refer to https://issues.apache.org/jira/browse/SPARK-4352.
> Mesos also supports data locality, refer to
> https://issues.apache.org/jira/browse/SPARK-16944
> It would be better that Standalone can also support this feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation

2018-02-27 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23532:

Description: 
Currently Spark on Yarn supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled, 
Refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality, refer to
https://issues.apache.org/jira/browse/SPARK-16944

It would be better that Standalone can also support this feature.

  was:
Currently Spark on Yarn supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled, 
Refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality, refer to
https://issues.apache.org/jira/browse/SPARK-16944

It would be better that Standalone can also support this feature.


> [STANDALONE] Improve data locality when launching new executors for dynamic 
> allocation
> --
>
> Key: SPARK-23532
> URL: https://issues.apache.org/jira/browse/SPARK-23532
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> Currently Spark on Yarn supports better data locality by considering the 
> preferred locations of the pending tasks when dynamic allocation is enabled, 
> Refer to https://issues.apache.org/jira/browse/SPARK-4352.
> Mesos also supports data locality, refer to
> https://issues.apache.org/jira/browse/SPARK-16944
> It would be better that Standalone can also support this feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation

2018-02-27 Thread liuxian (JIRA)
liuxian created SPARK-23532:
---

 Summary: [STANDALONE] Improve data locality when launching new 
executors for dynamic allocation
 Key: SPARK-23532
 URL: https://issues.apache.org/jira/browse/SPARK-23532
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian


Currently Spark on Yarn supports better data locality by considering the 
preferred locations of the pending tasks when dynamic allocation is enabled, 
Refer to https://issues.apache.org/jira/browse/SPARK-4352.

Mesos also supports data locality, refer to
https://issues.apache.org/jira/browse/SPARK-16944

It would be better that Standalone can also support this feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory

2018-02-26 Thread liuxian (JIRA)
liuxian created SPARK-23516:
---

 Summary: I think it is unnecessary to transfer unroll memory to 
storage memory 
 Key: SPARK-23516
 URL: https://issues.apache.org/jira/browse/SPARK-23516
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian


In fact, unroll memory is also storage memory, so I think it is unnecessary to
actually release the unroll memory and then acquire storage memory again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory

2018-02-21 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian resolved SPARK-23404.
-
Resolution: Invalid

> When the underlying buffers are already direct, we should copy them to the 
> heap memory
> --
>
> Key: SPARK-23404
> URL: https://issues.apache.org/jira/browse/SPARK-23404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>
> If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we
> should copy them to the heap memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory

2018-02-12 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23404:

Description: If the memory mode is _ON_HEAP_, when the underlying buffers
are direct, we should copy them to the heap memory.  (was: If the memory mode
is _ON_HEAP_, when the underlying buffers are direct, we should copy it to the
heap memory.)

> When the underlying buffers are already direct, we should copy them to the 
> heap memory
> --
>
> Key: SPARK-23404
> URL: https://issues.apache.org/jira/browse/SPARK-23404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>
> If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we
> should copy them to the heap memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory

2018-02-12 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23404:

Summary: When the underlying buffers are already direct, we should copy 
them to the heap memory  (was: When the underlying buffers are already direct, 
we should copy it to the heap memory)

> When the underlying buffers are already direct, we should copy them to the 
> heap memory
> --
>
> Key: SPARK-23404
> URL: https://issues.apache.org/jira/browse/SPARK-23404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>
> If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we
> should copy it to the heap memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23404) When the underlying buffers are already direct, we should copy it to the heap memory

2018-02-12 Thread liuxian (JIRA)
liuxian created SPARK-23404:
---

 Summary: When the underlying buffers are already direct, we should 
copy it to the heap memory
 Key: SPARK-23404
 URL: https://issues.apache.org/jira/browse/SPARK-23404
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian


If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we
should copy it to the heap memory.
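
For illustration, a minimal sketch (plain NIO, not Spark's code) of copying a direct buffer into a heap-backed one when the requested memory mode is on-heap:

{code:scala}
import java.nio.ByteBuffer

object DirectToHeapCopy {
  def toHeap(buf: ByteBuffer): ByteBuffer = {
    if (!buf.isDirect) {
      buf                                             // already on-heap, nothing to do
    } else {
      val heap = ByteBuffer.allocate(buf.remaining()) // heap-backed buffer
      heap.put(buf.duplicate())                       // duplicate() leaves the source position intact
      heap.flip()                                     // make the copy readable from the start
      heap
    }
  }

  def main(args: Array[String]): Unit = {
    val direct = ByteBuffer.allocateDirect(8)
    direct.putLong(42L)
    direct.flip()
    val onHeap = toHeap(direct)
    println(s"isDirect=${onHeap.isDirect}, value=${onHeap.getLong}")  // isDirect=false, value=42
  }
}
{code}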



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23391) It may lead to overflow for some integer multiplication

2018-02-11 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23391:

Priority: Minor  (was: Major)

> It may lead to overflow for some integer multiplication 
> 
>
> Key: SPARK-23391
> URL: https://issues.apache.org/jira/browse/SPARK-23391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>
> In the {{getBlockData}}, {{blockId.reduceId}} is the {{Int}} type; when it is
> greater than 2^28, {{blockId.reduceId*8}} will overflow.
> In _decompress0_, _len_ and _unitSize_ are the {{Int}} type, so _len * unitSize_
> may lead to overflow.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23391) It may lead to overflow for some integer multiplication

2018-02-11 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23391:

Description: 
In the {{getBlockData}}, {{blockId.reduceId}} is the {{Int}} type; when it is
greater than 2^28, {{blockId.reduceId*8}} will overflow.

In _decompress0_, _len_ and _unitSize_ are the {{Int}} type, so _len * unitSize_
may lead to overflow.

 

 

  was:
In the {{getBlockData}}, {{blockId.reduceId}} is the {{Int}} type; when it is
greater than 2^28, {{blockId.reduceId*8}} will overflow.

In _decompress0_, _len_ and _unitSize_ are the {{Int}} type, so _len * unitSize_
may lead to overflow.

 

 


> It may lead to overflow for some integer multiplication 
> 
>
> Key: SPARK-23391
> URL: https://issues.apache.org/jira/browse/SPARK-23391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> In the {{getBlockData}}, {{blockId.reduceId}} is the {{Int}} type; when it is
> greater than 2^28, {{blockId.reduceId*8}} will overflow.
> In _decompress0_, _len_ and _unitSize_ are the {{Int}} type, so _len * unitSize_
> may lead to overflow.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23391) It may lead to overflow for some integer multiplication

2018-02-11 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23391:

Description: 
In the {{getBlockData}}, {{blockId.reduceId}} is the {{Int}} type; when it is
greater than 2^28, {{blockId.reduceId*8}} will overflow.

In _decompress0_, _len_ and _unitSize_ are the {{Int}} type, so _len * unitSize_
may lead to overflow.

 

 

  was:
In the {{getBlockData}}, {{blockId.reduceId}} is the {{Int}} type; when it is
greater than 2^28, {{blockId.reduceId*8}} will overflow.

In _decompress0_, _len_ and _unitSize_ are the {{Int}} type, so _len * unitSize_
may lead to overflow.

 

 


> It may lead to overflow for some integer multiplication 
> 
>
> Key: SPARK-23391
> URL: https://issues.apache.org/jira/browse/SPARK-23391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Major
>
> In the {{getBlockData}}, {{blockId.reduceId}} is the {{Int}} type; when it is
> greater than 2^28, {{blockId.reduceId*8}} will overflow.
> In _decompress0_, _len_ and _unitSize_ are the {{Int}} type, so _len * unitSize_
> may lead to overflow.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23391) It may lead to overflow for some integer multiplication

2018-02-11 Thread liuxian (JIRA)
liuxian created SPARK-23391:
---

 Summary: It may lead to overflow for some integer multiplication 
 Key: SPARK-23391
 URL: https://issues.apache.org/jira/browse/SPARK-23391
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian


In the {{getBlockData}}, {{blockId.reduceId}} is the {{Int}} type; when it is
greater than 2^28, {{blockId.reduceId*8}} will overflow.

In _decompress0_, _len_ and _unitSize_ are the {{Int}} type, so _len * unitSize_
may lead to overflow.
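
A small sketch of the Int multiplications described above (plain Scala, not Spark's code): an offset such as reduceId * 8 has to be computed in Long, otherwise it wraps around once reduceId exceeds 2^28 and points at a garbage file offset.

{code:scala}
object ReduceIdOffsetOverflowDemo {
  def main(args: Array[String]): Unit = {
    val reduceId: Int = 300000000                 // more than 2^28 reduce partitions

    val badOffset: Int   = reduceId * 8           // Int math: wraps to a negative value
    val goodOffset: Long = reduceId.toLong * 8    // widen before multiplying

    println(s"Int offset : $badOffset")           // -1894967296
    println(s"Long offset: $goodOffset")          // 2400000000
  }
}
{code}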

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23389) When the shuffle dependency specifies aggregation, and `dependency.mapSideCombine=false`, we should be able to use serialized sorting.

2018-02-11 Thread liuxian (JIRA)
liuxian created SPARK-23389:
---

 Summary: When the shuffle dependency specifies aggregation, and
`dependency.mapSideCombine=false`, we should be able to use serialized sorting.
 Key: SPARK-23389
 URL: https://issues.apache.org/jira/browse/SPARK-23389
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian


When the shuffle dependency specifies aggregation and
`dependency.mapSideCombine=false`, there is no need for aggregation and sorting
on the map side, so we should be able to use serialized sorting.
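
A hedged sketch of the relaxed eligibility check (the names mirror Spark's SortShuffleManager, but this is illustrative logic, not the project's actual patch): serialized shuffle needs a relocatable serializer, no map-side combine, and a partition count under the encoding limit, so a reduce-side-only aggregator would no longer disqualify it.

{code:scala}
object SerializedShuffleCheck {
  // Assumed limit: partition ids must fit the packed-pointer encoding (about 2^24).
  val MaxSerializedModePartitions: Int = 1 << 24

  def canUseSerializedShuffle(
      serializerSupportsRelocation: Boolean,
      mapSideCombine: Boolean,
      numPartitions: Int): Boolean = {
    serializerSupportsRelocation &&
      !mapSideCombine &&                            // relaxed: only map-side combine disqualifies
      numPartitions <= MaxSerializedModePartitions
  }

  def main(args: Array[String]): Unit = {
    // A dependency with a reduce-side aggregator but mapSideCombine = false
    // would now be eligible for the serialized (unsafe) shuffle path.
    println(canUseSerializedShuffle(
      serializerSupportsRelocation = true, mapSideCombine = false, numPartitions = 200))
  }
}
{code}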



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23358) When the number of partitions is greater than 2^28, it will result in an error result

2018-02-08 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23358:

Description: 
In `checkIndexAndDataFile`, the _blocks_ variable is of _Int_ type; when it is
greater than 2^28, `blocks*8` will overflow, and this produces an incorrect
result.
In fact, `blocks` is actually the number of partitions.

  was:
In `checkIndexAndDataFile`, the `blocks` variable is of `Int` type; when it is
greater than 2^28, `blocks*8` will overflow, and this produces an incorrect
result.
In fact, `blocks` is actually the number of partitions.


> When the number of partitions is greater than 2^28, it will result in an 
> error result
> -
>
> Key: SPARK-23358
> URL: https://issues.apache.org/jira/browse/SPARK-23358
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Major
>
> In `checkIndexAndDataFile`, the _blocks_ variable is of _Int_ type; when it is
> greater than 2^28, `blocks*8` will overflow, and this produces an incorrect
> result.
> In fact, `blocks` is actually the number of partitions.
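
A small sketch of the overflow described above (plain Scala, not Spark's code): with Int arithmetic, (blocks + 1) * 8 wraps around once blocks exceeds 2^28, so a length check against the index file silently compares with a wrong, even negative, number; widening to Long first avoids it.

{code:scala}
object PartitionCountOverflowDemo {
  def main(args: Array[String]): Unit = {
    val blocks: Int = (1 << 28) + 1               // a partition count just above 2^28

    val wrong: Int  = (blocks + 1) * 8            // Int math: overflows
    val right: Long = (blocks + 1) * 8L           // widen to Long before multiplying

    println(s"Int  result: $wrong")               // -2147483632 (wrapped around)
    println(s"Long result: $right")               // 2147483664
  }
}
{code}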



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


