[jira] [Created] (SPARK-27552) The configuration `hive.exec.stagingdir` is invalid on Windows OS
liuxian created SPARK-27552: --- Summary: The configuration `hive.exec.stagingdir` is invalid on Windows OS Key: SPARK-27552 URL: https://issues.apache.org/jira/browse/SPARK-27552 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: liuxian If we set _{{hive.exec.stagingdir=.test-staging\tmp}}_, the staging directory is still _{{.hive-staging}}_ on Windows OS.
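A minimal reproduction sketch, assuming a Hive-enabled session and existing `source_table`/`target_table` (both names hypothetical): the insert below should stage its output under the configured directory, but per this report Windows still creates `.hive-staging`.

{code:scala}
import org.apache.spark.sql.SparkSession

// Configure a custom staging directory, then trigger a Hive insert.
val spark = SparkSession.builder()
  .appName("StagingDirRepro")
  .config("hive.exec.stagingdir", ".test-staging\\tmp")
  .enableHiveSupport()
  .getOrCreate()

// Per the report, the staging dir for this write is still `.hive-staging`
// on Windows, ignoring the value configured above.
spark.sql("INSERT OVERWRITE TABLE target_table SELECT * FROM source_table")
{code}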
[jira] [Resolved] (SPARK-27173) For hive parquet table, codecs (lz4, brotli, zstd) are not available
[ https://issues.apache.org/jira/browse/SPARK-27173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian resolved SPARK-27173. - Resolution: Won't Fix > For hive parquet table, codecs (lz4, brotli, zstd) are not available > --- > > Key: SPARK-27173 > URL: https://issues.apache.org/jira/browse/SPARK-27173 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > From _parquet.hadoop.metadata.CompressionCodecName_ (parquet-hadoop-bundle-1.6.0.jar), we can see that for a Hive parquet table only *snappy*, *gzip* > and *lzo* are supported
[jira] [Updated] (SPARK-27256) If the configuration is used to set the number of bytes, we'd better use `bytesConf`.
[ https://issues.apache.org/jira/browse/SPARK-27256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-27256: Description: Currently, if we want to configure `spark.sql.files.maxPartitionBytes` to 256 megabytes, we must set `spark.sql.files.maxPartitionBytes=268435456`, which is very unfriendly to users. And if we set it like this: `spark.sql.files.maxPartitionBytes=256M`, we will encounter this exception: _Exception in thread "main" java.lang.IllegalArgumentException: spark.sql.files.maxPartitionBytes should be long, but was 256M_ _at org.apache.spark.internal.config.ConfigHelpers$.toNumber(ConfigBuilder.scala:34)_ was: Currently, if we want to configure `spark.sql.files.maxPartitionBytes` to 256 megabytes, we must set `spark.sql.files.maxPartitionBytes=268435456`, which is very unfriendly to users. And if we set it like this: `spark.sql.files.maxPartitionBytes=256M`, we will encounter this exception: _Exception in thread "main" java.lang.IllegalArgumentException: spark.sql.files.maxPartitionBytes should be long, but was 128M_ _at org.apache.spark.internal.config.ConfigHelpers$.toNumber(ConfigBuilder.scala:34)_ > If the configuration is used to set the number of bytes, we'd better use > `bytesConf`. > -- > > Key: SPARK-27256 > URL: https://issues.apache.org/jira/browse/SPARK-27256 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > Currently, if we want to configure `spark.sql.files.maxPartitionBytes` to > 256 megabytes, we must set `spark.sql.files.maxPartitionBytes=268435456`, > which is very unfriendly to users. > And if we set it like this: `spark.sql.files.maxPartitionBytes=256M`, we > will encounter this exception: > _Exception in thread "main" java.lang.IllegalArgumentException: > spark.sql.files.maxPartitionBytes should be long, but was 256M_ > _at > org.apache.spark.internal.config.ConfigHelpers$.toNumber(ConfigBuilder.scala:34)_
[jira] [Created] (SPARK-27256) If the configuration is used to set the number of bytes, we'd better use `bytesConf`.
liuxian created SPARK-27256: --- Summary: If the configuration is used to set the number of bytes, we'd better use `bytesConf`. Key: SPARK-27256 URL: https://issues.apache.org/jira/browse/SPARK-27256 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 3.0.0 Reporter: liuxian Currently, if we want to configure `spark.sql.files.maxPartitionBytes` to 256 megabytes, we must set `spark.sql.files.maxPartitionBytes=268435456`, which is very unfriendly to users. And if we set it like this: `spark.sql.files.maxPartitionBytes=256M`, we will encounter this exception: _Exception in thread "main" java.lang.IllegalArgumentException: spark.sql.files.maxPartitionBytes should be long, but was 128M_ _at org.apache.spark.internal.config.ConfigHelpers$.toNumber(ConfigBuilder.scala:34)_
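A sketch of the suggested change, using Spark's internal ConfigBuilder API (the default shown is illustrative): declaring the entry with `bytesConf` lets users write suffixed values such as `256m` instead of raw longs.

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit

// A byte-sized entry declared via bytesConf parses "256m"/"1g" style values
// into bytes, instead of rejecting anything that is not a plain long.
val FILES_MAX_PARTITION_BYTES = ConfigBuilder("spark.sql.files.maxPartitionBytes")
  .doc("The maximum number of bytes to pack into a single partition when reading files.")
  .bytesConf(ByteUnit.BYTE)
  .createWithDefault(128 * 1024 * 1024L)
{code}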
[jira] [Updated] (SPARK-27238) In the same APP, maybe some hive Parquet(ORC) tables can't use the built-in Parquet(ORC) reader and writer
[ https://issues.apache.org/jira/browse/SPARK-27238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-27238: Description: In the same APP, TableA and TableB are both hive Parquet tables, but TableA can't use the built-in Parquet reader and writer. In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control this well, so I think we can add a fine-grained configuration to handle this case was: In the same APP, TableA and TableB are both hive parquet tables, but TableA can't use the built-in Parquet reader and writer. In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control this well, so I think we can add a fine-grained configuration to handle this case > In the same APP, maybe some hive Parquet(ORC) tables can't use the built-in > Parquet(ORC) reader and writer > -- > > Key: SPARK-27238 > URL: https://issues.apache.org/jira/browse/SPARK-27238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > In the same APP, TableA and TableB are both hive Parquet tables, but TableA > can't use the built-in Parquet reader and writer. > In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control > this well, so I think we can add a fine-grained configuration to handle this > case
[jira] [Updated] (SPARK-27238) In the same APP, maybe some hive Parquet(ORC) tables can't use the built-in Parquet(ORC) reader and writer
[ https://issues.apache.org/jira/browse/SPARK-27238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-27238: Summary: In the same APP, maybe some hive Parquet(ORC) tables can't use the built-in Parquet(ORC) reader and writer (was: In the same APP, maybe some hive parquet tables can't use the built-in Parquet reader and writer) > In the same APP, maybe some hive Parquet(ORC) tables can't use the built-in > Parquet(ORC) reader and writer > -- > > Key: SPARK-27238 > URL: https://issues.apache.org/jira/browse/SPARK-27238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > In the same APP, TableA and TableB are both hive parquet tables, but TableA > can't use the built-in Parquet reader and writer. > In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control > this well, so I think we can add a fine-grained configuration to handle this > case
[jira] [Updated] (SPARK-27238) In the same APP, maybe some hive parquet tables can't use the built-in Parquet reader and writer
[ https://issues.apache.org/jira/browse/SPARK-27238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-27238: Description: In the same APP, TableA and TableB are both hive parquet tables, but TableA can't use the built-in Parquet reader and writer. In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control this well, so I think we can add a fine-grained configuration to handle this case was: In the same APP, TableA and TableB are both hive parquet tables, but TableA can't use the built-in Parquet reader and writer. In this situation, spark.sql.hive.convertMetastoreParquet can't control this well, so I think we can add a fine-grained configuration to handle this case > In the same APP, maybe some hive parquet tables can't use the built-in > Parquet reader and writer > > > Key: SPARK-27238 > URL: https://issues.apache.org/jira/browse/SPARK-27238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > In the same APP, TableA and TableB are both hive parquet tables, but TableA > can't use the built-in Parquet reader and writer. > In this situation, _spark.sql.hive.convertMetastoreParquet_ can't control > this well, so I think we can add a fine-grained configuration to handle this > case
[jira] [Created] (SPARK-27238) In the same APP, maybe some hive parquet tables can't use the built-in Parquet reader and writer
liuxian created SPARK-27238: --- Summary: In the same APP, maybe some hive parquet tables can't use the built-in Parquet reader and writer Key: SPARK-27238 URL: https://issues.apache.org/jira/browse/SPARK-27238 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: liuxian In the same APP, TableA and TableB are both hive parquet tables, but TableA can't use the built-in Parquet reader and writer. In this situation, spark.sql.hive.convertMetastoreParquet can't control this well, so I think we can add a fine-grained configuration to handle this case
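A sketch of the situation and the proposed direction, assuming an active SparkSession named `spark`; the fine-grained key shown is hypothetical, not an existing Spark configuration.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Today the conversion toggle is session-wide, so TableA and TableB are
// treated alike:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// A fine-grained knob, as this issue proposes, might look like the following
// (hypothetical name, never part of Spark):
// spark.conf.set("spark.sql.hive.convertMetastoreParquet.excludedTables", "db.tableA")
{code}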
[jira] [Created] (SPARK-27173) For hive parquet table, codecs()
liuxian created SPARK-27173: --- Summary: For hive parquet table, codecs() Key: SPARK-27173 URL: https://issues.apache.org/jira/browse/SPARK-27173 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: liuxian
[jira] [Updated] (SPARK-27173) For hive parquet table, codecs (lz4, brotli, zstd) are not available
[ https://issues.apache.org/jira/browse/SPARK-27173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-27173: Description: From _parquet.hadoop.metadata.CompressionCodecName_ (parquet-hadoop-bundle-1.6.0.jar), we can see that for a Hive parquet table only *snappy*, *gzip* and *lzo* are supported (was: We can parquet-hadoop-bundle-1.6.0.jar parquet.hadoop.metadata.CompressionCodecName) > For hive parquet table, codecs (lz4, brotli, zstd) are not available > --- > > Key: SPARK-27173 > URL: https://issues.apache.org/jira/browse/SPARK-27173 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > From _parquet.hadoop.metadata.CompressionCodecName_ (parquet-hadoop-bundle-1.6.0.jar), we can see that for a Hive parquet table only *snappy*, *gzip* > and *lzo* are supported
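A quick way to confirm what the description says, assuming parquet-hadoop-bundle-1.6.0.jar is on the classpath (note the pre-rename `parquet.*` package in that version):

{code:scala}
import parquet.hadoop.metadata.CompressionCodecName

// In parquet-hadoop-bundle-1.6.0 the codec enum only carries UNCOMPRESSED,
// SNAPPY, GZIP and LZO, so lz4/brotli/zstd cannot be resolved for the Hive
// serde write path.
CompressionCodecName.values().foreach(c => println(c.name()))
{code}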
[jira] [Updated] (SPARK-27173) For hive parquet table, codecs (lz4, brotli, zstd) are not available
[ https://issues.apache.org/jira/browse/SPARK-27173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-27173: Description: We can parquet-hadoop-bundle-1.6.0.jar parquet.hadoop.metadata.CompressionCodecName > For hive parquet table, codecs (lz4, brotli, zstd) are not available > --- > > Key: SPARK-27173 > URL: https://issues.apache.org/jira/browse/SPARK-27173 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > We can parquet-hadoop-bundle-1.6.0.jar > parquet.hadoop.metadata.CompressionCodecName
[jira] [Updated] (SPARK-27173) For hive parquet table, codecs (lz4, brotli, zstd) are not available
[ https://issues.apache.org/jira/browse/SPARK-27173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-27173: Summary: For hive parquet table, codecs (lz4, brotli, zstd) are not available (was: For hive parquet table, codecs()) > For hive parquet table, codecs (lz4, brotli, zstd) are not available > --- > > Key: SPARK-27173 > URL: https://issues.apache.org/jira/browse/SPARK-27173 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor >
[jira] [Created] (SPARK-27083) Add a config to control subqueryReuse
liuxian created SPARK-27083: --- Summary: Add a config to control subqueryReuse Key: SPARK-27083 URL: https://issues.apache.org/jira/browse/SPARK-27083 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: liuxian Subquery Reuse and Exchange Reuse are not the same feature: if we want to reuse exchanges but not subqueries, that cannot be done with only one configuration. So I think we should add a new configuration to control subquery reuse.
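A sketch of the gap, assuming an active SparkSession named `spark`; exchange reuse already has its own (internal) switch, while the subquery switch shown is a hypothetical name for what this issue asks for.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Exchange reuse can already be toggled on its own:
spark.conf.set("spark.sql.exchange.reuse", "false")

// The separate subquery-reuse switch proposed here might look like
// (hypothetical key, not confirmed by this issue):
// spark.conf.set("spark.sql.subquery.reuse", "true")
{code}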
[jira] [Updated] (SPARK-27056) Remove `start-shuffle-service.sh`
[ https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-27056: Description: _start-shuffle-service.sh_ was only used by Mesos before _start-mesos-shuffle-service.sh_. Obviously, _start-mesos-shuffle-service.sh_ solves some problems and is better than _start-shuffle-service.sh_. So now we should delete _start-shuffle-service.sh_ so that users don't keep using it. was: _start-shuffle-service.sh_ was only used by Mesos before _start-mesos-shuffle-service.sh_. Obviously, _start-mesos-shuffle-service.sh_ solves some problems and is better than start-shuffle-service.sh. So now we should delete _start-shuffle-service.sh_ so that users don't keep using it. > Remove `start-shuffle-service.sh` > -- > > Key: SPARK-27056 > URL: https://issues.apache.org/jira/browse/SPARK-27056 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > _start-shuffle-service.sh_ was only used by Mesos before > _start-mesos-shuffle-service.sh_. > Obviously, _start-mesos-shuffle-service.sh_ solves some problems and is > better than _start-shuffle-service.sh_. > So now we should delete _start-shuffle-service.sh_ so that users don't keep using it.
[jira] [Created] (SPARK-27056) Remove `start-shuffle-service.sh`
liuxian created SPARK-27056: --- Summary: Remove `start-shuffle-service.sh` Key: SPARK-27056 URL: https://issues.apache.org/jira/browse/SPARK-27056 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 3.0.0 Reporter: liuxian _start-shuffle-service.sh_ was only used by Mesos before _start-mesos-shuffle-service.sh_. Obviously, _start-mesos-shuffle-service.sh_ solves some problems and is better than start-shuffle-service.sh. So now we should delete _start-shuffle-service.sh_ so that users don't keep using it.
[jira] [Resolved] (SPARK-25574) Add an option `keepQuotes` for parsing csv file
[ https://issues.apache.org/jira/browse/SPARK-25574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian resolved SPARK-25574. - Resolution: Invalid > Add an option `keepQuotes` for parsing csv file > > > Key: SPARK-25574 > URL: https://issues.apache.org/jira/browse/SPARK-25574 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > In our project, when we read the CSV file, we hope to keep the quotes. > For example: > We have such a record in the CSV file: > *ab,cc,,"c,ddd"* > We hope it displays like this: > |_c0|_c1|_c2| _c3| > | ab|cc |null|*"c,ddd"*| > > Not like this: > |_c0|_c1|_c2| _c3| > | ab|cc |null|c,ddd|
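A sketch of the request, assuming an active SparkSession named `spark` and a hypothetical `data.csv` containing the record above; the option shown was never added, since the issue was resolved as Invalid.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Current behavior: the quote character is consumed during parsing, so the
// last field of ab,cc,,"c,ddd" comes back as c,ddd rather than "c,ddd".
spark.read.csv("data.csv").show()

// The option requested here would read along these lines (hypothetical):
// spark.read.option("keepQuotes", "true").csv("data.csv")
{code}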
[jira] [Updated] (SPARK-26353) Add typed aggregate functions(max/min) to the example module
[ https://issues.apache.org/jira/browse/SPARK-26353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-26353: Summary: Add typed aggregate functions(max/min) to the example module (was: Add typed aggregate functions:max&) > Add typed aggregate functions(max/min) to the example module > > > Key: SPARK-26353 > URL: https://issues.apache.org/jira/browse/SPARK-26353 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > For Dataset API, aggregate functions:max& are not implemented in a > type-safe way at the moment.
[jira] [Updated] (SPARK-26353) Add typed aggregate functions(max/min) to the example module
[ https://issues.apache.org/jira/browse/SPARK-26353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-26353: Description: Add typed aggregate functions(max/min) to the example module. (was: For Dataset API, aggregate functions:max& are not implemented in a type-safe way at the moment.) > Add typed aggregate functions(max/min) to the example module > > > Key: SPARK-26353 > URL: https://issues.apache.org/jira/browse/SPARK-26353 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > Add typed aggregate functions(max/min) to the example module.
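One plausible shape for such an example, in the spirit of the typed sum/avg/count examples already in the examples module (this is a sketch, not the code that was eventually merged):

{code:scala}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

// A minimal type-safe max over Long inputs for the Dataset API.
object TypedMax extends Aggregator[Long, Long, Long] {
  def zero: Long = Long.MinValue                             // identity for max
  def reduce(buffer: Long, input: Long): Long = math.max(buffer, input)
  def merge(b1: Long, b2: Long): Long = math.max(b1, b2)
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
{code}

A typed min is the mirror image, with `Long.MaxValue` as the zero and `math.min` in `reduce`/`merge`.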
[jira] [Resolved] (SPARK-26793) Remove spark.shuffle.manager
[ https://issues.apache.org/jira/browse/SPARK-26793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian resolved SPARK-26793. - Resolution: Invalid > Remove spark.shuffle.manager > > > Key: SPARK-26793 > URL: https://issues.apache.org/jira/browse/SPARK-26793 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > Currently, `ShuffleManager` always uses `SortShuffleManager`, so I think this > configuration can be removed.
[jira] [Created] (SPARK-26793) Remove spark.shuffle.manager
liuxian created SPARK-26793: --- Summary: Remove spark.shuffle.manager Key: SPARK-26793 URL: https://issues.apache.org/jira/browse/SPARK-26793 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: liuxian Currently, `ShuffleManager` always uses `SortShuffleManager`, so I think this configuration can be removed.
[jira] [Created] (SPARK-26780) Improve shuffle read using ReadAheadInputStream
liuxian created SPARK-26780: --- Summary: Improve shuffle read using ReadAheadInputStream Key: SPARK-26780 URL: https://issues.apache.org/jira/browse/SPARK-26780 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.0.0 Reporter: liuxian Using _ReadAheadInputStream_ to improve shuffle read performance: _ReadAheadInputStream_ can reduce CPU utilization with almost no performance regression.
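A sketch of how Spark's existing ReadAheadInputStream would wrap a shuffle input stream, assuming the two-argument constructor in current Spark; the file name and 1 MiB buffer size are illustrative.

{code:scala}
import java.io.FileInputStream
import org.apache.spark.io.ReadAheadInputStream

// A background thread fills a second buffer while the caller drains the
// first, hiding disk latency behind ongoing computation.
val in = new ReadAheadInputStream(
  new FileInputStream("shuffle_0_0_0.data"), 1024 * 1024)
{code}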
[jira] [Updated] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory
[ https://issues.apache.org/jira/browse/SPARK-23516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23516: Description: Now `StaticMemoryManager` mode has been removed. And for `UnifiedMemoryManager`, unroll memory is also storage memory, so I think it is unnecessary to actually release unroll memory and then acquire storage memory again. was: In fact, unroll memory is also storage memory, so I think it is unnecessary to actually release unroll memory and then acquire storage memory again. > I think it is unnecessary to transfer unroll memory to storage memory > -- > > Key: SPARK-23516 > URL: https://issues.apache.org/jira/browse/SPARK-23516 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > Now `StaticMemoryManager` mode has been removed. > And for `UnifiedMemoryManager`, unroll memory is also storage memory, so I > think it is unnecessary to actually release unroll memory and then acquire > storage memory again.
[jira] [Reopened] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory
[ https://issues.apache.org/jira/browse/SPARK-23516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian reopened SPARK-23516: - > I think it is unnecessary to transfer unroll memory to storage memory > -- > > Key: SPARK-23516 > URL: https://issues.apache.org/jira/browse/SPARK-23516 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > Now _StaticMemoryManager_ mode has been removed. > And for _UnifiedMemoryManager_, unroll memory is also storage memory, so I > think it is unnecessary to actually release unroll memory and then acquire > storage memory again.
[jira] [Updated] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory
[ https://issues.apache.org/jira/browse/SPARK-23516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23516: Description: Now _StaticMemoryManager_ mode has been removed. And for _UnifiedMemoryManager_, unroll memory is also storage memory, so I think it is unnecessary to actually release unroll memory and then acquire storage memory again. was: Now `StaticMemoryManager` mode has been removed. And for `UnifiedMemoryManager`, unroll memory is also storage memory, so I think it is unnecessary to actually release unroll memory and then acquire storage memory again. > I think it is unnecessary to transfer unroll memory to storage memory > -- > > Key: SPARK-23516 > URL: https://issues.apache.org/jira/browse/SPARK-23516 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > Now _StaticMemoryManager_ mode has been removed. > And for _UnifiedMemoryManager_, unroll memory is also storage memory, so I > think it is unnecessary to actually release unroll memory and then acquire > storage memory again.
[jira] [Updated] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory
[ https://issues.apache.org/jira/browse/SPARK-23516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23516: Affects Version/s: (was: 2.3.0) 3.0.0 > I think it is unnecessary to transfer unroll memory to storage memory > -- > > Key: SPARK-23516 > URL: https://issues.apache.org/jira/browse/SPARK-23516 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > In fact, unroll memory is also storage memory, so I think it is unnecessary to > actually release unroll memory and then acquire storage memory again.
[jira] [Created] (SPARK-26621) Use ConfigEntry for hardcoded configs for shuffle categories.
liuxian created SPARK-26621: --- Summary: Use ConfigEntry for hardcoded configs for shuffle categories. Key: SPARK-26621 URL: https://issues.apache.org/jira/browse/SPARK-26621 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: liuxian Make the following hardcoded configs use ConfigEntry. {{spark.shuffle}}
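A sketch of the migration pattern, using Spark's internal ConfigBuilder API; the entry shown (`spark.shuffle.compress`, default true) is one illustrative member of the `spark.shuffle` family.

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder

// A hardcoded conf.getBoolean("spark.shuffle.compress", true) lookup becomes
// a typed, documented entry defined once and referenced everywhere.
val SHUFFLE_COMPRESS = ConfigBuilder("spark.shuffle.compress")
  .doc("Whether to compress shuffle output files.")
  .booleanConf
  .createWithDefault(true)
{code}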
[jira] [Created] (SPARK-26353) Add typed aggregate functions:max&
liuxian created SPARK-26353: --- Summary: Add typed aggregate functions:max& Key: SPARK-26353 URL: https://issues.apache.org/jira/browse/SPARK-26353 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: liuxian For Dataset API, aggregate functions:max& are not implemented in a type-safe way at the moment.
[jira] [Created] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
liuxian created SPARK-26300: --- Summary: The `checkForStreaming` method may be called twice in `createQuery` Key: SPARK-26300 URL: https://issues.apache.org/jira/browse/SPARK-26300 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.0 Reporter: liuxian If {{checkForContinuous}} is called ( {{checkForStreaming}} is called in {{checkForContinuous}} ), the {{checkForStreaming}} method will be called twice in {{createQuery}}; this is not necessary, and the {{checkForStreaming}} method has a lot of statements, so it's better to remove one of the calls.
[jira] [Closed] (SPARK-26264) It is better to add @transient to field 'locs' for class `ResultTask`.
[ https://issues.apache.org/jira/browse/SPARK-26264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian closed SPARK-26264. --- > It is better to add @transient to field 'locs' for class `ResultTask`. > --- > > Key: SPARK-26264 > URL: https://issues.apache.org/jira/browse/SPARK-26264 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > The field 'locs' is only used on the driver side for class `ResultTask`, so it > does not need to be serialized when sending the `ResultTask` to the executor. > Although it's not very big, it is sent very frequently, so we can add `@transient` to > it like `ShuffleMapTask` >
[jira] [Resolved] (SPARK-26264) It is better to add @transient to field 'locs' for class `ResultTask`.
[ https://issues.apache.org/jira/browse/SPARK-26264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian resolved SPARK-26264. - Resolution: Not A Problem > It is better to add @transient to field 'locs' for class `ResultTask`. > --- > > Key: SPARK-26264 > URL: https://issues.apache.org/jira/browse/SPARK-26264 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > The field 'locs' is only used on the driver side for class `ResultTask`, so it > does not need to be serialized when sending the `ResultTask` to the executor. > Although it's not very big, it is sent very frequently, so we can add `@transient` to > it like `ShuffleMapTask` >
[jira] [Created] (SPARK-26264) It is better to add @transient to field 'locs' for class `ResultTask`.
liuxian created SPARK-26264: --- Summary: It is better to add @transient to field 'locs' for class `ResultTask`. Key: SPARK-26264 URL: https://issues.apache.org/jira/browse/SPARK-26264 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: liuxian The field 'locs' is only used on the driver side for class `ResultTask`, so it does not need to be serialized when sending the `ResultTask` to the executor. Although it's not very big, it is sent very frequently, so we can add `@transient` to it like `ShuffleMapTask`
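A minimal, self-contained demonstration of the effect (the `Task` class here is a stand-in, not Spark's `ResultTask`): a `@transient` field is skipped by Java serialization, which is exactly what the issue proposes for `locs`.

{code:scala}
import java.io._

// The @transient field is not written by Java serialization, so it comes
// back null after a round trip, while the other field survives.
case class Task(@transient locs: Seq[String], partitionId: Int)

val out = new ByteArrayOutputStream()
new ObjectOutputStream(out).writeObject(Task(Seq("host1"), 0))
val back = new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
  .readObject().asInstanceOf[Task]

assert(back.locs == null)      // dropped: never serialized
assert(back.partitionId == 0)  // preserved
{code}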
[jira] [Updated] (SPARK-25729) It is better to replace `minPartitions` with `defaultParallelism`, when `minPartitions` is less than `defaultParallelism`
[ https://issues.apache.org/jira/browse/SPARK-25729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25729: Description: For `WholeTextFileInputFormat`, when `minPartitions` is less than `defaultParallelism`, it is better to replace `minPartitions` with `defaultParallelism`, because this can make better use of resources and improve parallelism. was: In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`, it is better to replace `minPartitions` with `defaultParallelism`, because this can make better use of resources and improve parallelism. > It is better to replace `minPartitions` with `defaultParallelism`, when > `minPartitions` is less than `defaultParallelism` > -- > > Key: SPARK-25729 > URL: https://issues.apache.org/jira/browse/SPARK-25729 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > For `WholeTextFileInputFormat`, when `minPartitions` is less than > `defaultParallelism`, > it is better to replace `minPartitions` with `defaultParallelism`, because > this can make better use of resources and improve parallelism.
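The proposed rule reduces to a one-line clamp; a sketch with assumed names, not the actual patch:

{code:scala}
// Never let a small minPartitions cap parallelism below the cluster default.
def effectiveMinPartitions(minPartitions: Int, defaultParallelism: Int): Int =
  math.max(minPartitions, defaultParallelism)

effectiveMinPartitions(2, 16) // -> 16: use all available slots
{code}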
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat and OrcFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Summary: The instanceof FileSplit is redundant for ParquetFileFormat and OrcFileFormat (was: The instanceof FileSplit is redundant for ParquetFileFormat) > The instanceof FileSplit is redundant for ParquetFileFormat and OrcFileFormat > -- > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instanceof FileSplit is redundant in the ParquetFileFormat and > {{hive\orc\OrcFileFormat}} classes.
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instanceof FileSplit is redundant in the ParquetFileFormat and {{hive\orc\OrcFileFormat}} classes. (was: The instanceof FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class.) > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instanceof FileSplit is redundant in the ParquetFileFormat and > {{hive\orc\OrcFileFormat}} classes.
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instance of FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class. was: The instanceof FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class. > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instance of FileSplit is redundant for > buildReaderWithPartitionValues in the ParquetFileFormat class.
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instanceof FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class. was: The instanceof FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class. > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instanceof FileSplit is redundant for > buildReaderWithPartitionValues in the ParquetFileFormat class.
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instanceof FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class. was: The instanceof FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class. > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instanceof FileSplit is redundant for > buildReaderWithPartitionValues in the ParquetFileFormat class.
[jira] [Updated] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-25806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25806: Description: The instanceof FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class. (was: The instanceof FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class.) > The instanceof FileSplit is redundant for ParquetFileFormat > > > Key: SPARK-25806 > URL: https://issues.apache.org/jira/browse/SPARK-25806 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Trivial > > The instanceof FileSplit is redundant for > buildReaderWithPartitionValues in the ParquetFileFormat class.
[jira] [Created] (SPARK-25806) The instanceof FileSplit is redundant for ParquetFileFormat
liuxian created SPARK-25806: --- Summary: The instanceof FileSplit is redundant for ParquetFileFormat Key: SPARK-25806 URL: https://issues.apache.org/jira/browse/SPARK-25806 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 3.0.0 Reporter: liuxian The instanceof FileSplit is redundant for buildReaderWithPartitionValues in the ParquetFileFormat class.
[jira] [Updated] (SPARK-25786) If ByteBuffer.hasArray is false, it will throw UnsupportedOperationException for Kryo
[ https://issues.apache.org/jira/browse/SPARK-25786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25786: Environment: (was: For Kryo's `deserialize`, the type of the input parameter is ByteBuffer; if it is not backed by an accessible byte array, it will throw UnsupportedOperationException Exception Info: java.lang.UnsupportedOperationException was thrown. java.lang.UnsupportedOperationException at java.nio.ByteBuffer.array(ByteBuffer.java:994) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362) ) Description: For Kryo's `deserialize`, the type of the input parameter is ByteBuffer; if it is not backed by an accessible byte array, it will throw UnsupportedOperationException Exception Info: java.lang.UnsupportedOperationException was thrown. java.lang.UnsupportedOperationException at java.nio.ByteBuffer.array(ByteBuffer.java:994) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362) > If ByteBuffer.hasArray is false, it will throw > UnsupportedOperationException for Kryo > -- > > Key: SPARK-25786 > URL: https://issues.apache.org/jira/browse/SPARK-25786 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Major > > For Kryo's `deserialize`, the type of the input parameter is ByteBuffer; if it is not backed by an accessible byte array, it will throw UnsupportedOperationException > Exception Info: > java.lang.UnsupportedOperationException was thrown. > java.lang.UnsupportedOperationException > at java.nio.ByteBuffer.array(ByteBuffer.java:994) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
[jira] [Created] (SPARK-25786) If ByteBuffer.hasArray is false, it will throw UnsupportedOperationException for Kryo
liuxian created SPARK-25786: --- Summary: If ByteBuffer.hasArray is false, it will throw UnsupportedOperationException for Kryo Key: SPARK-25786 URL: https://issues.apache.org/jira/browse/SPARK-25786 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Environment: For Kryo's `deserialize`, the type of the input parameter is ByteBuffer; if it is not backed by an accessible byte array, it will throw UnsupportedOperationException Exception Info: java.lang.UnsupportedOperationException was thrown. java.lang.UnsupportedOperationException at java.nio.ByteBuffer.array(ByteBuffer.java:994) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362) Reporter: liuxian
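A sketch of a defensive fix: `ByteBuffer.array()` only works for heap buffers, so a direct buffer's bytes have to be copied out before Kryo can read them (the helper name is ours, not Spark's).

{code:scala}
import java.nio.ByteBuffer

// Heap buffers expose their backing array directly; direct buffers throw
// UnsupportedOperationException from array(), so copy their bytes instead.
def toBytes(buf: ByteBuffer): Array[Byte] = {
  if (buf.hasArray) {
    buf.array()
  } else {
    val bytes = new Array[Byte](buf.remaining())
    buf.duplicate().get(bytes) // duplicate() keeps the caller's position intact
    bytes
  }
}
{code}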
[jira] [Created] (SPARK-25780) Scheduling the tasks which have no higher level locality first
liuxian created SPARK-25780: --- Summary: Scheduling the tasks which have no higher level locality first Key: SPARK-25780 URL: https://issues.apache.org/jira/browse/SPARK-25780 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 3.0.0 Reporter: liuxian For example: An application has two executors: (exec1, host1), (exec2, host2). And 3 tasks with locality: \{task0, Seq(TaskLocation("host1", "exec1"))}, \{task1, Seq(TaskLocation("host1", "exec1"), TaskLocation("host2"))}, \{task2, Seq(TaskLocation("host2"))} If task0 is running on exec1, when `allowedLocality` is NODE_LOCAL for exec2, it is better to schedule task2 first, not task1, because task1 may be scheduled to exec1 later.
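A hypothetical, self-contained sketch of the proposed tie-break (types and names are ours, not the scheduler's): among candidates runnable at the allowed level, prefer those with no more-local alternative elsewhere.

{code:scala}
// task1 still has a PROCESS_LOCAL option on exec1, so when exec2 offers a
// NODE_LOCAL slot, task2 (which can only run on host2) should go first.
case class Candidate(name: String, hasHigherLocalityElsewhere: Boolean)

def pick(candidates: Seq[Candidate]): Option[Candidate] = {
  val (flexible, pinnedHere) = candidates.partition(_.hasHigherLocalityElsewhere)
  pinnedHere.headOption.orElse(flexible.headOption)
}

pick(Seq(Candidate("task1", true), Candidate("task2", false))) // -> task2
{code}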
[jira] [Updated] (SPARK-25776) The disk write buffer size must be greater than 12.
[ https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25776: Description: In {{UnsafeSorterSpillWriter.java}}, when we write a record to a spill file with {{void write(Object baseObject, long baseOffset, int recordLength, long keyPrefix)}}, {{recordLength}} and {{keyPrefix}} will be written to the disk write buffer first, and these will take 12 bytes, so the disk write buffer size must be greater than 12. If {{diskWriteBufferSize}} is 10, it will print this exception info: _java.lang.ArrayIndexOutOfBoundsException: 10_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer (UnsafeSorterSpillWriter.java:91)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_ _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_ was: In {{UnsafeSorterSpillWriter.java}}, when we write a record to a spill file with {{void write(Object baseObject, long baseOffset, int recordLength, long keyPrefix)}}, {{recordLength}} and {{keyPrefix}} will be written to the disk write buffer first, and these will take 12 bytes, so the disk write buffer size must be greater than 12. If {{diskWriteBufferSize}} is 10, it will print this exception info: _java.lang.ArrayIndexOutOfBoundsException: 10_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer (UnsafeSorterSpillWriter.java:91)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_ _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_ > The disk write buffer size must be greater than 12. > --- > > Key: SPARK-25776 > URL: https://issues.apache.org/jira/browse/SPARK-25776 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > In {{UnsafeSorterSpillWriter.java}}, when we write a record to a spill file > with {{void write(Object baseObject, long baseOffset, int recordLength, long keyPrefix)}}, > {{recordLength}} and {{keyPrefix}} will be written to the disk write buffer first, and these will take 12 > bytes, so the disk write buffer size must be greater than 12.
> If {{diskWriteBufferSize}} is 10, it will print this exception info: > _java.lang.ArrayIndexOutOfBoundsException: 10_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer > (UnsafeSorterSpillWriter.java:91)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_ > _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_
[jira] [Updated] (SPARK-25776) The disk write buffer size must be greater than 12.
[ https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25776: Description: In {{UnsafeSorterSpillWriter.java}}, when we write a record to a spill file with {{void write(Object baseObject, long baseOffset, int recordLength, long keyPrefix)}}, {{recordLength}} and {{keyPrefix}} will be written to the disk write buffer first, and these will take 12 bytes, so the disk write buffer size must be greater than 12. If {{diskWriteBufferSize}} is 10, it will print this exception info: _java.lang.ArrayIndexOutOfBoundsException: 10_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer (UnsafeSorterSpillWriter.java:91)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_ _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_ was: In {{UnsafeSorterSpillWriter.java}}, when we write a record to a spill file with {{void write(Object baseObject, long baseOffset, int recordLength, long keyPrefix)}}, {{recordLength}} and {{keyPrefix}} will be written to the disk write buffer first, and these will take 12 bytes, so the disk write buffer size must be greater than 12. If {{diskWriteBufferSize}} is 10, it will print this exception info: _java.lang.ArrayIndexOutOfBoundsException: 10_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer (UnsafeSorterSpillWriter.java:91)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_ _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_ > The disk write buffer size must be greater than 12. > --- > > Key: SPARK-25776 > URL: https://issues.apache.org/jira/browse/SPARK-25776 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > In {{UnsafeSorterSpillWriter.java}}, when we write a record to a spill file > with {{void write(Object baseObject, long baseOffset, int recordLength, long keyPrefix)}}, > {{recordLength}} and {{keyPrefix}} will be written to the disk write buffer first, and these will take 12 > bytes, so the disk write buffer size must be greater than 12.
> If {{diskWriteBufferSize}} is 10, it will print this exception info: > _java.lang.ArrayIndexOutOfBoundsException: 10_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer > (UnsafeSorterSpillWriter.java:91)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_ > _at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_ > _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_
[jira] [Created] (SPARK-25776) The disk write buffer size must be greater than 12.
liuxian created SPARK-25776: --- Summary: The disk write buffer size must be greater than 12. Key: SPARK-25776 URL: https://issues.apache.org/jira/browse/SPARK-25776 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: liuxian In {{UnsafeSorterSpillWriter.java}}, when we write a record to a spill file with {{void write(Object baseObject, long baseOffset, int recordLength, long keyPrefix)}}, {{recordLength}} and {{keyPrefix}} will be written to the disk write buffer first, and these will take 12 bytes, so the disk write buffer size must be greater than 12. If {{diskWriteBufferSize}} is 10, it will print this exception info: _java.lang.ArrayIndexOutOfBoundsException: 10_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer (UnsafeSorterSpillWriter.java:91)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_ _at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_ _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_
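Where the 12 comes from, as a worked sketch: the spill writer stages a 4-byte record length and an 8-byte key prefix in the buffer before any record bytes, so any `diskWriteBufferSize` below 12 overruns the array.

{code:scala}
// 4 bytes for the int record length + 8 bytes for the long key prefix.
val recordLengthBytes = java.lang.Integer.BYTES // 4
val keyPrefixBytes    = java.lang.Long.BYTES    // 8
val minDiskWriteBufferSize = recordLengthBytes + keyPrefixBytes // 12
{code}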
[jira] [Updated] (SPARK-25753) binaryFiles broken for small files
[ https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25753: Description: _{{StreamFileInputFormat}}_ and _{{WholeTextFileInputFormat}}_ (https://issues.apache.org/jira/browse/SPARK-24610) have the same problem: for small-sized files, the maxSplitSize computed by _{{StreamFileInputFormat}}_ is way smaller than the default or commonly used split size of 64/128M, and Spark throws an exception while trying to read them. {{Exception info:}} _{{Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304 at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201) at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}_ was: {{StreamFileInputFormat}} and {{WholeTextFileInputFormat}} (https://issues.apache.org/jira/browse/SPARK-24610) have the same problem: for small-sized files, the maxSplitSize computed by {{StreamFileInputFormat}} is way smaller than the default or commonly used split size of 64/128M, and Spark throws an exception while trying to read them. {{Exception info: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304 at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201) at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}} > binaryFiles broken for small files > -- > > Key: SPARK-25753 > URL: https://issues.apache.org/jira/browse/SPARK-25753 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: liuxian >Priority: Minor > > _{{StreamFileInputFormat}}_ and _{{WholeTextFileInputFormat}}_ (https://issues.apache.org/jira/browse/SPARK-24610) > have the same problem: for small-sized files, the maxSplitSize computed by > _{{StreamFileInputFormat}}_ is way smaller than the default or commonly > used split size of 64/128M, and Spark throws an exception while trying to read > them.
> {{Exception info:}} > _{{Minimum split size pernode 5123456 cannot be larger than maximum split > size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot > be larger than maximum split size 4194304 at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201) at > org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at > scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at > org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}_
[jira] [Created] (SPARK-25753) binaryFiles broken for small files
liuxian created SPARK-25753: --- Summary: binaryFiles broken for small files Key: SPARK-25753 URL: https://issues.apache.org/jira/browse/SPARK-25753 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 3.0.0 Reporter: liuxian {{StreamFileInputFormat}} and {{WholeTextFileInputFormat}} (https://issues.apache.org/jira/browse/SPARK-24610) have the same problem: for small files, the maxSplitSize computed by {{StreamFileInputFormat}} is far smaller than the default or commonly used split size of 64/128M, and Spark throws an exception while trying to read them. {{Exception info: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304 at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201) at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
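A hedged reproduction sketch (the input path is a placeholder; the split-size arithmetic is paraphrased from the report): {{StreamFileInputFormat}} derives maxSplitSize from the total input bytes, so a directory of tiny files can yield a maximum split size below the minimum per-node split size, and {{getSplits()}} fails.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("binaryFilesRepro").getOrCreate()
    val sc = spark.sparkContext

    // Assumed path containing many small binary files; count() forces
    // partition computation, which calls CombineFileInputFormat.getSplits.
    val rdd = sc.binaryFiles("/tmp/many-small-files")
    println(rdd.count())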
[jira] [Updated] (SPARK-25729) It is better to replace `minPartitions` with `defaultParallelism` , when `minPartitions` is less than `defaultParallelism`
[ https://issues.apache.org/jira/browse/SPARK-25729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25729: Description: In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`, it is better to replace `minPartitions` with `defaultParallelism`, because this makes better use of resources and improves parallelism. was: In ‘WholeTextFileRDD’,when `minPartitions` is less than `defaultParallelism`, it is better to replace `minPartitions` with `defaultParallelism` , because this can make better use of resources and improve concurrency. > It is better to replace `minPartitions` with `defaultParallelism` , when > `minPartitions` is less than `defaultParallelism` > -- > > Key: SPARK-25729 > URL: https://issues.apache.org/jira/browse/SPARK-25729 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`, > it is better to replace `minPartitions` with `defaultParallelism`, because > this makes better use of resources and improves parallelism. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25729) It is better to replace `minPartitions` with `defaultParallelism` , when `minPartitions` is less than `defaultParallelism`
liuxian created SPARK-25729: --- Summary: It is better to replace `minPartitions` with `defaultParallelism` , when `minPartitions` is less than `defaultParallelism` Key: SPARK-25729 URL: https://issues.apache.org/jira/browse/SPARK-25729 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: liuxian In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`, it is better to replace `minPartitions` with `defaultParallelism`, because this can make better use of resources and improve concurrency. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
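A minimal sketch of the proposed behavior (the helper name is invented for illustration; it is not the actual patch): when the caller's `minPartitions` is below `defaultParallelism`, take the larger value so small inputs still fan out across the cluster.

    // Invented helper illustrating the suggestion.
    def effectiveMinPartitions(minPartitions: Int, defaultParallelism: Int): Int =
      math.max(minPartitions, defaultParallelism)

    // e.g. effectiveMinPartitions(2, 48) == 48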
[jira] [Updated] (SPARK-25674) If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated
[ https://issues.apache.org/jira/browse/SPARK-25674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25674: Priority: Minor (was: Trivial) > If the records are incremented by more than 1 at a time,the number of bytes > might rarely ever get updated > - > > Key: SPARK-25674 > URL: https://issues.apache.org/jira/browse/SPARK-25674 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > If the records are incremented by more than 1 at a time, the number of bytes > might rarely ever get updated in `FileScanRDD.scala`, because it might skip > over the count that is an exact multiple of > UPDATE_INPUT_METRICS_INTERVAL_RECORDS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25674) If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated
liuxian created SPARK-25674: --- Summary: If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated Key: SPARK-25674 URL: https://issues.apache.org/jira/browse/SPARK-25674 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: liuxian If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated in `FileScanRDD.scala`, because it might skip over the count that is an exact multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
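An illustration of the pitfall (the interval value and method shape are assumed; only the exact-multiple check comes from the description): if the counter advances by more than one per call, it can step over every exact multiple of the interval, so the byte counter never refreshes.

    // Assumed interval; Spark's constant plays the same role.
    val UPDATE_INPUT_METRICS_INTERVAL_RECORDS = 1000L

    var recordsRead = 0L
    def incRecordsRead(n: Long): Unit = {
      recordsRead += n
      // Bug shape: with n > 1 this equality can be missed forever,
      // e.g. steps of 4 jump from 998 to 1002, straight over 1000.
      if (recordsRead % UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
        // refresh the bytes-read metric here
      }
    }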
[jira] [Updated] (SPARK-25574) Add an option `keepQuotes` for parsing csv file
[ https://issues.apache.org/jira/browse/SPARK-25574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25574: Description: In our project, when we read the CSV file, we hope to keep quotes. For example: We have such a record in the CSV file: *ab,cc,,"c,ddd"* We hope it displays like this: |_c0|_c1|_c2| _c3| | ab|cc |null|*"c,ddd"*| Not like this: |_c0|_c1|_c2| _c3| | ab|cc |null |c,ddd| +-+--++-+ was: In our project, when we read the CSV file, we hope to keep quotes. For example: We have such a record in the CSV file: *ab,cc,,"c,ddd"* We hope it displays like this: ++---++---+ | _c0|_c1| _c2| _c3| ++---++---+ | ab| cc|null|*"c,ddd"*| not like this: ++---++-+ | _c0|_c1| _c2| _c3| ++---++-+ | ab| cc|null|c,ddd| ++---++-+ > Add an option `keepQuotes` for parsing csv file > > > Key: SPARK-25574 > URL: https://issues.apache.org/jira/browse/SPARK-25574 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > In our project, when we read the CSV file, we hope to keep quotes. > For example: > We have such a record in the CSV file: > *ab,cc,,"c,ddd"* > We hope it displays like this: > |_c0|_c1|_c2| _c3| > | ab|cc |null|*"c,ddd"*| > > Not like this: > |_c0|_c1|_c2| _c3| > | ab|cc |null |c,ddd| > +-+--++-+ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25574) Add an option `keepQuotes` for parsing csv file
[ https://issues.apache.org/jira/browse/SPARK-25574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25574: Description: In our project, when we read the CSV file, we hope to keep quotes. For example: We have such a record in the CSV file: *ab,cc,,"c,ddd"* We hope it displays like this: |_c0|_c1|_c2| _c3| | ab|cc |null|*"c,ddd"*| Not like this: |_c0|_c1|_c2| _c3| | ab|cc |null|c,ddd| was: In our project, when we read the CSV file, we hope to keep quotes. For example: We have such a record in the CSV file: *ab,cc,,"c,ddd"* We hope it displays like this: |_c0|_c1|_c2| _c3| | ab|cc |null|*"c,ddd"*| Not like this: |_c0|_c1|_c2| _c3| | ab|cc |null |c,ddd| +-+--++-+ > Add an option `keepQuotes` for parsing csv file > > > Key: SPARK-25574 > URL: https://issues.apache.org/jira/browse/SPARK-25574 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > In our project, when we read the CSV file, we hope to keep quotes. > For example: > We have such a record in the CSV file: > *ab,cc,,"c,ddd"* > We hope it displays like this: > |_c0|_c1|_c2| _c3| > | ab|cc |null|*"c,ddd"*| > > Not like this: > |_c0|_c1|_c2| _c3| > | ab|cc |null|c,ddd| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25574) Add an option `keepQuotes` for parsing csv file
liuxian created SPARK-25574: --- Summary: Add an option `keepQuotes` for parsing csv file Key: SPARK-25574 URL: https://issues.apache.org/jira/browse/SPARK-25574 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: liuxian In our project, when we read the CSV file, we hope to keep quotes. For example: We have such a record in the CSV file: *ab,cc,,"c,ddd"* We hope it displays like this: ++---++---+ | _c0|_c1| _c2| _c3| ++---++---+ | ab| cc|null|*"c,ddd"*| not like this: ++---++-+ | _c0|_c1| _c2| _c3| ++---++-+ | ab| cc|null|c,ddd| ++---++-+ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
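A hypothetical usage sketch of the proposed option ({{keepQuotes}} does not exist in Spark at the time of this report; the name comes from the issue title, and the file path is a placeholder):

    // /tmp/sample.csv is assumed to contain the single line: ab,cc,,"c,ddd"
    val df = spark.read
      .option("keepQuotes", "true")  // proposed option: preserve surrounding quotes
      .csv("/tmp/sample.csv")
    df.show()
    // expected with keepQuotes=true: _c3 holds "c,ddd" including the quotes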
[jira] [Updated] (SPARK-25366) Zstd and brotli CompressionCodec are not supported for parquet files
[ https://issues.apache.org/jira/browse/SPARK-25366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25366: Summary: Zstd and brotli CompressionCodec are not supported for parquet files (was: Zstd and brotil CompressionCodec are not supported for parquet files) > Zstd and brotli CompressionCodec are not supported for parquet files > - > > Key: SPARK-25366 > URL: https://issues.apache.org/jira/browse/SPARK-25366 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class > org.apache.hadoop.io.compress.*BrotliCodec* was not found > at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235) > at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142) > at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206) > at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189) > at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153) > at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411) > at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) > at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37) > at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161) > > Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class > org.apache.hadoop.io.compress.*ZStandardCodec* was not found > at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235) > at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142) > at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206) > at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189) > at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153) > at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411) > at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) > at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37) > at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25366) Zstd and brotil CompressionCodec are not supported for parquet files
liuxian created SPARK-25366: --- Summary: Zstd and brotil CompressionCodec are not supported for parquet files Key: SPARK-25366 URL: https://issues.apache.org/jira/browse/SPARK-25366 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: liuxian Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.*BrotliCodec* was not found at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235) at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142) at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206) at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189) at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161) Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.*ZStandardCodec* was not found at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235) at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142) at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206) at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189) at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
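A short reproduction sketch (the output path is a placeholder): selecting one of these codecs makes parquet's CodecFactory look up a hadoop codec class that is not on the classpath, so the write fails as in the stack traces above.

    // Choose an unsupported codec, then trigger a parquet write.
    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
    spark.range(100).write.parquet("/tmp/zstd-test")  // assumed output path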
[jira] [Resolved] (SPARK-25356) Add Parquet block size (row group size) option to SparkSQL configuration
[ https://issues.apache.org/jira/browse/SPARK-25356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian resolved SPARK-25356. - Resolution: Invalid > Add Parquet block size (row group size) option to SparkSQL configuration > --- > > Key: SPARK-25356 > URL: https://issues.apache.org/jira/browse/SPARK-25356 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > I think we should make the Parquet block size (row group size) configurable when using the Parquet format. > For HDFS, `dfs.block.size` is configurable, and sometimes we want the Parquet > block size to be consistent with it. > Also, should `spark.sql.files.maxPartitionBytes` be kept > consistent with the Parquet block size when using the Parquet format? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25356) Add Parquet block size (row group size) option to SparkSQL configuration
liuxian created SPARK-25356: --- Summary: Add Parquet block size (row group size) option to SparkSQL configuration Key: SPARK-25356 URL: https://issues.apache.org/jira/browse/SPARK-25356 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: liuxian I think we should make the Parquet block size (row group size) configurable when using the Parquet format. For HDFS, `dfs.block.size` is configurable, and sometimes we want the Parquet block size to be consistent with it. Also, should `spark.sql.files.maxPartitionBytes` be kept consistent with the Parquet block size when using the Parquet format? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
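A workaround sketch of what a dedicated SparkSQL option would wrap (my understanding is that the parquet writer honors the hadoop property when passed as a write option; treat that, and the path, as assumptions):

    spark.range(1000)
      .write
      .option("parquet.block.size", (128 * 1024 * 1024).toString)  // 128 MB row groups
      .parquet("/tmp/rowgroup-test")  // assumed output path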
[jira] [Created] (SPARK-25300) Unified the configuration parameter `spark.shuffle.service.enabled`
liuxian created SPARK-25300: --- Summary: Unified the configuration parameter `spark.shuffle.service.enabled` Key: SPARK-25300 URL: https://issues.apache.org/jira/browse/SPARK-25300 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: liuxian The configuration parameter "spark.shuffle.service.enabled" is defined in `package.scala`, and it is also used in many places, so we can replace the raw string with `SHUFFLE_SERVICE_ENABLED` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
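A sketch of the cleanup (the constant lives in Spark's internal config package, so this compiles only inside Spark itself; the exact builder shape is assumed):

    import org.apache.spark.internal.config.ConfigBuilder  // Spark-internal API

    // Typed config entry instead of a repeated string literal.
    val SHUFFLE_SERVICE_ENABLED =
      ConfigBuilder("spark.shuffle.service.enabled").booleanConf.createWithDefault(false)

    // before: conf.getBoolean("spark.shuffle.service.enabled", false)
    // after:  conf.get(SHUFFLE_SERVICE_ENABLED)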
[jira] [Created] (SPARK-25249) Add a unit test for OpenHashMap
liuxian created SPARK-25249: --- Summary: Add a unit test for OpenHashMap Key: SPARK-25249 URL: https://issues.apache.org/jira/browse/SPARK-25249 Project: Spark Issue Type: Test Components: Tests Affects Versions: 2.4.0 Reporter: liuxian Adding a unit test for OpenHashMap; this can help developers distinguish between 0/0.0/0L and a non-existent value -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
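A sketch of the distinction such a test should pin down (OpenHashMap is Spark-internal, so this would live in Spark's own test tree; the API shape is assumed):

    import org.apache.spark.util.collection.OpenHashMap

    val map = new OpenHashMap[String, Int]()
    map("stored") = 0
    assert(map.contains("stored"))
    assert(!map.contains("absent"))
    assert(map("stored") == 0)
    // map("absent") also yields 0 (the default for Int), so contains()
    // is the only way to tell a stored zero apart from a missing key.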
[jira] [Updated] (SPARK-25166) Reduce the number of write operations for shuffle write.
[ https://issues.apache.org/jira/browse/SPARK-25166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25166: Description: Currently, only one record is written to a buffer each time, which increases the number of copies. (was: Currently, each record will be write to a buffer , which increases the number of copies.) > Reduce the number of write operations for shuffle write. > > > Key: SPARK-25166 > URL: https://issues.apache.org/jira/browse/SPARK-25166 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > Currently, only one record is written to a buffer each time, which increases > the number of copies. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25166) Reduce the number of write operations for shuffle write.
liuxian created SPARK-25166: --- Summary: Reduce the number of write operations for shuffle write. Key: SPARK-25166 URL: https://issues.apache.org/jira/browse/SPARK-25166 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 2.4.0 Reporter: liuxian Currently, each record is written to a buffer one at a time, which increases the number of copies. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
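A sketch of the batching idea (names and the 64 KB size are invented for illustration): stage several serialized records in a local buffer and flush it once it fills, instead of issuing one write per record.

    import java.io.{ByteArrayOutputStream, OutputStream}

    val batch = new ByteArrayOutputStream(64 * 1024)

    // Flush the staged batch when the next record would overflow it.
    def writeRecord(bytes: Array[Byte], out: OutputStream): Unit = {
      if (batch.size() > 0 && batch.size() + bytes.length > 64 * 1024) {
        batch.writeTo(out)
        batch.reset()
      }
      batch.write(bytes)
    }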
[jira] [Created] (SPARK-24994) When the data type of the field is converted to other types, it can also support pushdown to parquet
liuxian created SPARK-24994: --- Summary: When the data type of the field is converted to other types, it can also support pushdown to parquet Key: SPARK-24994 URL: https://issues.apache.org/jira/browse/SPARK-24994 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: liuxian For this statement: select * from table1 where a = 100; the data type of `a` is `smallint`. Because the default data type of 100 is `int`, the column `a` is cast to `int`, and in this case the filter cannot be pushed down to parquet. In our business SQL statements we generally do not cast 100 to `smallint`, so we hope pushdown to parquet can be supported for this situation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
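An illustration of the cast direction that blocks pushdown (the table name comes from the description; the {{S}} suffix marks a smallint literal in Spark SQL):

    spark.sql("create table table1 (a smallint) using parquet")
    // The int literal wraps the column in a cast, so the parquet scan
    // reports no pushed filter for: cast(a as int) = 100
    spark.sql("select * from table1 where a = 100").explain()
    // A smallint literal keeps the comparison on the column itself:
    spark.sql("select * from table1 where a = 100S").explain()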
[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442176#comment-16442176 ] liuxian edited comment on SPARK-23989 at 4/18/18 9:21 AM: --

test("groupBy") {
  spark.conf.set("spark.sql.shuffle.partitions", 16777217)
  val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
    .toDF("key", "value1", "value2", "rest")
  checkAnswer(
    df1.groupBy("key").min("value2"),
    Seq(Row("a", 0), Row("b", 4)))
}

Because the number of partitions is too large, it will run for a long time. The number of partitions is made this large on purpose, so that the `SortShuffleWriter` path is taken.

was (Author: 10110346):

test("groupBy") {
  spark.conf.set("spark.sql.shuffle.partitions", 16777217)
  val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
    .toDF("key", "value1", "value2", "rest")
  checkAnswer(
    df1.groupBy("key").min("value2"),
    Seq(Row("a", 0), Row("b", 4)))
}

Because the number of partitions is too large, it will run for a long time. The number of partitions is made this large on purpose, so that the `SortShuffleWriter` path is taken. > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442176#comment-16442176 ] liuxian commented on SPARK-23989: -

test("groupBy") {
  spark.conf.set("spark.sql.shuffle.partitions", 16777217)
  val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
    .toDF("key", "value1", "value2", "rest")
  checkAnswer(
    df1.groupBy("key").min("value2"),
    Seq(Row("a", 0), Row("b", 4)))
}

Because the number of partitions is too large, it will run for a long time. The number of partitions is made this large on purpose, so that the `SortShuffleWriter` path is taken. > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23989: Attachment: (was: 无标题2.png) > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441982#comment-16441982 ] liuxian commented on SPARK-23989: - We assume that: numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > Attachments: 无标题2.png > > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441980#comment-16441980 ] liuxian commented on SPARK-23989: - I think 'SortShuffleWriter' should adapt to any shuffle-write scenario > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > Attachments: 无标题2.png > > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441952#comment-16441952 ] liuxian edited comment on SPARK-23989 at 4/18/18 6:21 AM: --

1. Disable 'BypassMergeSortShuffleHandle' and 'SerializedShuffleHandle':

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency) && false) {
    // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
    // need map-side aggregation, then write numPartitions files directly and just concatenate
    // them at the end. This avoids doing serialization and deserialization twice to merge
    // together the spilled files, which would happen with the normal code path. The downside is
    // having multiple files open at a time and thus more memory allocated to buffers.
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency) && false) {
    // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}

2. Run this unit test in 'DataFrameAggregateSuite.scala': test("SPARK-21580 ints in aggregation expressions are taken as group-by ordinal.")

3. I have been debugging in IDEA and grabbed this information:

buffer = {PartitionedPairBuffer@9817}
  capacity = 64
  curSize = 2
  data = {Object[128]@9832}
    0 = {Tuple2@9834} "(3,3)"
    1 = {UnsafeRow@9835} "[0,2,2]"
    2 = {Tuple2@9841} "(4,4)"
    3 = {UnsafeRow@9835} "[0,2,2]"

(Indexes 1 and 3 hold the same UnsafeRow@9835 instance.)

was (Author: 10110346):

1. Disable 'BypassMergeSortShuffleHandle' and 'SerializedShuffleHandle':

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency) && false) {
    // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
    // need map-side aggregation, then write numPartitions files directly and just concatenate
    // them at the end. This avoids doing serialization and deserialization twice to merge
    // together the spilled files, which would happen with the normal code path. The downside is
    // having multiple files open at a time and thus more memory allocated to buffers.
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency) &&
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441952#comment-16441952 ] liuxian commented on SPARK-23989: -

1. Disable 'BypassMergeSortShuffleHandle' and 'SerializedShuffleHandle':

override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency) && false) {
    // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
    // need map-side aggregation, then write numPartitions files directly and just concatenate
    // them at the end. This avoids doing serialization and deserialization twice to merge
    // together the spilled files, which would happen with the normal code path. The downside is
    // having multiple files open at a time and thus more memory allocated to buffers.
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency) && false) {
    // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // Otherwise, buffer map outputs in a deserialized form:
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}

2. Run this unit test in 'DataFrameAggregateSuite.scala': test("SPARK-21580 ints in aggregation expressions are taken as group-by ordinal.")

3. I have been debugging in IDEA and grabbed this information:

buffer = {PartitionedPairBuffer@9817}
  capacity = 64
  curSize = 2
  data = {Object[128]@9832}
    0 = {Tuple2@9834} "(3,3)"
    1 = {UnsafeRow@9835} "[0,2,2]"
    2 = {Tuple2@9841} "(4,4)"
    3 = {UnsafeRow@9835} "[0,2,2]"

(Index 3 holds the same UnsafeRow@9835 instance as index 1.)

> When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > Attachments: 无标题2.png > > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23989: Attachment: 无标题2.png > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > Attachments: 无标题2.png > > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440250#comment-16440250 ] liuxian commented on SPARK-23989: - If we disable 'BypassMergeSortShuffleHandle' and 'SerializedShuffleHandle', a lot of unit tests in 'DataFrameAggregateSuite.scala' will fail > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23992) ShuffleDependency does not need to be deserialized every time
liuxian created SPARK-23992: --- Summary: ShuffleDependency does not need to be deserialized every time Key: SPARK-23992 URL: https://issues.apache.org/jira/browse/SPARK-23992 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: liuxian Within the same stage, the 'ShuffleDependency' does not need to be deserialized every time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
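A sketch of the caching idea (names invented; not the actual patch): memoize the deserialized dependency per shuffle id so tasks of the same stage on one executor reuse a single instance.

    import java.util.concurrent.ConcurrentHashMap

    val depCache = new ConcurrentHashMap[Integer, AnyRef]()

    // deserialize() runs at most once per shuffle id on this executor.
    def getOrDeserialize(shuffleId: Integer, deserialize: () => AnyRef): AnyRef =
      depCache.computeIfAbsent(shuffleId, _ => deserialize())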
[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439231#comment-16439231 ] liuxian edited comment on SPARK-23989 at 4/16/18 10:18 AM: --- For `SortShuffleWriter`, `records: Iterator[Product2[K, V]]` is a key-value pair iterator, but the value is of 'UnsafeRow' type. For example, when we insert the first record into `PartitionedPairBuffer`, we only save the 'AnyRef', but the 'AnyRef' of the next record (only the value, not the key) is the same as the first record's, so the first record is overwritten. was (Author: 10110346): For `SortShuffleWriter`, `records: Iterator[Product2[K, V]]` is a key-value pair iterator, but the value is of 'UnsafeRow' type. For example, when we insert the first record into `PartitionedPairBuffer`, we only save the 'AnyRef', but the 'AnyRef' of the next record (only the value, not the key) is the same as the first record's, so the first record is overwritten. h1. overwritten > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439231#comment-16439231 ] liuxian commented on SPARK-23989: - For `SortShuffleWriter`, `records: Iterator[Product2[K, V]]` is a key-value pair iterator, but the value is of 'UnsafeRow' type. For example, when we insert the first record into `PartitionedPairBuffer`, we only save the 'AnyRef', but the 'AnyRef' of the next record (only the value, not the key) is the same as the first record's, so the first record is overwritten. h1. overwritten > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439148#comment-16439148 ] liuxian edited comment on SPARK-23989 at 4/16/18 9:00 AM: -- [~joshrosen] [~cloud_fan] was (Author: 10110346): [~joshrosen] > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439148#comment-16439148 ] liuxian commented on SPARK-23989: - [~joshrosen] > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
liuxian created SPARK-23989: --- Summary: When using `SortShuffleWriter`, the data will be overwritten Key: SPARK-23989 URL: https://issues.apache.org/jira/browse/SPARK-23989 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: liuxian When using `SortShuffleWriter`, we only insert 'AnyRef' into ' PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. For this function: override def write(records: Iterator[Product2[K, V]]) the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
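A minimal, self-contained illustration of the aliasing failure mode described in this report (not Spark code; UnsafeRow-producing iterators reuse one mutable row object, so buffering the reference instead of a copy makes every buffered entry point at the latest row):

    // The producer reuses one mutable record, so storing references without
    // copying leaves every slot pointing at the final value.
    final class MutableRecord(var value: Int) {
      def copyRecord(): MutableRecord = new MutableRecord(value)
    }

    val reused = new MutableRecord(0)
    val buffered = scala.collection.mutable.ArrayBuffer[MutableRecord]()
    for (v <- Seq(1, 2, 3)) {
      reused.value = v
      buffered += reused                 // bug: all three entries alias `reused`
      // buffered += reused.copyRecord() // fix: defensive copy per record
    }
    println(buffered.map(_.value))       // ArrayBuffer(3, 3, 3), not (1, 2, 3)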
[jira] [Updated] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23989: Description: When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. For this function: override def write(records: Iterator[Product2[K, V]]) the value of 'records' is `UnsafeRow`, so the value will be overwritten was: When using `SortShuffleWriter`, we only insert 'AnyRef' into ' PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. For this function: override def write(records: Iterator[Product2[K, V]]) the value of 'records' is `UnsafeRow`, so the value will be overwritten > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the value of 'records' is `UnsafeRow`, so the value will be overwritten -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23744) Memory leak in ReadableChannelFileRegion
[ https://issues.apache.org/jira/browse/SPARK-23744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23744: Description: In the class _ReadableChannelFileRegion_, the _buffer_ is direct memory, so we should modify _deallocate_ to free it (was: In the class _ReadableChannelFileRegion_, the _buffer_ is direct memory, we should modify deallocate to free it) > Memory leak in ReadableChannelFileRegion > > > Key: SPARK-23744 > URL: https://issues.apache.org/jira/browse/SPARK-23744 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > In the class _ReadableChannelFileRegion_, the _buffer_ is direct memory, so we > should modify _deallocate_ to free it -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23744) Memory leak in ReadableChannelFileRegion
[ https://issues.apache.org/jira/browse/SPARK-23744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23744: Description: In the class _ReadableChannelFileRegion_, the _buffer_ is direct memory, we should modify deallocate to free it (was: In the class `_ReadableChannelFileRegion_`, the `buffer` is direct memory, we should modify `_deallocate_` to free it) > Memory leak in ReadableChannelFileRegion > > > Key: SPARK-23744 > URL: https://issues.apache.org/jira/browse/SPARK-23744 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > In the class _ReadableChannelFileRegion_, the _buffer_ is direct memory, we > should modify deallocate to free it -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23744) Memory leak in ReadableChannelFileRegion
liuxian created SPARK-23744: --- Summary: Memory leak in ReadableChannelFileRegion Key: SPARK-23744 URL: https://issues.apache.org/jira/browse/SPARK-23744 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: liuxian In the class `_ReadableChannelFileRegion_`, the `buffer` is direct memory, so we should modify `_deallocate_` to free it -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
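A minimal Scala sketch of the fix being proposed; the helper below is an assumption about the mechanism (explicitly invoking the direct buffer's cleaner, which works on Java 8), not the exact patch.

{code:scala}
import java.nio.ByteBuffer

// Hypothetical helper: a direct ByteBuffer lives outside the GC heap, so an
// overridden deallocate() should release it explicitly rather than wait for
// the garbage collector to reclaim the tiny on-heap wrapper object.
def freeDirectBuffer(buffer: ByteBuffer): Unit = {
  if (buffer != null && buffer.isDirect) {
    val cleaner = buffer.asInstanceOf[sun.nio.ch.DirectBuffer].cleaner()
    if (cleaner != null) cleaner.clean() // cleaner is null for slices/duplicates
  }
}

// Sketch of usage inside a FileRegion-like class:
//   override def deallocate(): Unit = freeDirectBuffer(buffer)
{code}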
[jira] [Resolved] (SPARK-23651) Add a check for host name
[ https://issues.apache.org/jira/browse/SPARK-23651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian resolved SPARK-23651. - Resolution: Fixed > Add a check for host name > -- > > Key: SPARK-23651 > URL: https://issues.apache.org/jira/browse/SPARK-23651 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > I encountered an error like this: > _org.apache.spark.SparkException: Invalid Spark URL: > spark://HeartbeatReceiver@ci_164:42849_ > _at > org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)_ > _at > org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)_ > _at > org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)_ > _at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)_ > _at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)_ > _at org.apache.spark.executor.Executor.<init>(Executor.scala:155)_ > _at > org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:59)_ > _at > org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)_ > _at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)_ > > I didn't know why this _URL_ (spark://HeartbeatReceiver@ci_164:42849) was > invalid, so I think we should give a clearer error message for this case. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23651) Add a check for host name
liuxian created SPARK-23651: --- Summary: Add a check for host name Key: SPARK-23651 URL: https://issues.apache.org/jira/browse/SPARK-23651 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: liuxian I encountered an error like this: _org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver@ci_164:42849_ _at org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)_ _at org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)_ _at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)_ _at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)_ _at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)_ _at org.apache.spark.executor.Executor.<init>(Executor.scala:155)_ _at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:59)_ _at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)_ _at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)_ I didn't know why this _URL_ (spark://HeartbeatReceiver@ci_164:42849) was invalid, so I think we should give a clearer error message for this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
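A short Scala sketch of why that URL is rejected and what an upfront check could look like (the helper name is an illustration, not Spark's API): java.net.URI cannot parse a host containing an underscore, so getHost comes back null, and the RPC layer only fails later with the opaque 'Invalid Spark URL' message.

{code:scala}
import java.net.URI

// Hypothetical validation: fail fast with a readable message. For a host such
// as ci_164, java.net.URI falls back to registry-based authority parsing and
// leaves getHost null, which is exactly what RpcEndpointAddress trips over.
def assertValidHost(host: String): Unit = {
  val uri = new URI(s"spark://HeartbeatReceiver@$host:42849")
  require(uri.getHost != null,
    s"'$host' is not a valid host name (underscores are not allowed); " +
      "check the machine's hostname configuration")
}

assertValidHost("ci-164") // passes: hyphens are legal in host names
assertValidHost("ci_164") // throws IllegalArgumentException with a clear hint
{code}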
[jira] [Resolved] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory
[ https://issues.apache.org/jira/browse/SPARK-23516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian resolved SPARK-23516. - Resolution: Invalid > I think it is unnecessary to transfer unroll memory to storage memory > -- > > Key: SPARK-23516 > URL: https://issues.apache.org/jira/browse/SPARK-23516 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Minor > > In fact, unroll memory is also storage memory, so I think it is unnecessary to > actually release unroll memory and then acquire storage memory again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23532: Description: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, refer to https://issues.apache.org/jira/browse/SPARK-16944 It would be better if Standalone could also support this feature. was: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, Refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, Refer to https://issues.apache.org/jira/browse/SPARK-16944 It would be better if Standalone could also support this feature. > [STANDALONE] Improve data locality when launching new executors for dynamic > allocation > -- > > Key: SPARK-23532 > URL: https://issues.apache.org/jira/browse/SPARK-23532 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > Currently Spark on Yarn supports better data locality by considering the > preferred locations of the pending tasks when dynamic allocation is enabled, > refer to https://issues.apache.org/jira/browse/SPARK-4352. > Mesos also supports data locality, refer to > https://issues.apache.org/jira/browse/SPARK-16944 > It would be better if Standalone could also support this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23532: Description: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, Refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, Refer to https://issues.apache.org/jira/browse/SPARK-16944 It would be better if Standalone could also support this feature. was: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, Refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, Refer to https://issues.apache.org/jira/browse/SPARK-16944+ It would be better if Standalone could also support this feature. > [STANDALONE] Improve data locality when launching new executors for dynamic > allocation > -- > > Key: SPARK-23532 > URL: https://issues.apache.org/jira/browse/SPARK-23532 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > Currently Spark on Yarn supports better data locality by considering the > preferred locations of the pending tasks when dynamic allocation is enabled, > Refer to https://issues.apache.org/jira/browse/SPARK-4352. > Mesos also supports data locality, Refer to > https://issues.apache.org/jira/browse/SPARK-16944 > It would be better if Standalone could also support this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-23532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23532: Description: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, Refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, Refer to https://issues.apache.org/jira/browse/SPARK-16944+ It would be better if Standalone could also support this feature. was: Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, Refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, Refer to +https://issues.apache.org/jira/browse/SPARK-16944+ It would be better if Standalone could also support this feature. > [STANDALONE] Improve data locality when launching new executors for dynamic > allocation > -- > > Key: SPARK-23532 > URL: https://issues.apache.org/jira/browse/SPARK-23532 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > Currently Spark on Yarn supports better data locality by considering the > preferred locations of the pending tasks when dynamic allocation is enabled, > Refer to https://issues.apache.org/jira/browse/SPARK-4352. > Mesos also supports data locality, Refer to > https://issues.apache.org/jira/browse/SPARK-16944+ > It would be better if Standalone could also support this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23532) [STANDALONE] Improve data locality when launching new executors for dynamic allocation
liuxian created SPARK-23532: --- Summary: [STANDALONE] Improve data locality when launching new executors for dynamic allocation Key: SPARK-23532 URL: https://issues.apache.org/jira/browse/SPARK-23532 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.3.0 Reporter: liuxian Currently Spark on Yarn supports better data locality by considering the preferred locations of the pending tasks when dynamic allocation is enabled, Refer to https://issues.apache.org/jira/browse/SPARK-4352. Mesos also supports data locality, Refer to +https://issues.apache.org/jira/browse/SPARK-16944+ It would be better if Standalone could also support this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory
liuxian created SPARK-23516: --- Summary: I think it is unnecessary to transfer unroll memory to storage memory Key: SPARK-23516 URL: https://issues.apache.org/jira/browse/SPARK-23516 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: liuxian In fact, unroll memory is also storage memory, so I think it is unnecessary to actually release unroll memory and then acquire storage memory again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
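A hypothetical Scala model of the redundancy being described (the pool class below is an illustration, not Spark's MemoryManager): when unroll and storage bytes are accounted against the same pool, releasing N unroll bytes only to immediately re-acquire N storage bytes is a round trip through one counter.

{code:scala}
// Illustrative model only: unroll and storage usage draw on one shared pool.
class StoragePool(val maxBytes: Long) {
  private var usedBytes = 0L

  def acquire(bytes: Long): Boolean =
    if (usedBytes + bytes <= maxBytes) { usedBytes += bytes; true } else false

  def release(bytes: Long): Unit = { usedBytes = math.max(0L, usedBytes - bytes) }

  // The round trip the issue calls unnecessary: the pool ends up exactly
  // where it started, so the release/acquire pair is pure bookkeeping churn.
  def transferViaReleaseAcquire(bytes: Long): Boolean = {
    release(bytes)
    acquire(bytes)
  }

  // The suggested view: within a single pool, the transfer is a relabeling
  // of bytes that are already held, so it can never fail and moves nothing.
  def transferDirectly(bytes: Long): Boolean = true
}
{code}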
[jira] [Resolved] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory
[ https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian resolved SPARK-23404. - Resolution: Invalid > When the underlying buffers are already direct, we should copy them to the > heap memory > -- > > Key: SPARK-23404 > URL: https://issues.apache.org/jira/browse/SPARK-23404 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Minor > > If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we > should copy them to the heap memory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory
[ https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23404: Description: If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we should copy them to the heap memory. (was: If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we should copy it to the heap memory.) > When the underlying buffers are already direct, we should copy them to the > heap memory > -- > > Key: SPARK-23404 > URL: https://issues.apache.org/jira/browse/SPARK-23404 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Minor > > If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we > should copy them to the heap memory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory
[ https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23404: Summary: When the underlying buffers are already direct, we should copy them to the heap memory (was: When the underlying buffers are already direct, we should copy it to the heap memory) > When the underlying buffers are already direct, we should copy them to the > heap memory > -- > > Key: SPARK-23404 > URL: https://issues.apache.org/jira/browse/SPARK-23404 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Minor > > If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we > should copy it to the heap memory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23404) When the underlying buffers are already direct, we should copy it to the heap memory
liuxian created SPARK-23404: --- Summary: When the underlying buffers are already direct, we should copy it to the heap memory Key: SPARK-23404 URL: https://issues.apache.org/jira/browse/SPARK-23404 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: liuxian If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we should copy it to the heap memory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
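A compact Scala sketch of the copy being proposed (the function below is illustrative, not Spark's actual API): when on-heap mode is requested but the buffer arrived as direct memory, allocate a heap buffer and copy the readable bytes across.

{code:scala}
import java.nio.ByteBuffer

// Hypothetical helper: return the buffer unchanged unless on-heap mode is
// requested and the buffer is direct, in which case copy it onto the heap.
def toHeapIfNeeded(buf: ByteBuffer, onHeapMode: Boolean): ByteBuffer = {
  if (onHeapMode && buf.isDirect) {
    val heapCopy = ByteBuffer.allocate(buf.remaining())
    heapCopy.put(buf.duplicate()) // duplicate() leaves the source position intact
    heapCopy.flip()               // make the copy readable from position 0
    heapCopy
  } else {
    buf
  }
}
{code}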
[jira] [Updated] (SPARK-23391) It may lead to overflow for some integer multiplication
[ https://issues.apache.org/jira/browse/SPARK-23391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23391: Priority: Minor (was: Major) > It may lead to overflow for some integer multiplication > > > Key: SPARK-23391 > URL: https://issues.apache.org/jira/browse/SPARK-23391 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Minor > > In {{getBlockData}}, {{blockId.reduceId}} is of the {{Int}} type; when it is > greater than 2^28, {{blockId.reduceId*8}} will overflow. > In _decompress0_, _len_ and _unitSize_ are of the {{Int}} type, so _len * > unitSize_ may also overflow > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23391) It may lead to overflow for some integer multiplication
[ https://issues.apache.org/jira/browse/SPARK-23391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23391: Description: In {{getBlockData}}, {{blockId.reduceId}} is of the {{Int}} type; when it is greater than 2^28, {{blockId.reduceId*8}} will overflow. In _decompress0_, _len_ and _unitSize_ are of the {{Int}} type, so _len * unitSize_ may also overflow was: In {{getBlockData}}, {{blockId.reduceId}} is of the {{Int}} type; when it is greater than 2^28, {{blockId.reduceId*8}} will overflow. In _decompress0_, _len_ and _unitSize_ are of the {{Int}} type, so _len * unitSize_ may also overflow > It may lead to overflow for some integer multiplication > > > Key: SPARK-23391 > URL: https://issues.apache.org/jira/browse/SPARK-23391 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > In {{getBlockData}}, {{blockId.reduceId}} is of the {{Int}} type; when it is > greater than 2^28, {{blockId.reduceId*8}} will overflow. > In _decompress0_, _len_ and _unitSize_ are of the {{Int}} type, so _len * > unitSize_ may also overflow > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23391) It may lead to overflow for some integer multiplication
[ https://issues.apache.org/jira/browse/SPARK-23391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23391: Description: In {{getBlockData}}, {{blockId.reduceId}} is of the {{Int}} type; when it is greater than 2^28, {{blockId.reduceId*8}} will overflow. In _decompress0_, _len_ and _unitSize_ are of the {{Int}} type, so _len * unitSize_ may also overflow was: In {{getBlockData}}, {{blockId.reduceId}} is of the {{Int}} type; when it is greater than 2^28, {{blockId.reduceId*8}} will overflow. In _decompress0_, _len_ and _unitSize_ are of the {{Int}} type, so _len * unitSize_ may also overflow > It may lead to overflow for some integer multiplication > > > Key: SPARK-23391 > URL: https://issues.apache.org/jira/browse/SPARK-23391 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Major > > In {{getBlockData}}, {{blockId.reduceId}} is of the {{Int}} type; when it is > greater than 2^28, {{blockId.reduceId*8}} will overflow. > In _decompress0_, _len_ and _unitSize_ are of the {{Int}} type, so _len * > unitSize_ may also overflow > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23391) It may lead to overflow for some integer multiplication
liuxian created SPARK-23391: --- Summary: It may lead to overflow for some integer multiplication Key: SPARK-23391 URL: https://issues.apache.org/jira/browse/SPARK-23391 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: liuxian In {{getBlockData}}, {{blockId.reduceId}} is of the {{Int}} type; when it is greater than 2^28, {{blockId.reduceId*8}} will overflow. In _decompress0_, _len_ and _unitSize_ are of the {{Int}} type, so _len * unitSize_ may also overflow -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
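The arithmetic in question as a small runnable Scala snippet (the values are just a demonstration): an Int-by-Int product is evaluated in 32-bit space even when the result is assigned to a Long, so the fix is to widen one operand first.

{code:scala}
// For any reduceId above 2^28, reduceId * 8 exceeds Int.MaxValue (2^31 - 1).
val reduceId: Int = (1 << 28) + 1 // 268435457, just past the threshold

val overflowed: Long = reduceId * 8 // Int math wraps around: -2147483640
val widened: Long = reduceId * 8L   // Long math is correct: 2147483656

println(s"overflowed = $overflowed, widened = $widened")
{code}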
[jira] [Created] (SPARK-23389) When the shuffle dependency specifies aggregation, and `dependency.mapSideCombine=false`, we should be able to use serialized sorting.
liuxian created SPARK-23389: --- Summary: When the shuffle dependency specifies aggregation, and `dependency.mapSideCombine=false`, we should be able to use serialized sorting. Key: SPARK-23389 URL: https://issues.apache.org/jira/browse/SPARK-23389 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: liuxian When the shuffle dependency specifies aggregation, and `dependency.mapSideCombine=false`, there is no need for aggregation or sorting on the map side, so we should be able to use serialized sorting. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
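A hedged Scala sketch of the relaxed eligibility check being suggested (the names and the partition limit below are assumptions, not the actual SortShuffleManager code): an aggregator by itself should not disqualify serialized sorting; only map-side combining should.

{code:scala}
// Illustrative predicate: serialized sorting needs a serializer whose written
// bytes can be relocated, no map-side combine, and a partition count within
// the encoding limit. Having an aggregator with mapSideCombine = false does
// not disqualify the dependency under this check.
def canUseSerializedSorting(
    serializerSupportsRelocation: Boolean,
    mapSideCombine: Boolean,
    numPartitions: Int,
    maxPartitions: Int = 1 << 24): Boolean = {
  serializerSupportsRelocation && !mapSideCombine && numPartitions <= maxPartitions
}

// Example: an aggregating dependency without map-side combine qualifies.
val eligible = canUseSerializedSorting(
  serializerSupportsRelocation = true,
  mapSideCombine = false,
  numPartitions = 200)
{code}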
[jira] [Updated] (SPARK-23358) When the number of partitions is greater than 2^28, it will produce an incorrect result
[ https://issues.apache.org/jira/browse/SPARK-23358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-23358: Description: In `checkIndexAndDataFile`, _blocks_ is of the _Int_ type; when it is greater than 2^28, `blocks*8` will overflow and produce an incorrect result. In fact, `blocks` is the number of partitions. was: In `checkIndexAndDataFile`, `blocks` is of the `Int` type; when it is greater than 2^28, `blocks*8` will overflow and produce an incorrect result. In fact, `blocks` is the number of partitions. > When the number of partitions is greater than 2^28, it will produce an > incorrect result > - > > Key: SPARK-23358 > URL: https://issues.apache.org/jira/browse/SPARK-23358 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Major > > In `checkIndexAndDataFile`, _blocks_ is of the _Int_ type; when it is > greater than 2^28, `blocks*8` will overflow and produce an incorrect result. > In fact, `blocks` is the number of partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
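A small Scala sketch of the shape of the fix (the method below is illustrative, not the exact Spark code): validate the index file length with the multiplication widened to Long so that partition counts above 2^28 cannot wrap.

{code:scala}
import java.io.File

// Hypothetical check mirroring the one described: the index file should hold
// (blocks + 1) eight-byte offsets. Multiplying by 8L forces 64-bit
// arithmetic, so a large partition count cannot overflow the expected length.
def indexLengthMatches(indexFile: File, blocks: Int): Boolean =
  indexFile.length() == (blocks + 1) * 8L
{code}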