[jira] [Comment Edited] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174660#comment-16174660 ] Serge Smertin edited comment on SPARK-18727 at 9/21/17 12:31 PM:

I have use-cases similar to those mentioned in [#comment-15987668] by [~simeons] - adding fields to nested _struct_ columns. The application is built so that Parquet files are created and partitioned outside of Spark, and only new columns are ever added, mostly within a couple of nested structs. I don't know all the potential implications of the idea, but could we use the last element of the selected files instead of the first one, given that the FileStatus [list is already sorted by path lexicographically|https://github.com/apache/spark/blob/32fa0b81411f781173e185f4b19b9fd6d118f9fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L251]? It is easier to guarantee that only new columns are added over time, and the following change does not seem to be a huge deviation from the current behavior, while saving a tremendous amount of time compared to {{spark.sql.parquet.mergeSchema=true}}:

{code:scala}
// ParquetFileFormat.scala (lines 232..240)
filesByType.commonMetadata.lastOption
  .orElse(filesByType.metadata.lastOption)
  .orElse(filesByType.data.lastOption)
{code}

/cc [~r...@databricks.com] [~xwu0226]

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Eric Liang
> Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, one issue for scalable partition handling remains: handling schema updates. Currently, a schema update requires dropping and recreating the entire table, which does not scale well with the size of the table.
>
> We should support updating the schema of the table, either via ALTER TABLE, or automatically as new files with compatible schemas are appended into the table.
>
> cc [~rxin]
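A minimal read-side sketch of the trade-off described in the comment above (the {{/data/events}} path and file layout are hypothetical; it assumes columns are only ever added and that newer part files sort last lexicographically):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-pick-sketch").getOrCreate()

// The slow-but-safe option today: merge the footers of every part file.
val merged = spark.read.option("mergeSchema", "true").parquet("/data/events")

// The fast option today reads a single footer, effectively the *first* file's,
// so columns added in later files are silently missing from the schema.
val firstOnly = spark.read.parquet("/data/events")

// Under the proposed lastOption change, the same single-footer read would pick
// the lexicographically last file, whose schema is a superset whenever columns
// are only ever appended.
{code}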
[jira] [Commented] (SPARK-4368) Ceph integration?
[ https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964972#comment-14964972 ] Serge Smertin commented on SPARK-4368:

If it's decided that this should be hosted outside of the project, is there any documented way to add a new storage abstraction?
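For context, a sketch of how an out-of-tree Hadoop FileSystem typically plugs into Spark through configuration alone (the {{ceph://}} path is made up; the class name is the one from the cephfs-hadoop project linked in the next comment, registered under Hadoop's standard {{fs.<scheme>.impl}} convention):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Spark delegates file I/O to the Hadoop FileSystem API, so an implementation
// on the classpath can be registered without any Spark-side code changes.
val conf = new SparkConf()
  .setAppName("ceph-sketch")
  .set("spark.hadoop.fs.ceph.impl", "org.apache.hadoop.fs.ceph.CephFileSystem")
val sc = new SparkContext(conf)

// Once the ceph:// scheme resolves, ordinary reads work against it; binaryFiles
// suits the many-small-BLOBs use-case from the issue description.
val blobs = sc.binaryFiles("ceph://pool/blobs/")
{code}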
[jira] [Commented] (SPARK-4368) Ceph integration?
[ https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964970#comment-14964970 ] Serge Smertin commented on SPARK-4368:

Here you go - a Hadoop FileSystem implementation on top of Ceph. It even supports data locality: https://github.com/ceph/cephfs-hadoop/blob/master/src/main/java/org/apache/hadoop/fs/ceph/CephFileSystem.java#L538. The project also has a Vagrant setup.
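A sketch of where that locality support surfaces, using only the standard Hadoop FileSystem API (the URI and path are made up): Spark's HadoopRDD asks the FileSystem for block locations and prefers the returned hosts when scheduling tasks, so an implementation of {{getFileBlockLocations}} like the one linked above feeds directly into locality-aware scheduling.

{code:scala}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.ceph.impl", "org.apache.hadoop.fs.ceph.CephFileSystem")

// Ask the FileSystem where a file's blocks live; Spark makes the same call
// (via HadoopRDD.getPreferredLocations) to place tasks next to the data.
val fs = FileSystem.get(new URI("ceph://pool/"), conf)
val status = fs.getFileStatus(new Path("ceph://pool/blobs/blob-0001.bin"))
val locations = fs.getFileBlockLocations(status, 0, status.getLen)
locations.foreach(loc => println(loc.getHosts.mkString(", ")))
{code}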
[jira] [Commented] (SPARK-4368) Ceph integration?
[ https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965026#comment-14965026 ] Serge Smertin commented on SPARK-4368:

Thank you, Steve, for all the details.
[jira] [Created] (SPARK-4368) Ceph integration?
Serge Smertin created SPARK-4368:

Summary: Ceph integration?
Key: SPARK-4368
URL: https://issues.apache.org/jira/browse/SPARK-4368
Project: Spark
Issue Type: Bug
Components: Input/Output
Reporter: Serge Smertin

There is a use-case of storing a large number of relatively small BLOB objects (2-20 MB), which requires some ugly workarounds in HDFS environments. The BLOBs need to be processed close to the data themselves, which is why the MapReduce paradigm is a good fit: it guarantees data locality.

Ceph seems to be one of the systems that offers both properties (small-file friendliness and data locality) - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I already know that Spark supports GlusterFS - http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E

So I wonder: could there be an integration with this storage solution, and what would be the effort of doing that?