[jira] [Comment Edited] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-09-21 Thread Serge Smertin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174660#comment-16174660
 ] 

Serge Smertin edited comment on SPARK-18727 at 9/21/17 12:31 PM:
-----------------------------------------------------------------

I have use-cases similar to the ones mentioned in [#comment-15987668] by 
[~simeons] - adding fields to nested _struct_ fields. The application is built 
so that parquet files are created/partitioned outside of Spark and only new 
columns may be added, mostly within a couple of nested structs.

I don't know all the potential implications of the idea, but could we just use 
the last element of the selected files instead of the first one, given that the 
FileStatus [list is already sorted lexicographically by path 
lexicographically|https://github.com/apache/spark/blob/32fa0b81411f781173e185f4b19b9fd6d118f9fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L251]?
 That would make it easier to guarantee that only new columns are added over 
time, and the following code change doesn't seem to be a huge deviation from 
the current behavior, while saving a tremendous amount of time compared to 
{{spark.sql.parquet.mergeSchema=true}}:

{code:scala}
// ParquetFileFormat.scala (lines 232..240): pick the last file rather than the first
filesByType.commonMetadata.lastOption
  .orElse(filesByType.metadata.lastOption)
  .orElse(filesByType.data.lastOption)
{code}
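
For anyone who wants to approximate this from user code today, here is a 
minimal sketch of the same idea (the path is hypothetical, and it assumes 
columns are only ever added, so the lexicographically last file carries the 
widest schema):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("last-file-schema").getOrCreate()

// Hypothetical table location; inputFiles is a best-effort listing of the
// files backing the DataFrame, which we sort lexicographically by path.
val basePath = "/data/events"
val files = spark.read.parquet(basePath).inputFiles.sorted

// Take the schema from the last (newest) file only, then re-read the whole
// table with it, instead of paying for mergeSchema=true across all footers.
val latestSchema = spark.read.parquet(files.last).schema
val df = spark.read.schema(latestSchema).parquet(basePath)
{code}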

/cc [~r...@databricks.com] [~xwu0226] 


was (Author: nfx):
In one of the use-cases for the project mentioned in [#comment-15987668] by 
[~simeons], fields are added to nested _struct_ fields. The application is 
built so that parquet files are created/partitioned outside of Spark and only 
new columns may be added, mostly within a couple of nested structs.

I don't know all the potential implications of the idea, but could we just use 
the last element of the selected files instead of the first one, given that the 
FileStatus [list is already sorted lexicographically by path 
lexicographically|https://github.com/apache/spark/blob/32fa0b81411f781173e185f4b19b9fd6d118f9fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L251]?
 That would make it easier to guarantee that only new columns are added over 
time, and the following code change doesn't seem to be a huge deviation from 
the current behavior, while saving a tremendous amount of time compared to 
{{spark.sql.parquet.mergeSchema=true}}:

{code:scala}
// ParquetFileFormat.scala (lines 232..240): pick the last file rather than the first
filesByType.commonMetadata.lastOption
  .orElse(filesByType.metadata.lastOption)
  .orElse(filesByType.data.lastOption)
{code}

/cc [~r...@databricks.com] [~xwu0226] 

> Support schema evolution as new files are inserted into table
> --------------------------------------------------------------
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]
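
As a rough illustration of the ALTER TABLE route described above (the table 
and column names are made up, and ADD COLUMNS only became available in later 
Spark releases):

{code:scala}
// Widen the table schema explicitly, then append new data that already
// carries the column; files written before the change read it back as null.
spark.sql("ALTER TABLE events ADD COLUMNS (session_id STRING)")
spark.sql("INSERT INTO events SELECT * FROM staging_batch")
{code}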









[jira] [Commented] (SPARK-4368) Ceph integration?

2015-10-20 Thread Serge Smertin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964972#comment-14964972
 ] 

Serge Smertin commented on SPARK-4368:
---------------------------------------

If it's decided that this should be hosted outside of the project, is there 
any documented way to add a new storage abstraction?

> Ceph integration?
> -----------------
>
> Key: SPARK-4368
> URL: https://issues.apache.org/jira/browse/SPARK-4368
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Serge Smertin
>
> There is a use-case of storing a big number of relatively small BLOB objects 
> (2-20Mb), which requires some ugly workarounds in HDFS environments. There is 
> a need to process those BLOBs close to the data themselves, which is why the 
> MapReduce paradigm is a good fit, as it guarantees data locality.
> Ceph seems to be one of the systems that maintains both of these properties 
> (small files and data locality) -  
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I 
> already know that Spark supports GlusterFS - 
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E
> So I wonder, could there be an integration with this storage solution, and 
> what would be the effort of doing that? 






[jira] [Commented] (SPARK-4368) Ceph integration?

2015-10-20 Thread Serge Smertin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964970#comment-14964970
 ] 

Serge Smertin commented on SPARK-4368:
---------------------------------------

Here you go - a Hadoop FileSystem implementation on top of Ceph. It even 
supports data locality, and the project ships a Vagrant setup as well: 
https://github.com/ceph/cephfs-hadoop/blob/master/src/main/java/org/apache/hadoop/fs/ceph/CephFileSystem.java#L538
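
To make that concrete, here is a hedged sketch of wiring it up from Spark (the 
ceph:// URI, monitor address, and paths are assumptions following Hadoop 
conventions, and the cephfs-hadoop jar must be on the classpath):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Spark forwards "spark.hadoop."-prefixed keys into the Hadoop Configuration,
// which is how Hadoop's FileSystem factory learns about the ceph:// scheme.
val conf = new SparkConf()
  .setAppName("ceph-blobs")
  .set("spark.hadoop.fs.ceph.impl", "org.apache.hadoop.fs.ceph.CephFileSystem")

val sc = new SparkContext(conf)

// binaryFiles suits the small-BLOB use-case: each record is one whole file.
val blobs = sc.binaryFiles("ceph://monitor-host:6789/blobs/")
println(blobs.keys.count())
{code}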




[jira] [Commented] (SPARK-4368) Ceph integration?

2015-10-20 Thread Serge Smertin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965026#comment-14965026
 ] 

Serge Smertin commented on SPARK-4368:
---------------------------------------

Thank you, Steve, for all the details.




[jira] [Created] (SPARK-4368) Ceph integration?

2014-11-12 Thread Serge Smertin (JIRA)
Serge Smertin created SPARK-4368:


 Summary: Ceph integration?
 Key: SPARK-4368
 URL: https://issues.apache.org/jira/browse/SPARK-4368
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Reporter: Serge Smertin


There is a use-case of storing a big number of relatively small BLOB objects 
(2-20Mb), which requires some ugly workarounds in HDFS environments. There is 
a need to process those BLOBs close to the data themselves, which is why the 
MapReduce paradigm is a good fit, as it guarantees data locality.

Ceph seems to be one of the systems that maintains both of these properties 
(small files and data locality) -  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I 
already know that Spark supports GlusterFS - 
http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E

So I wonder, could there be an integration with this storage solution, and 
what would be the effort of doing that? 
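
For reference, the classic HDFS workaround alluded to above is to pack the 
blobs into a container format; a hedged sketch with hypothetical paths:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{BytesWritable, Text}

val sc = new SparkContext(new SparkConf().setAppName("packed-blobs"))

// The usual small-files workaround: many 2-20Mb blobs packed into a single
// SequenceFile keyed by blob name, processed with normal data locality.
val blobs = sc.sequenceFile("hdfs:///packed/blobs.seq", classOf[Text], classOf[BytesWritable])

// Note: Hadoop reuses Writable instances, so copy values before caching them.
val sizes = blobs.map { case (name, bytes) => (name.toString, bytes.getLength) }
println(sizes.count())
{code}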


