[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
URL: https://github.com/apache/incubator-iceberg/pull/529#discussion_r336281218

## File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java

@@ -0,0 +1,144 @@
```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.apache.iceberg.hadoop;

import com.google.common.base.Preconditions;
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.BaseMetastoreCatalog;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableOperations;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.exceptions.AlreadyExistsException;
import org.apache.iceberg.exceptions.RuntimeIOException;


public class HadoopCatalog extends BaseMetastoreCatalog implements Closeable {
  private static final String ICEBERG_HADOOP_WAREHOUSE_BASE = "iceberg/warehouse";
  private final Configuration conf;
  private String warehouseUri;

  public HadoopCatalog(Configuration conf, String warehouseUri) {
    this.conf = conf;

    if (warehouseUri != null) {
      this.warehouseUri = warehouseUri;
    } else {
      String fsRoot = conf.get("fs.defaultFS");
      Path warehousePath = new Path(fsRoot, ICEBERG_HADOOP_WAREHOUSE_BASE);
      try {
        FileSystem fs = Util.getFs(warehousePath, conf);
        if (!fs.isDirectory(warehousePath)) {
          if (!fs.mkdirs(warehousePath)) {
            throw new IOException("failed to create warehouse for hadoop catalog");
          }
        }
        this.warehouseUri = fsRoot + "/" + ICEBERG_HADOOP_WAREHOUSE_BASE;
      } catch (IOException e) {
        throw new RuntimeIOException("failed to create directory for warehouse", e);
      }
    }
  }

  public HadoopCatalog(Configuration conf) {
    this(conf, null);
  }

  @Override
  public org.apache.iceberg.Table createTable(
      TableIdentifier identifier, Schema schema, PartitionSpec spec, Map properties) {
    Preconditions.checkArgument(identifier.namespace().levels().length == 1,
        "Missing database in table identifier: %s", identifier);
    Path tablePath = new Path(defaultWarehouseLocation(identifier));
    try {
      FileSystem fs = Util.getFs(tablePath, conf);
      if (!fs.isDirectory(tablePath)) {
        fs.mkdirs(tablePath);
      } else {
        throw new AlreadyExistsException("the table already exists: " + identifier);
      }
    } catch (IOException e) {
      throw new RuntimeIOException("failed to create directory", e);
    }
    return super.createTable(identifier, schema, spec, null, properties);
  }

  public org.apache.iceberg.Table createTable(
      TableIdentifier identifier, Schema schema, PartitionSpec spec) {
    Preconditions.checkArgument(identifier.namespace().levels().length == 1,
        "Missing database in table identifier: %s", identifier);
    return createTable(identifier, schema, spec, null, null);
  }

  @Override
  protected TableOperations newTableOps(TableIdentifier identifier) {
    Preconditions.checkArgument(identifier.namespace().levels().length == 1,
        "Missing database in table identifier: %s", identifier);
    return new HadoopTableOperations(new Path(defaultWarehouseLocation(identifier)), conf);
  }

  @Override
  protected String defaultWarehouseLocation(TableIdentifier tableIdentifier) {
    String dbName = tableIdentifier.namespace().level(0);
    String tableName = tableIdentifier.name();
    return this.warehouseUri + "/" + dbName + ".db" + "/" + tableName;
  }

  @Override
  public boolean dropTable(TableIdentifier identifier, boolean purge) {
    Preconditions.checkArgument(identifier.namespace().levels().length == 1,
        "Missing database in table identifier: %s", identifier);

    Path tablePath = new Path(defaultWarehouseLocation(identifier));
    TableOperations ops =
```
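The quoted `defaultWarehouseLocation` implies a fixed on-disk layout of `<warehouseUri>/<db>.db/<table>`. The following is an illustrative sketch of just that string-building logic, pulled out of the catalog; the class name `WarehousePaths` is hypothetical and not part of the Iceberg API.

```java
public class WarehousePaths {
    // Mirrors the quoted defaultWarehouseLocation: <warehouseUri>/<db>.db/<table>
    public static String defaultWarehouseLocation(String warehouseUri, String dbName, String tableName) {
        return warehouseUri + "/" + dbName + ".db" + "/" + tableName;
    }

    public static void main(String[] args) {
        // Example location for table "events" in database "logs"
        System.out.println(defaultWarehouseLocation("hdfs://nn:8020/iceberg/warehouse", "logs", "events"));
        // -> hdfs://nn:8020/iceberg/warehouse/logs.db/events
    }
}
```

Because the location is derived entirely from the warehouse URI and the identifier, no metastore lookup is needed to find a table's directory.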
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
URL: https://github.com/apache/incubator-iceberg/pull/529#discussion_r336280986

## File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
URL: https://github.com/apache/incubator-iceberg/pull/529#discussion_r336280691

## File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java

```java
  @Override
  protected TableOperations newTableOps(TableIdentifier identifier) {
    Preconditions.checkArgument(identifier.namespace().levels().length == 1,
        "Missing database in table identifier: %s", identifier);
```

Review comment: Why restrict namespaces to 1 level?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards, Apache Git Services

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
URL: https://github.com/apache/incubator-iceberg/pull/529#discussion_r336280440

## File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java

```java
      if (!fs.isDirectory(tablePath)) {
        fs.mkdirs(tablePath);
      } else {
        throw new AlreadyExistsException("the table already exists: " + identifier);
```

Review comment: I don't think this is correct. The table exists if its metadata exists, not if the directory is present.
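The review comment above can be sketched concretely: an existence check should look for Iceberg metadata files, not just the table directory. This illustration uses local `java.nio.file` in place of the Hadoop `FileSystem` API, and the `metadata/*.metadata.json` layout is an assumption about how a Hadoop table stores its metadata; `MetadataCheck` is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class MetadataCheck {
    // A table "exists" only if its metadata directory holds at least one *.metadata.json file.
    public static boolean tableExists(Path tableDir) throws IOException {
        Path metadataDir = tableDir.resolve("metadata");
        if (!Files.isDirectory(metadataDir)) {
            return false;
        }
        try (Stream<Path> files = Files.list(metadataDir)) {
            return files.anyMatch(p -> p.getFileName().toString().endsWith(".metadata.json"));
        }
    }

    // Demonstrates the distinction: [0] = directory exists but no metadata,
    // [1] = after a metadata file is written.
    public static boolean[] demo() {
        try {
            Path tableDir = Files.createTempDirectory("iceberg-demo");
            boolean beforeMetadata = tableExists(tableDir);
            Files.createDirectories(tableDir.resolve("metadata"));
            Files.createFile(tableDir.resolve("metadata").resolve("v1.metadata.json"));
            boolean afterMetadata = tableExists(tableDir);
            return new boolean[] { beforeMetadata, afterMetadata };
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("exists before metadata: " + result[0] + ", after: " + result[1]);
    }
}
```

With a metadata-based check, a leftover empty directory would not block `createTable`, and a directory with metadata would correctly be reported as an existing table.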
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
URL: https://github.com/apache/incubator-iceberg/pull/529#discussion_r336280536

## File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java

```java
    return super.createTable(identifier, schema, spec, null, properties);
```

Review comment: `defaultWarehouseLocation` is overridden below. This only depends on `warehouseUri`.
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
rdblue commented on a change in pull request #529: Add hadoop table catalog (WIP)
URL: https://github.com/apache/incubator-iceberg/pull/529#discussion_r336279982

## File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java

```java
public class HadoopCatalog extends BaseMetastoreCatalog implements Closeable {
```

Review comment: Can you add documentation for how this catalog works? I believe that it creates Hadoop tables that require a file system with atomic rename. That should be stated in docs. I would also like to see a description of how this class is configured, where the tables are created, and what is implemented (no renameTable, is dropTable supported?).
[GitHub] [incubator-iceberg] feng-tao commented on issue #551: [python] First add to docs, addresses #323 and #363
feng-tao commented on issue #551: [python] First add to docs, addresses #323 and #363
URL: https://github.com/apache/incubator-iceberg/pull/551#issuecomment-543422736

@TGooch44 do you know if we will have a pypi package to try it out?
[GitHub] [incubator-iceberg] rdblue commented on issue #537: Docs: Fix typos
rdblue commented on issue #537: Docs: Fix typos
URL: https://github.com/apache/incubator-iceberg/pull/537#issuecomment-543422604

I'm closing this since I think the typo was actually correct and I haven't heard back. Feel free to reopen if you think it still needs to be fixed.
[GitHub] [incubator-iceberg] rdblue closed pull request #537: Docs: Fix typos
rdblue closed pull request #537: Docs: Fix typos
URL: https://github.com/apache/incubator-iceberg/pull/537
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336278823

## File path: site/docs/python-quickstart.md

# Examples

## Inspect Table Metadata

Review comment: The new wording sounds good. Thanks!
[GitHub] [incubator-iceberg] rdblue merged pull request #551: [python] First add to docs, addresses #323 and #363
rdblue merged pull request #551: [python] First add to docs, addresses #323 and #363
URL: https://github.com/apache/incubator-iceberg/pull/551
[GitHub] [incubator-iceberg] rdblue commented on issue #556: Fix Kryo serialization in ParquetUtil.getSplitOffsets
rdblue commented on issue #556: Fix Kryo serialization in ParquetUtil.getSplitOffsets
URL: https://github.com/apache/incubator-iceberg/pull/556#issuecomment-543421851

Looks like the failure is checkstyle:

```
[ant:checkstyle] [ERROR] /home/travis/build/apache/incubator-iceberg/spark/src/test/java/org/apache/iceberg/TestKryoSerialization.java:27:8: Unused import - org.apache.avro.generic.GenericData. [UnusedImports]
[ant:checkstyle] [ERROR] /home/travis/build/apache/incubator-iceberg/spark/src/test/java/org/apache/iceberg/TestKryoSerialization.java:41: Extra separation in import group before 'java.io.File' [ImportOrder]
[ant:checkstyle] [ERROR] /home/travis/build/apache/incubator-iceberg/spark/src/test/java/org/apache/iceberg/TestKryoSerialization.java:41: Wrong order for 'java.io.File' import. [ImportOrder]
[ant:checkstyle] [ERROR] /home/travis/build/apache/incubator-iceberg/spark/src/test/java/org/apache/iceberg/TestKryoSerialization.java:47:8: Unused import - java.util.List. [UnusedImports]
```
[GitHub] [incubator-iceberg] rdblue commented on issue #553: Spark ReadTask is expensive to serialize
rdblue commented on issue #553: Spark ReadTask is expensive to serialize
URL: https://github.com/apache/incubator-iceberg/issues/553#issuecomment-543421665

Using a broadcast sounds good to me for now. Can you open a PR for this?
[GitHub] [incubator-iceberg] rdblue closed issue #555: Iceberg tables should allow for automatic table creation when writing if table not exists already
rdblue closed issue #555: Iceberg tables should allow for automatic table creation when writing if table not exists already
URL: https://github.com/apache/incubator-iceberg/issues/555
[GitHub] [incubator-iceberg] rdblue commented on issue #555: Iceberg tables should allow for automatic table creation when writing if table not exists already
rdblue commented on issue #555: Iceberg tables should allow for automatic table creation when writing if table not exists already
URL: https://github.com/apache/incubator-iceberg/issues/555#issuecomment-543421500

We are planning on adding support for the new logical plans in Spark 3.0. That will include support for common SQL statements, like `CREATE TABLE ... AS SELECT ...` as well as `REPLACE TABLE ... AS SELECT ...`. It will also include support for the new [DataFrameWriterV2](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala) API that can be used for the same operations. The new API looks like this:

```scala
df.writeTo("db.table").append()
df.writeTo("db.table").partitionBy(hours($"ts")).create()
df.writeTo("db.table").partitionBy(hours($"ts")).createOrReplace()
```
[GitHub] [incubator-iceberg] TGooch44 commented on a change in pull request #530: [python] adding Hive package to wrap BaseMetastoreTables/TableOperations
TGooch44 commented on a change in pull request #530: [python] adding Hive package to wrap BaseMetastoreTables/TableOperations URL: https://github.com/apache/incubator-iceberg/pull/530#discussion_r336266729 ## File path: python/iceberg/hive/hive_table_operations.py ## @@ -0,0 +1,59 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# + + +from iceberg.core import BaseMetastoreTableOperations + + +class HiveTableOperations(BaseMetastoreTableOperations): + +def __init__(self, conf, client, database, table): +super(HiveTableOperations, self).__init__(conf) +self._client = client +self.database = database +self.table = table +self.refresh() + +def refresh(self): +with self._client as open_client: +tbl_info = open_client.get_table(self.database, self.table) + +table_type = tbl_info.parameters.get(BaseMetastoreTableOperations.TABLE_TYPE_PROP) + +if table_type is None or table_type.lower() != BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE: +raise RuntimeError("Invalid table, not Iceberg: %s.%s.%s" % (self.database, Review comment: Trying to be too fast...let me add some tests here to catch some of this kind of stuff.
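A minimal sketch of the kind of test the author mentions, with the Hive client mocked so the "not an Iceberg table" error path can be exercised without a metastore. Everything here is a self-contained stand-in written for illustration — `check_iceberg_table` and the constants mirror the names quoted in the diff but are not the real `iceberg` package.

```python
from unittest.mock import MagicMock

# Stand-ins for the constants on BaseMetastoreTableOperations (assumed values).
TABLE_TYPE_PROP = "table_type"
ICEBERG_TABLE_TYPE_VALUE = "iceberg"


def check_iceberg_table(client, database, table):
    """Re-implementation of the validation in refresh(), for test purposes."""
    tbl_info = client.get_table(database, table)
    table_type = tbl_info.parameters.get(TABLE_TYPE_PROP)
    if table_type is None or table_type.lower() != ICEBERG_TABLE_TYPE_VALUE:
        raise RuntimeError("Invalid table, not Iceberg: %s.%s" % (database, table))
    return tbl_info


# A mocked client whose table metadata lacks the Iceberg marker property.
client = MagicMock()
client.get_table.return_value.parameters = {TABLE_TYPE_PROP: "EXTERNAL"}
try:
    check_iceberg_table(client, "db", "tbl")
except RuntimeError as e:
    print(e)  # Invalid table, not Iceberg: db.tbl
```

The same mock pattern extends to asserting the happy path: set `parameters` to `{TABLE_TYPE_PROP: "ICEBERG"}` and check that no exception is raised.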
[GitHub] [incubator-iceberg] jzhuge opened a new pull request #556: Fix Kryo serialization in ParquetUtil.getSplitOffsets
jzhuge opened a new pull request #556: Fix Kryo serialization in ParquetUtil.getSplitOffsets URL: https://github.com/apache/incubator-iceberg/pull/556 Found it during integration with downstream Spark 2.3 branch. Added a unit test.
[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #529: Add hadoop table catalog (WIP)
manishmalhotrawork commented on a change in pull request #529: Add hadoop table catalog (WIP) URL: https://github.com/apache/incubator-iceberg/pull/529#discussion_r336260283 ## File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java ## @@ -0,0 +1,142 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.iceberg.hadoop; + +import com.google.common.base.Preconditions; +import java.io.Closeable; +import java.io.IOException; +import java.util.Map; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.iceberg.BaseMetastoreCatalog; +import org.apache.iceberg.PartitionSpec; +import org.apache.iceberg.Schema; +import org.apache.iceberg.TableMetadata; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.RuntimeIOException; + + +public class HadoopCatalog extends BaseMetastoreCatalog implements Closeable { + private static final String ICEBERG_HADOOP_WAREHOUSE_BASE = "iceberg/warehouse"; + private final Configuration conf; + private String hdfsRoot; + + public HadoopCatalog(Configuration conf) { +this.conf = conf; +hdfsRoot = conf.get("fs.defaultFS"); +Path warehousePath = new Path(hdfsRoot + ICEBERG_HADOOP_WAREHOUSE_BASE); +try { + FileSystem fs = Util.getFs(warehousePath, conf); + if (!fs.isDirectory(warehousePath)) { +if (!fs.mkdirs(warehousePath)) { + throw new IOException("failed to create warehouse for hadoop catalog"); +} + } + this.hdfsRoot = hdfsRoot + "/" + ICEBERG_HADOOP_WAREHOUSE_BASE; +} catch (IOException e) { + throw new RuntimeIOException("failed to create directory for warehouse", e); +} + } + + @Override + public org.apache.iceberg.Table createTable( + TableIdentifier identifier, Schema schema, PartitionSpec spec, String location, Map properties) { +Preconditions.checkArgument(identifier.namespace().levels().length == 1, +"Missing database in table identifier: %s", identifier); +Path tablePath = new Path(defaultWarehouseLocation(identifier)); +try { + FileSystem fs = Util.getFs(tablePath, conf); + if (!fs.isDirectory(tablePath)) { +fs.mkdirs(tablePath); + } else { +throw new AlreadyExistsException("the 
table already exists: " + identifier); + } +} catch (IOException e) { + throw new RuntimeIOException("failed to create directory", e); +} +return super.createTable(identifier, schema, spec, null, properties); + } + + public org.apache.iceberg.Table createTable( + TableIdentifier identifier, Schema schema, PartitionSpec spec, Map properties) { +Preconditions.checkArgument(identifier.namespace().levels().length == 1, +"Missing database in table identifier: %s", identifier); +return createTable(identifier, schema, spec, null, properties); Review comment: I'm sorry for this. When I think more about it, it should be OK, as the parent class signature would not change based on the child class behavior. Also, the parent method expects `location` and `properties` to be null.
[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #529: Add hadoop table catalog (WIP)
manishmalhotrawork commented on a change in pull request #529: Add hadoop table catalog (WIP) URL: https://github.com/apache/incubator-iceberg/pull/529#discussion_r336257210 ## File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java ## @@ -0,0 +1,144 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.iceberg.hadoop; + +import com.google.common.base.Preconditions; +import java.io.Closeable; +import java.io.IOException; +import java.util.Map; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.iceberg.BaseMetastoreCatalog; +import org.apache.iceberg.PartitionSpec; +import org.apache.iceberg.Schema; +import org.apache.iceberg.TableMetadata; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.RuntimeIOException; + + +public class HadoopCatalog extends BaseMetastoreCatalog implements Closeable { + private static final String ICEBERG_HADOOP_WAREHOUSE_BASE = "iceberg/warehouse"; + private final Configuration conf; + private String warehouseUri; + + public HadoopCatalog(Configuration conf, String warehouseUri) { +this.conf = conf; + +if (warehouseUri != null) { + this.warehouseUri = warehouseUri; +} else { + String fsRoot = conf.get("fs.defaultFS"); + Path warehousePath = new Path(fsRoot, ICEBERG_HADOOP_WAREHOUSE_BASE); + try { +FileSystem fs = Util.getFs(warehousePath, conf); +if (!fs.isDirectory(warehousePath)) { + if (!fs.mkdirs(warehousePath)) { +throw new IOException("failed to create warehouse for hadoop catalog"); + } +} +this.warehouseUri = fsRoot + "/" + ICEBERG_HADOOP_WAREHOUSE_BASE; + } catch (IOException e) { +throw new RuntimeIOException("failed to create directory for warehouse", e); + } +} + } + + public HadoopCatalog(Configuration conf) { +this(conf, null); + } + + @Override + public org.apache.iceberg.Table createTable( + TableIdentifier identifier, Schema schema, PartitionSpec spec, Map properties) { +Preconditions.checkArgument(identifier.namespace().levels().length == 1, +"Missing database in table identifier: %s", identifier); +Path tablePath = new Path(defaultWarehouseLocation(identifier)); 
+try { + FileSystem fs = Util.getFs(tablePath, conf); + if (!fs.isDirectory(tablePath)) { +fs.mkdirs(tablePath); + } else { +throw new AlreadyExistsException("the table already exists: " + identifier); + } +} catch (IOException e) { + throw new RuntimeIOException("failed to create directory", e); +} +return super.createTable(identifier, schema, spec, null, properties); Review comment: @chenjunjiedada thanks for taking care of this [comment](https://github.com/apache/incubator-iceberg/pull/529/files/11e4993b0d60d676b09124bea65bf4adc2fe3c21#r334631040) Maybe my understanding is not perfect, so please correct me, but it looks like per this flow we are expecting that a HadoopTable will always be under the Hive warehouse directory, since this call to `BaseMetastoreCatalog` uses `defaultWarehouseLocation`, which uses `hive.metastore.warehouse.dir` to form the final location. Also, for HadoopTables, do we need to set `hive.metastore.warehouse.dir`?
[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #529: Add hadoop table catalog (WIP)
manishmalhotrawork commented on a change in pull request #529: Add hadoop table catalog (WIP) URL: https://github.com/apache/incubator-iceberg/pull/529#discussion_r336255270 ## File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java ## @@ -0,0 +1,144 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.iceberg.hadoop; + +import com.google.common.base.Preconditions; +import java.io.Closeable; +import java.io.IOException; +import java.util.Map; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.iceberg.BaseMetastoreCatalog; +import org.apache.iceberg.PartitionSpec; +import org.apache.iceberg.Schema; +import org.apache.iceberg.TableMetadata; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.RuntimeIOException; + + +public class HadoopCatalog extends BaseMetastoreCatalog implements Closeable { + private static final String ICEBERG_HADOOP_WAREHOUSE_BASE = "iceberg/warehouse"; + private final Configuration conf; + private String warehouseUri; + + public HadoopCatalog(Configuration conf, String warehouseUri) { +this.conf = conf; + +if (warehouseUri != null) { + this.warehouseUri = warehouseUri; +} else { + String fsRoot = conf.get("fs.defaultFS"); + Path warehousePath = new Path(fsRoot, ICEBERG_HADOOP_WAREHOUSE_BASE); + try { +FileSystem fs = Util.getFs(warehousePath, conf); +if (!fs.isDirectory(warehousePath)) { + if (!fs.mkdirs(warehousePath)) { +throw new IOException("failed to create warehouse for hadoop catalog"); + } +} +this.warehouseUri = fsRoot + "/" + ICEBERG_HADOOP_WAREHOUSE_BASE; + } catch (IOException e) { +throw new RuntimeIOException("failed to create directory for warehouse", e); + } +} + } + + public HadoopCatalog(Configuration conf) { +this(conf, null); + } + + @Override + public org.apache.iceberg.Table createTable( Review comment: @chenjunjiedada thanks for taking care. I see the `public org.apache.iceberg.Table createTable( TableIdentifier identifier, Schema schema, PartitionSpec spec, String location, Map properties)` is removed. 
Wondering if the parent class method will still be callable using a `HadoopCatalog` object, and what the behavior would be in that case.
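The behavior being asked about can be illustrated outside of Java. Python has no method overloading, so this is only an analogy, but the key point carries over: removing an overload from a subclass does not hide the parent's method, which remains reachable through a subclass instance. The class and method names here are hypothetical stand-ins, not the Iceberg code.

```python
class BaseCatalog:
    # Plays the role of BaseMetastoreCatalog.createTable(..., location, properties).
    def create_table(self, identifier, schema, spec, location=None, properties=None):
        return ("base", identifier, location)


class HadoopishCatalog(BaseCatalog):
    # The subclass only adds its own no-location form; it does not (and in
    # Java, cannot by omission) remove the inherited signature above.
    def create_table_default(self, identifier, schema, spec, properties=None):
        return super().create_table(identifier, schema, spec, None, properties)


cat = HadoopishCatalog()
# The parent method is still callable -- including with an explicit location,
# which is exactly the case the reviewer is worried about.
print(cat.create_table("db.t", None, None, location="/custom"))  # → ('base', 'db.t', '/custom')
```

In the Java code this means callers holding a `HadoopCatalog` can still invoke the location-taking `createTable` inherited from `BaseMetastoreCatalog`, unless the subclass overrides it to reject a non-null location.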
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #530: [python] adding Hive package to wrap BaseMetastoreTables/TableOperations
rdblue commented on a change in pull request #530: [python] adding Hive package to wrap BaseMetastoreTables/TableOperations URL: https://github.com/apache/incubator-iceberg/pull/530#discussion_r336240323 ## File path: python/iceberg/hive/hive_table_operations.py ## @@ -0,0 +1,59 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# + + +from iceberg.core import BaseMetastoreTableOperations + + +class HiveTableOperations(BaseMetastoreTableOperations): + +def __init__(self, conf, client, database, table): +super(HiveTableOperations, self).__init__(conf) +self._client = client +self.database = database +self.table = table +self.refresh() + +def refresh(self): +with self._client as open_client: +tbl_info = open_client.get_table(self.database, self.table) + +table_type = tbl_info.parameters.get(BaseMetastoreTableOperations.TABLE_TYPE_PROP) + +if table_type is None or table_type.lower() != BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE: +raise RuntimeError("Invalid table, not Iceberg: %s.%s.%s" % (self.database, Review comment: Looks like the format string wasn't updated. It still has 3 `%s`.
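A short sketch of the bug the reviewer is pointing at: with printf-style `%` formatting, a placeholder count that no longer matches the argument tuple raises a `TypeError` at the moment the error message is built, which would mask the intended "not Iceberg" error. The string below mirrors the one in the diff; the variable names are illustrative.

```python
database, table = "db", "events"

# Three %s placeholders but only two arguments: formatting itself fails.
try:
    msg = "Invalid table, not Iceberg: %s.%s.%s" % (database, table)
except TypeError as e:
    print(e)  # not enough arguments for format string

# Keeping placeholders and arguments in sync produces the intended message.
msg = "Invalid table, not Iceberg: %s.%s" % (database, table)
print(msg)  # Invalid table, not Iceberg: db.events
```

This is why a unit test over the error path (rather than only the happy path) catches the mismatch: the test has to force the `raise` to execute before the formatting bug shows up.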
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336238243 ## File path: python/README.md ## @@ -15,6 +15,26 @@ - limitations under the License. --> -# Iceberg -A python implementation of the Iceberg table format. -See the project level README for more details: https://github.com/apache/incubator-iceberg +# Iceberg Python + +Iceberg is a python library for programatic access to iceberg table metadata as well as data access. The intention is to provide a functional subset of the java library. + +## Getting Started + +We are not currently publishing to PyPi so the best way to install the library is to clone the git repo and do a pip install -e + +``` +git clone https://github.com/apache/incubator-iceberg.git +cd incubator-iceberg/python +pip install -e . Review comment: Did the other changes to this file make it? Looks like the empty line is still there and I don't see the test instructions.
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336238079 ## File path: site/docs/python-quickstart.md ## @@ -0,0 +1,40 @@ + + +# Examples + +## Inspect Table Metadata Review comment: Sounds a little scary to me. We just want to make it clear that this isn't how to use an official release.
[GitHub] [incubator-iceberg] goldentriangle opened a new issue #555: Iceberg tables should allow for automatic table creation when writing if table not exists already
goldentriangle opened a new issue #555: Iceberg tables should allow for automatic table creation when writing if table not exists already URL: https://github.com/apache/incubator-iceberg/issues/555 I think this is a special case of https://github.com/apache/incubator-iceberg/issues/540. When writing a Spark dataframe into an Iceberg table, if the table doesn't exist, Iceberg should create the table/schema automatically/implicitly.
[GitHub] [incubator-iceberg] TGooch44 commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
TGooch44 commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336134669 ## File path: site/docs/python-quickstart.md ## @@ -0,0 +1,40 @@ + + +# Examples + +## Inspect Table Metadata Review comment: added the following text: > Iceberg python is currently in development, and as such, should __only__ be used for development and testing purposes until an official release has been made. > > As such, we are not currently publishing to PyPi so the best way to install the library is to perform the following steps: Let me know if that sounds ok.
[GitHub] [incubator-iceberg] TGooch44 commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
TGooch44 commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336131362 ## File path: python/README.md ## @@ -15,6 +15,26 @@ - limitations under the License. --> -# Iceberg -A python implementation of the Iceberg table format. -See the project level README for more details: https://github.com/apache/incubator-iceberg +# Iceberg Python + +Iceberg is a python library for programatic access to iceberg table metadata as well as data access. The intention is to provide a functional subset of the java library. + +## Getting Started + +We are not currently publishing to PyPi so the best way to install the library is to clone the git repo and do a pip install -e + +``` +git clone https://github.com/apache/incubator-iceberg.git +cd incubator-iceberg/python +pip install -e . Review comment: added tox instructions, let me know if that looks ok.
[GitHub] [incubator-iceberg] TGooch44 commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
TGooch44 commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336131555 ## File path: site/docs/python-api-intro.md ## @@ -0,0 +1,143 @@ + + +# Iceberg Python API + +Much of the python api conforms to the java api. You can get more info about the java api [here](https://iceberg.apache.org/api/). + + +## Tables + +The Table interface provides access to table metadata + ++ schema returns the current table schema ++ spec returns the current table partition spec ++ properties returns a map of key-value properties ++ currentSnapshot returns the current table snapshot ++ snapshots returns all valid snapshots for the table ++ snapshot(id) returns a specific snapshot by ID ++ location returns the table’s base location + +Tables also provide refresh to update the table to the latest version. + +### Scanning +Iceberg table scans start by creating a TableScan object with newScan. + +``` python +scan = table.new_scan(); +``` + +To configure a scan, call filter and select on the TableScan to get a new TableScan with those changes. + +``` python +filtered_scan = scan.filter(Expressions.equal("id", 5)) +``` + +String expressions can also be passed to the filter method. + +``` python +filtered_scan = scan.filter("id=5") +``` + +Schema projections can be applied against a TableScan by passing a list of column names. + +``` python +filtered_scan = scan.select(["col_1", "col_2", "col_3"]) +``` + +Because some data types cannot be read using the python library, a convenience method for excluding columns from projection is provided. + +``` python +filtered_scan = scan.select_except(["unsupported_col_1", "unsupported_col_2"]) +``` + + +Calls to configuration methods create a new TableScan so that each TableScan is immutable. + +When a scan is configured, planFiles, planTasks, and schema are used to return files, tasks, and the read projection. 
+ +``` python +scan = table.new_scan() \ +.filter("id=5") \ +.select(["id", "data"]) + +projection = scan.schema +for task in scan.plan_tasks(): +print(task) +``` + +## Types + +Iceberg data types are located in iceberg.api.types.types + +### Primitives + +Primitive type instances are available from static methods in each type class. Types without parameters use get, and types like __decimal__ use factory methods: + +```python +IntegerType.get()# int +DoubleType.get() # double +DecimalType.of(9, 2) # decimal(9, 2) +``` + +### Nested types +Structs, maps, and lists are created using factory methods in type classes. + +Like struct fields, map keys or values and list elements are tracked as nested fields. Nested fields track [field IDs](https://iceberg.apache.org/evolution/#correctness) and nullability. + +Struct fields are created using __NestedField.optional__ or __NestedField.required__. Map value and list element nullability is set in the map and list factory methods. Review comment: tried to match this up. let me know if it looks better.
[GitHub] [incubator-iceberg] TGooch44 commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
TGooch44 commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336131620 ## File path: site/docs/python-api-intro.md ## @@ -0,0 +1,143 @@ + + +# Iceberg Python API + +Much of the python api conforms to the java api. You can get more info about the java api [here](https://iceberg.apache.org/api/). + + +## Tables + +The Table interface provides access to table metadata + ++ schema returns the current table schema ++ spec returns the current table partition spec ++ properties returns a map of key-value properties ++ currentSnapshot returns the current table snapshot ++ snapshots returns all valid snapshots for the table ++ snapshot(id) returns a specific snapshot by ID ++ location returns the table’s base location + +Tables also provide refresh to update the table to the latest version. + +### Scanning +Iceberg table scans start by creating a TableScan object with newScan. + +``` python +scan = table.new_scan(); +``` + +To configure a scan, call filter and select on the TableScan to get a new TableScan with those changes. + +``` python +filtered_scan = scan.filter(Expressions.equal("id", 5)) +``` + +String expressions can also be passed to the filter method. + +``` python +filtered_scan = scan.filter("id=5") +``` + +Schema projections can be applied against a TableScan by passing a list of column names. + +``` python +filtered_scan = scan.select(["col_1", "col_2", "col_3"]) +``` + +Because some data types cannot be read using the python library, a convenience method for excluding columns from projection is provided. + +``` python +filtered_scan = scan.select_except(["unsupported_col_1", "unsupported_col_2"]) +``` + + +Calls to configuration methods create a new TableScan so that each TableScan is immutable. + +When a scan is configured, planFiles, planTasks, and schema are used to return files, tasks, and the read projection. 
+ +``` python +scan = table.new_scan() \ +.filter("id=5") \ +.select(["id", "data"]) + +projection = scan.schema +for task in scan.plan_tasks(): +print(task) +``` + +## Types + +Iceberg data types are located in iceberg.api.types.types + +### Primitives + +Primitive type instances are available from static methods in each type class. Types without parameters use get, and types like __decimal__ use factory methods: + +```python +IntegerType.get()# int +DoubleType.get() # double +DecimalType.of(9, 2) # decimal(9, 2) +``` + +### Nested types +Structs, maps, and lists are created using factory methods in type classes. + +Like struct fields, map keys or values and list elements are tracked as nested fields. Nested fields track [field IDs](https://iceberg.apache.org/evolution/#correctness) and nullability. + +Struct fields are created using __NestedField.optional__ or __NestedField.required__. Map value and list element nullability is set in the map and list factory methods. + +```python +# struct<1 id: int, 2 data: optional string> +struct = StructType.of([NestedField.required(1, "id", IntegerType.get()), +NestedField.optional(2, "data", StringType.get()]) + ) +``` +```python +# map<1 key: int, 2 value: optional string> +map_var = MapType.of_optional(1, IntegerType.get(), + 2, StringType.get()) +``` +```python +# array<1 element: int> +list_var = ListType.of_required(1, IntegerType.get()); +``` + +## Expressions +Iceberg’s expressions are used to configure table scans. To create expressions, use the factory methods in Expressions. + +Supported predicate expressions are: + ++ __is_null__ Review comment: ditto for the above comment This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
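The TableScan immutability described in the quoted doc — each configuration call returns a new scan and never mutates the receiver — can be sketched with a toy stand-in class (this is not the real iceberg library, just an illustration of the pattern; method names follow the doc):

```python
class TableScan:
    """Toy immutable scan: configuration methods return new instances."""

    def __init__(self, filters=(), columns=None):
        self._filters = tuple(filters)
        self._columns = columns

    def filter(self, expr):
        # Return a new scan with the extra predicate; self is untouched.
        return TableScan(self._filters + (expr,), self._columns)

    def select(self, columns):
        # Return a new scan with the projection; self is untouched.
        return TableScan(self._filters, tuple(columns))

    def __repr__(self):
        return f"TableScan(filters={self._filters}, columns={self._columns})"


scan = TableScan()
filtered = scan.filter("id=5").select(["id", "data"])
print(scan)      # the original scan is unchanged
print(filtered)
```

Because every call returns a fresh object, partially configured scans can be shared safely across callers.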
[GitHub] [incubator-iceberg] TGooch44 commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
TGooch44 commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336131188 ## File path: python/README.md ## @@ -15,6 +15,26 @@

 - limitations under the License.
 -->
-# Iceberg
-A python implementation of the Iceberg table format.
-See the project level README for more details: https://github.com/apache/incubator-iceberg
+# Iceberg Python
+
+Iceberg is a Python library for programmatic access to Iceberg table metadata as well as data access. The intention is to provide a functional subset of the Java library.
+
+## Getting Started

Review comment: added
[GitHub] [incubator-iceberg] jzhuge commented on issue #446: KryoException when writing Iceberg tables in Spark
jzhuge commented on issue #446: KryoException when writing Iceberg tables in Spark URL: https://github.com/apache/incubator-iceberg/issues/446#issuecomment-543276340 @aokolnychyi @shardulm94 @rdsr please take a look at a custom Spark Kryo registrator for Iceberg in #549.
[GitHub] [incubator-iceberg] rdblue commented on issue #550: Bump ORC from 1.5.5 to 1.5.6
rdblue commented on issue #550: Bump ORC from 1.5.5 to 1.5.6 URL: https://github.com/apache/incubator-iceberg/pull/550#issuecomment-543261091 Thanks, @Fokko!
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #543: Avoid NullPointerException in FindFiles when there is no snapshot
rdblue commented on a change in pull request #543: Avoid NullPointerException in FindFiles when there is no snapshot URL: https://github.com/apache/incubator-iceberg/pull/543#discussion_r336111263 ## File path: core/src/main/java/org/apache/iceberg/FindFiles.java ## @@ -191,7 +191,10 @@ public Builder inPartitions(PartitionSpec spec, List partitions) {

   Snapshot snapshot = snapshotId != null ?
       ops.current().snapshot(snapshotId) : ops.current().currentSnapshot();
-  CloseableIterable entries = new ManifestGroup(ops, snapshot.manifests())
+  // snapshot could be null when the table just gets created
+  Iterable manifests = (snapshot != null) ? snapshot.manifests() : CloseableIterable.empty();
+
+  CloseableIterable entries = new ManifestGroup(ops, manifests)

Review comment: If there are no manifests, then entries should be `CloseableIterable.empty()`, not the manifest iterable. That doesn't need to be closeable.
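The pattern rdblue asks for — fall back to an empty iterable when there is no snapshot, so downstream code needs no null check — can be sketched in Python (hypothetical names; the real fix is in Java's `FindFiles` using `CloseableIterable.empty()`):

```python
from typing import Iterable, Optional


def manifests_for(snapshot: Optional[object]) -> Iterable[str]:
    """Return the snapshot's manifests, or an empty iterable when the
    table was just created and has no snapshot yet."""
    # Falling back to an empty tuple mirrors CloseableIterable.empty():
    # callers can iterate unconditionally instead of testing for None.
    return snapshot.manifests() if snapshot is not None else ()


class FakeSnapshot:
    """Stand-in for an Iceberg Snapshot in this sketch."""

    def manifests(self):
        return ["m1.avro", "m2.avro"]


print(list(manifests_for(FakeSnapshot())))
print(list(manifests_for(None)))
```

Returning an empty collection rather than `None` keeps the "no snapshot" case on the same code path as the normal one.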
[GitHub] [incubator-iceberg] rdblue merged pull request #550: Bump ORC from 1.5.5 to 1.5.6
rdblue merged pull request #550: Bump ORC from 1.5.5 to 1.5.6 URL: https://github.com/apache/incubator-iceberg/pull/550
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336110220 ## File path: site/docs/python-quickstart.md ## @@ -0,0 +1,40 @@

+# Examples
+
+## Inspect Table Metadata

Review comment: It would be good to have the information on how to install the library in a section here. In user-facing docs like this, we need to be clear that installing from master is for development and testing purposes. We can't recommend using code unless it is a released version. That means the wording should be something like "Iceberg for Python is not yet released and published to PyPI. To try out the python library, you can install it using `pip -e`: ..."
[GitHub] [incubator-iceberg] rdblue commented on issue #551: [python] First add to docs, addresses #323 and #363
rdblue commented on issue #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#issuecomment-543259905 Thanks, @TGooch44! Great to see Python docs!
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336108392 ## File path: site/docs/python-api-intro.md ## @@ -0,0 +1,143 @@

+Supported predicate expressions are:
+
++ __is_null__

Review comment: Could you use fixed-width here instead of bold?
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336108513 ## File path: site/docs/python-api-intro.md ## @@ -0,0 +1,143 @@

+The Table interface provides access to table metadata
+
++ schema returns the current table schema

Review comment: Using a fixed-width font here for method names would assist readability.
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336108312 ## File path: site/docs/python-api-intro.md ## @@ -0,0 +1,143 @@

+Struct fields are created using __NestedField.optional__ or __NestedField.required__. Map value and list element nullability is set in the map and list factory methods.

Review comment: For method names, we typically use fixed-width font, like this: ``` ... using `NestedField.optional` or `NestedField.required`. Map value ... ```
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336107609 ## File path: python/README.md ## @@ -15,6 +15,26 @@

+git clone https://github.com/apache/incubator-iceberg.git
+cd incubator-iceberg/python
+pip install -e .

Review comment: This doesn't quite resolve #323 because it doesn't document how to run python tests. Could you add a section for that?
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363
rdblue commented on a change in pull request #551: [python] First add to docs, addresses #323 and #363 URL: https://github.com/apache/incubator-iceberg/pull/551#discussion_r336107337 ## File path: python/README.md ## @@ -15,6 +15,26 @@

+git clone https://github.com/apache/incubator-iceberg.git
+cd incubator-iceberg/python
+pip install -e .
+

Review comment: Nit: empty line.
[GitHub] [incubator-iceberg] rdblue merged pull request #554: Fix IllegalArgumentException in DataFiles.Builder.withPartitionPath
rdblue merged pull request #554: Fix IllegalArgumentException in DataFiles.Builder.withPartitionPath URL: https://github.com/apache/incubator-iceberg/pull/554
[GitHub] [incubator-iceberg] jzhuge edited a comment on issue #554: Fix IllegalArgumentException in DataFiles.Builder.withPartitionPath
jzhuge edited a comment on issue #554: Fix IllegalArgumentException in DataFiles.Builder.withPartitionPath URL: https://github.com/apache/incubator-iceberg/pull/554#issuecomment-543249742 @rdblue When you merged #57 into "rblue/iceberg" branch in commit 22d802aca84f27be4e95bda2030ca7f423e854fc on Mar 13th, did you add the changes to DataFiles.Builder.withPartitionPath? I have the suspicion because they were not in @aokolnychyi's commit 234f49ffdbae82566ef8971679576d8702571fd6 merged into master.
[GitHub] [incubator-iceberg] jzhuge commented on issue #554: Fix IllegalArgumentException in DataFiles.Builder.withPartitionPath
jzhuge commented on issue #554: Fix IllegalArgumentException in DataFiles.Builder.withPartitionPath URL: https://github.com/apache/incubator-iceberg/pull/554#issuecomment-543249742 @rdblue When you merged #57 into "rblue/iceberg" branch in commit 22d802aca84f27be4e95bda2030ca7f423e854fc on Mar 13th, did you add the changes to DataFiles.Builder.withPartitionPath? I have the suspicion because they were not in @aokolnychyi's original commit for #57.
[GitHub] [incubator-iceberg] jzhuge commented on issue #554: Fix IllegalArgumentException in DataFiles.Builder.withPartitionPath
jzhuge commented on issue #554: Fix IllegalArgumentException in DataFiles.Builder.withPartitionPath URL: https://github.com/apache/incubator-iceberg/pull/554#issuecomment-543248490 @rdblue this PR is probably no longer necessary because of #507, right?
[GitHub] [incubator-iceberg] jzhuge commented on a change in pull request #549: Add Spark custom Kryo registrator
jzhuge commented on a change in pull request #549: Add Spark custom Kryo registrator URL: https://github.com/apache/incubator-iceberg/pull/549#discussion_r336087118 ## File path: build.gradle ## @@ -429,6 +429,8 @@ project(':iceberg-spark') {

   compile project(':iceberg-parquet')
   compile project(':iceberg-hive')
+  compile 'de.javakaffee:kryo-serializers'

Review comment: Added additional LICENSE and NOTICE.
[GitHub] [incubator-iceberg] aokolnychyi edited a comment on issue #553: Spark ReadTask is expensive to serialize
aokolnychyi edited a comment on issue #553: Spark ReadTask is expensive to serialize URL: https://github.com/apache/incubator-iceberg/issues/553#issuecomment-543199975 As a short-term solution, we can broadcast `EncryptionManager` and `FileIO` in `IcebergSource`. Then `Reader` and `ReadTask` can store references to the broadcasted values and fetch actual ones in `createPartitionReader` while creating `TaskDataReader`. This seems to solve the scheduler delay issue. @rdblue thoughts?
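The broadcast approach aokolnychyi sketches — have each task carry only a cheap handle and resolve the heavy object (stand-ins for `FileIO`/`EncryptionManager`) when the reader is created — can be illustrated outside Spark with a toy registry. All names below are hypothetical; this is not Spark's or Iceberg's API:

```python
import pickle

# Registry standing in for Spark's broadcast machinery: tasks carry only
# a small integer id, never the heavy object itself.
_BROADCASTS = {}


class Broadcast:
    def __init__(self, broadcast_id):
        self.broadcast_id = broadcast_id  # the only state that gets pickled

    @property
    def value(self):
        # Resolved lazily, e.g. inside createPartitionReader on the executor.
        return _BROADCASTS[self.broadcast_id]


def broadcast(obj):
    bid = len(_BROADCASTS)
    _BROADCASTS[bid] = obj
    return Broadcast(bid)


class ReadTask:
    def __init__(self, file_path, io_broadcast):
        self.file_path = file_path
        self.io = io_broadcast  # cheap to serialize: just the handle


heavy_io = {"config": "x" * 10_000}  # stand-in for an expensive FileIO
task = ReadTask("part-0.parquet", broadcast(heavy_io))

# The serialized task stays small because only the handle is pickled.
print(len(pickle.dumps(task)) < len(pickle.dumps(heavy_io)))
```

In real Spark the broadcast value is shipped to executors once and cached there, so every `ReadTask` avoids re-serializing it.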
[GitHub] [incubator-iceberg] andrei-ionescu commented on issue #510: Cannot update an Iceberg dataset from a Parquet file due to "field should be required, but is optional"
andrei-ionescu commented on issue #510: Cannot update an Iceberg dataset from a Parquet file due to "field should be required, but is optional" URL: https://github.com/apache/incubator-iceberg/issues/510#issuecomment-543192332 @rdsr Given two different locations of data (`hdfs://host_1/input/data/` and `hdfs://host_2/input/data/`), how would you move the `day=2019-06-01` partition from **host_1** to **host_2** applying some transformations (host_1 data is parquet format, host_2 data is iceberg format)?
[GitHub] [incubator-iceberg] aokolnychyi commented on issue #553: Spark ReadTask is expensive to serialize
aokolnychyi commented on issue #553: Spark ReadTask is expensive to serialize URL: https://github.com/apache/incubator-iceberg/issues/553#issuecomment-543149968 I can confirm the issue is resolved if we avoid serializing `FileIO`. The main question is how to achieve that with minimum changes.
[GitHub] [incubator-iceberg] jzhuge opened a new pull request #554: Fix IllegalArgumentException in DataFiles.Builder.withPartitionPath
jzhuge opened a new pull request #554: Fix IllegalArgumentException in DataFiles.Builder.withPartitionPath URL: https://github.com/apache/incubator-iceberg/pull/554 DataFiles.fillFromPath threw "Invalid partition data, too many fields (expecting 0)" when the path is empty. The fix was in Anton's #57 but somehow got lost. The ugly `var` code can be removed from SparkDataFile.toDataFile.
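The failure mode jzhuge describes — an empty partition path parsed as one bogus field, triggering "too many fields (expecting 0)" — can be illustrated with a small Python sketch of `fillFromPath`-style parsing (hypothetical helper; the actual fix is in Java's `DataFiles`):

```python
def parse_partition_path(path: str) -> dict:
    """Parse 'day=2019-06-01/hour=12' into {'day': ..., 'hour': ...}.

    Guarding the empty string matters: ''.split('/') yields [''], one
    bogus field, which is the "too many fields (expecting 0)" failure
    mode for unpartitioned tables.
    """
    if not path:
        return {}
    fields = {}
    for part in path.split("/"):
        key, _, value = part.partition("=")
        fields[key] = value
    return fields


print(parse_partition_path("day=2019-06-01/hour=12"))
print(parse_partition_path(""))  # unpartitioned table: no fields
```

Without the empty-string guard, an unpartitioned table's empty path would produce one field where the spec expects zero.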