[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user asfgit closed the pull request at: https://github.com/apache/carbondata/pull/2200 ---
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183212032 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java --- @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.carbondata.datamap.bloom; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.IOException; +import java.io.ObjectOutputStream; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datamap.DataMapMeta; +import org.apache.carbondata.core.datamap.Segment; +import org.apache.carbondata.core.datamap.dev.DataMapWriter; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.datastore.page.ColumnPage; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.hash.BloomFilter; +import com.google.common.hash.Funnels; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +@InterfaceAudience.Internal +public class BloomDataMapWriter extends DataMapWriter { + /** + * suppose one blocklet contains 20 page and all the indexed value is distinct. + * later we can make it configurable. 
+ */ + private static final int BLOOM_FILTER_SIZE = 32000 * 20; + private String dataMapName; + private List indexedColumns; + // map column name to ordinal in pages + private Mapcol2Ordianl; + private Map col2DataType; + private String currentBlockId; + private int currentBlockletId; + private List currentDMFiles; + private List currentDataOutStreams; + private List currentObjectOutStreams; + private List > indexBloomFilters; + + public BloomDataMapWriter(AbsoluteTableIdentifier identifier, DataMapMeta dataMapMeta, + Segment segment, String writeDirectoryPath) { +super(identifier, segment, writeDirectoryPath); +dataMapName = dataMapMeta.getDataMapName(); +indexedColumns = dataMapMeta.getIndexedColumns(); +col2Ordianl = new HashMap (indexedColumns.size()); +col2DataType = new HashMap (indexedColumns.size()); + +currentDMFiles = new ArrayList(indexedColumns.size()); +currentDataOutStreams = new ArrayList(indexedColumns.size()); +currentObjectOutStreams = new ArrayList(indexedColumns.size()); + +indexBloomFilters = new ArrayList >(indexedColumns.size()); + } + + @Override + public void onBlockStart(String blockId, long taskId) throws IOException { +this.currentBlockId = blockId; +this.currentBlockletId = 0; +currentDMFiles.clear(); +currentDataOutStreams.clear(); +currentObjectOutStreams.clear(); +initDataMapFile(); + } + + @Override + public void onBlockEnd(String blockId) throws IOException { +for (int indexColId = 0; indexColId < indexedColumns.size(); indexColId++) { + CarbonUtil.closeStreams(this.currentDataOutStreams.get(indexColId), + this.currentObjectOutStreams.get(indexColId)); + commitFile(this.currentDMFiles.get(indexColId)); +} + } + + @Override public void onBlockletStart(int blockletId) { +this.currentBlockletId = blockletId; +indexBloomFilters.clear(); +for (int i = 0; i < indexedColumns.size(); i++) { + indexBloomFilters.add(BloomFilter.create(Funnels.byteArrayFunnel(), + BLOOM_FILTER_SIZE, 0.1d)); +} + } + +
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183211728 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java --- @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.carbondata.datamap.bloom; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.IOException; +import java.io.ObjectOutputStream; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datamap.DataMapMeta; +import org.apache.carbondata.core.datamap.Segment; +import org.apache.carbondata.core.datamap.dev.DataMapWriter; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.datastore.page.ColumnPage; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.hash.BloomFilter; +import com.google.common.hash.Funnels; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +@InterfaceAudience.Internal +public class BloomDataMapWriter extends DataMapWriter { + /** + * suppose one blocklet contains 20 page and all the indexed value is distinct. + * later we can make it configurable. + */ + private static final int BLOOM_FILTER_SIZE = 32000 * 20; --- End diff -- Yeah, it is used to control the rate. I'll make a default value for this. ---
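The "default value" mentioned in this reply could be wired up roughly as follows. This is a minimal sketch, not code from the pull request: the class `BloomSizeConfig`, the method `resolveBloomSize`, and the property key `bloom_size` are all hypothetical names chosen for illustration. Only the default of `32000 * 20` comes from the quoted diff.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: neither this class nor the property key appears in the pull request.
final class BloomSizeConfig {
  // Default taken from the constant in the quoted diff (20 pages of 32000 values each).
  static final int DEFAULT_BLOOM_FILTER_SIZE = 32000 * 20;
  // Assumed property key, for illustration only.
  static final String BLOOM_SIZE_KEY = "bloom_size";

  private BloomSizeConfig() { }

  // Returns the configured size, or the default when the property is absent or blank.
  static int resolveBloomSize(Map<String, String> dmProperties) {
    String raw = dmProperties.get(BLOOM_SIZE_KEY);
    if (raw == null || raw.trim().isEmpty()) {
      return DEFAULT_BLOOM_FILTER_SIZE;
    }
    int size;
    try {
      size = Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("Invalid value for " + BLOOM_SIZE_KEY + ": " + raw, e);
    }
    if (size <= 0) {
      throw new IllegalArgumentException(BLOOM_SIZE_KEY + " must be positive, got " + size);
    }
    return size;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    System.out.println(resolveBloomSize(props));   // falls back to the default
    props.put(BLOOM_SIZE_KEY, "100000");
    System.out.println(resolveBloomSize(props));   // uses the configured value
  }
}
```

Rejecting non-positive or non-numeric values at this point keeps a bad property from surfacing later as an obscure failure when the filter is created.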
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183211712 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java --- @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.carbondata.datamap.bloom; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.IOException; +import java.io.ObjectOutputStream; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datamap.DataMapMeta; +import org.apache.carbondata.core.datamap.Segment; +import org.apache.carbondata.core.datamap.dev.DataMapWriter; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.datastore.page.ColumnPage; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.hash.BloomFilter; +import com.google.common.hash.Funnels; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +@InterfaceAudience.Internal +public class BloomDataMapWriter extends DataMapWriter { + /** + * suppose one blocklet contains 20 page and all the indexed value is distinct. + * later we can make it configurable. 
+ */ + private static final int BLOOM_FILTER_SIZE = 32000 * 20; + private String dataMapName; + private List indexedColumns; + // map column name to ordinal in pages + private Mapcol2Ordianl; + private Map col2DataType; + private String currentBlockId; + private int currentBlockletId; + private List currentDMFiles; + private List currentDataOutStreams; + private List currentObjectOutStreams; + private List > indexBloomFilters; + + public BloomDataMapWriter(AbsoluteTableIdentifier identifier, DataMapMeta dataMapMeta, + Segment segment, String writeDirectoryPath) { +super(identifier, segment, writeDirectoryPath); +dataMapName = dataMapMeta.getDataMapName(); +indexedColumns = dataMapMeta.getIndexedColumns(); +col2Ordianl = new HashMap (indexedColumns.size()); +col2DataType = new HashMap (indexedColumns.size()); + +currentDMFiles = new ArrayList(indexedColumns.size()); +currentDataOutStreams = new ArrayList(indexedColumns.size()); +currentObjectOutStreams = new ArrayList(indexedColumns.size()); + +indexBloomFilters = new ArrayList >(indexedColumns.size()); + } + + @Override + public void onBlockStart(String blockId, long taskId) throws IOException { +this.currentBlockId = blockId; +this.currentBlockletId = 0; +currentDMFiles.clear(); +currentDataOutStreams.clear(); +currentObjectOutStreams.clear(); +initDataMapFile(); + } + + @Override + public void onBlockEnd(String blockId) throws IOException { +for (int indexColId = 0; indexColId < indexedColumns.size(); indexColId++) { + CarbonUtil.closeStreams(this.currentDataOutStreams.get(indexColId), + this.currentObjectOutStreams.get(indexColId)); + commitFile(this.currentDMFiles.get(indexColId)); +} + } + + @Override public void onBlockletStart(int blockletId) { +this.currentBlockletId = blockletId; +indexBloomFilters.clear(); +for (int i = 0; i < indexedColumns.size(); i++) { + indexBloomFilters.add(BloomFilter.create(Funnels.byteArrayFunnel(), + BLOOM_FILTER_SIZE, 0.1d)); +} + } +
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183211031 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java --- @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.carbondata.datamap.bloom; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.IOException; +import java.io.ObjectOutputStream; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datamap.DataMapMeta; +import org.apache.carbondata.core.datamap.Segment; +import org.apache.carbondata.core.datamap.dev.DataMapWriter; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.datastore.page.ColumnPage; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.hash.BloomFilter; +import com.google.common.hash.Funnels; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +@InterfaceAudience.Internal +public class BloomDataMapWriter extends DataMapWriter { + /** + * suppose one blocklet contains 20 page and all the indexed value is distinct. + * later we can make it configurable. + */ + private static final int BLOOM_FILTER_SIZE = 32000 * 20; --- End diff -- Can you add a DMPROPERTY for it? Does it control the bloom filter size? ---
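On the question of what the constant controls: in the quoted diff it is passed as the second argument to Guava's `BloomFilter.create`, which is the expected number of insertions, while `0.1d` is the desired false positive probability. The bit array size follows from both values. The sketch below applies the textbook sizing formulas to the numbers in the diff; the class and method names are illustrative, and Guava's internal rounding may differ slightly.

```java
// Sketch of the textbook bloom filter sizing formulas; names are illustrative only.
final class BloomSizing {
  private BloomSizing() { }

  // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
  static long optimalNumOfBits(long expectedInsertions, double fpp) {
    return (long) (-expectedInsertions * Math.log(fpp) / (Math.log(2) * Math.log(2)));
  }

  // Optimal number of hash functions: k = (m / n) * ln 2, and never fewer than one.
  static int optimalNumOfHashFunctions(long expectedInsertions, long numBits) {
    return Math.max(1, (int) Math.round((double) numBits / expectedInsertions * Math.log(2)));
  }

  public static void main(String[] args) {
    long n = 32000L * 20;  // BLOOM_FILTER_SIZE from the quoted diff
    double p = 0.1d;       // false positive probability from the quoted diff
    long bits = optimalNumOfBits(n, p);
    int hashes = optimalNumOfHashFunctions(n, bits);
    System.out.println("bits=" + bits + ", KiB=" + (bits / 8 / 1024) + ", hashes=" + hashes);
  }
}
```

With these inputs the formulas give roughly 3 million bits per filter, a few hundred KiB, for each indexed column in each blocklet. Lowering the false positive probability increases that footprint, which is one reason to expose both values as properties.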
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183210997 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java --- @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.carbondata.datamap.bloom; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.IOException; +import java.io.ObjectOutputStream; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datamap.DataMapMeta; +import org.apache.carbondata.core.datamap.Segment; +import org.apache.carbondata.core.datamap.dev.DataMapWriter; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.datastore.page.ColumnPage; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.hash.BloomFilter; +import com.google.common.hash.Funnels; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +@InterfaceAudience.Internal +public class BloomDataMapWriter extends DataMapWriter { + /** + * suppose one blocklet contains 20 page and all the indexed value is distinct. + * later we can make it configurable. 
+ */ + private static final int BLOOM_FILTER_SIZE = 32000 * 20; + private String dataMapName; + private List indexedColumns; + // map column name to ordinal in pages + private Mapcol2Ordianl; + private Map col2DataType; + private String currentBlockId; + private int currentBlockletId; + private List currentDMFiles; + private List currentDataOutStreams; + private List currentObjectOutStreams; + private List > indexBloomFilters; + + public BloomDataMapWriter(AbsoluteTableIdentifier identifier, DataMapMeta dataMapMeta, + Segment segment, String writeDirectoryPath) { +super(identifier, segment, writeDirectoryPath); +dataMapName = dataMapMeta.getDataMapName(); +indexedColumns = dataMapMeta.getIndexedColumns(); +col2Ordianl = new HashMap (indexedColumns.size()); +col2DataType = new HashMap (indexedColumns.size()); + +currentDMFiles = new ArrayList(indexedColumns.size()); +currentDataOutStreams = new ArrayList(indexedColumns.size()); +currentObjectOutStreams = new ArrayList(indexedColumns.size()); + +indexBloomFilters = new ArrayList >(indexedColumns.size()); + } + + @Override + public void onBlockStart(String blockId, long taskId) throws IOException { +this.currentBlockId = blockId; +this.currentBlockletId = 0; +currentDMFiles.clear(); +currentDataOutStreams.clear(); +currentObjectOutStreams.clear(); +initDataMapFile(); + } + + @Override + public void onBlockEnd(String blockId) throws IOException { +for (int indexColId = 0; indexColId < indexedColumns.size(); indexColId++) { + CarbonUtil.closeStreams(this.currentDataOutStreams.get(indexColId), + this.currentObjectOutStreams.get(indexColId)); + commitFile(this.currentDMFiles.get(indexColId)); +} + } + + @Override public void onBlockletStart(int blockletId) { +this.currentBlockletId = blockletId; +indexBloomFilters.clear(); +for (int i = 0; i < indexedColumns.size(); i++) { + indexBloomFilters.add(BloomFilter.create(Funnels.byteArrayFunnel(), + BLOOM_FILTER_SIZE, 0.1d)); +} + } + +
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183210908 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java --- @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.carbondata.datamap.bloom; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.IOException; +import java.io.ObjectOutputStream; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datamap.DataMapMeta; +import org.apache.carbondata.core.datamap.Segment; +import org.apache.carbondata.core.datamap.dev.DataMapWriter; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.datastore.page.ColumnPage; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.hash.BloomFilter; +import com.google.common.hash.Funnels; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +@InterfaceAudience.Internal +public class BloomDataMapWriter extends DataMapWriter { + /** + * suppose one blocklet contains 20 page and all the indexed value is distinct. + * later we can make it configurable. 
+ */ + private static final int BLOOM_FILTER_SIZE = 32000 * 20; + private String dataMapName; + private List indexedColumns; + // map column name to ordinal in pages + private Mapcol2Ordianl; + private Map col2DataType; + private String currentBlockId; + private int currentBlockletId; + private List currentDMFiles; + private List currentDataOutStreams; + private List currentObjectOutStreams; + private List > indexBloomFilters; + + public BloomDataMapWriter(AbsoluteTableIdentifier identifier, DataMapMeta dataMapMeta, + Segment segment, String writeDirectoryPath) { +super(identifier, segment, writeDirectoryPath); +dataMapName = dataMapMeta.getDataMapName(); +indexedColumns = dataMapMeta.getIndexedColumns(); +col2Ordianl = new HashMap (indexedColumns.size()); +col2DataType = new HashMap (indexedColumns.size()); + +currentDMFiles = new ArrayList(indexedColumns.size()); +currentDataOutStreams = new ArrayList(indexedColumns.size()); +currentObjectOutStreams = new ArrayList(indexedColumns.size()); + +indexBloomFilters = new ArrayList >(indexedColumns.size()); + } + + @Override + public void onBlockStart(String blockId, long taskId) throws IOException { +this.currentBlockId = blockId; +this.currentBlockletId = 0; +currentDMFiles.clear(); +currentDataOutStreams.clear(); +currentObjectOutStreams.clear(); +initDataMapFile(); + } + + @Override + public void onBlockEnd(String blockId) throws IOException { +for (int indexColId = 0; indexColId < indexedColumns.size(); indexColId++) { + CarbonUtil.closeStreams(this.currentDataOutStreams.get(indexColId), + this.currentObjectOutStreams.get(indexColId)); + commitFile(this.currentDMFiles.get(indexColId)); +} + } + + @Override public void onBlockletStart(int blockletId) { +this.currentBlockletId = blockletId; +indexBloomFilters.clear(); +for (int i = 0; i < indexedColumns.size(); i++) { + indexBloomFilters.add(BloomFilter.create(Funnels.byteArrayFunnel(), + BLOOM_FILTER_SIZE, 0.1d)); +} + } + +
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183210742 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java --- @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.carbondata.datamap.bloom; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.IOException; +import java.io.ObjectOutputStream; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datamap.DataMapMeta; +import org.apache.carbondata.core.datamap.Segment; +import org.apache.carbondata.core.datamap.dev.DataMapWriter; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.datastore.page.ColumnPage; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.hash.BloomFilter; +import com.google.common.hash.Funnels; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +@InterfaceAudience.Internal +public class BloomDataMapWriter extends DataMapWriter { + /** + * suppose one blocklet contains 20 page and all the indexed value is distinct. + * later we can make it configurable. + */ + private static final int BLOOM_FILTER_SIZE = 32000 * 20; + private String dataMapName; + private List indexedColumns; + // map column name to ordinal in pages + private Mapcol2Ordianl; + private Map col2DataType; + private String currentBlockId; + private int currentBlockletId; + private List currentDMFiles; + private List currentDataOutStreams; + private List currentObjectOutStreams; + private List > indexBloomFilters; + + public BloomDataMapWriter(AbsoluteTableIdentifier identifier, DataMapMeta dataMapMeta, --- End diff -- Add @InterfaceAudience. Also, can you add a description of: 1. At what level the BloomFilter is constructed: page, blocklet, or block? 2. Whether the bloom index is written as one file per block, or one file per write task? ---
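The quoted diff hints at an answer to both questions: `onBlockletStart` clears and recreates one filter per indexed column, and `onBlockEnd` closes the streams and commits one file per indexed column. The toy model below mirrors only that callback order. It is a sketch under stated assumptions, not the CarbonData implementation: the `onBlockletEnd` step, the `.bloomindex` suffix, and every class and method name are hypothetical, and a `BitSet` stands in for the real bloom filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Toy model of the writer callbacks seen in the quoted diff; all names are illustrative.
final class BloomGranularitySketch {
  private static final int TOY_BITS = 1024;

  private final List<String> indexedColumns;
  private final List<BitSet> blockletFilters = new ArrayList<>();
  private String currentBlockId;

  // Records kept only so the granularity can be inspected afterwards.
  final List<String> filtersWritten = new ArrayList<>();
  final List<String> filesCommitted = new ArrayList<>();

  BloomGranularitySketch(List<String> indexedColumns) {
    this.indexedColumns = indexedColumns;
  }

  void onBlockStart(String blockId) {
    currentBlockId = blockId;
  }

  // As in the diff: a fresh filter per indexed column at the start of every blocklet.
  void onBlockletStart(int blockletId) {
    blockletFilters.clear();
    for (int i = 0; i < indexedColumns.size(); i++) {
      blockletFilters.add(new BitSet(TOY_BITS));
    }
  }

  void onValue(int columnIndex, String value) {
    blockletFilters.get(columnIndex).set(Math.floorMod(value.hashCode(), TOY_BITS));
  }

  // Assumption: each blocklet's filter is appended to the file of its column.
  void onBlockletEnd(int blockletId) {
    for (String column : indexedColumns) {
      filtersWritten.add(fileName(column) + "#blocklet" + blockletId);
    }
  }

  // As in the diff: one file per indexed column is committed when the block ends.
  void onBlockEnd() {
    for (String column : indexedColumns) {
      filesCommitted.add(fileName(column));
    }
  }

  // Hypothetical naming scheme, not taken from the pull request.
  private String fileName(String column) {
    return currentBlockId + "/" + column + ".bloomindex";
  }

  public static void main(String[] args) {
    BloomGranularitySketch writer = new BloomGranularitySketch(Arrays.asList("id", "name"));
    writer.onBlockStart("block0");
    for (int blocklet = 0; blocklet < 2; blocklet++) {
      writer.onBlockletStart(blocklet);
      writer.onValue(0, "42");
      writer.onValue(1, "alice");
      writer.onBlockletEnd(blocklet);
    }
    writer.onBlockEnd();
    System.out.println("filters: " + writer.filtersWritten);
    System.out.println("files:   " + writer.filesCommitted);
  }
}
```

Under these assumptions the filter granularity is one per blocklet per indexed column, and the file granularity is one per block per indexed column.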
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183210637 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java --- @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.carbondata.datamap.bloom; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.IOException; +import java.io.ObjectOutputStream; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datamap.DataMapMeta; +import org.apache.carbondata.core.datamap.Segment; +import org.apache.carbondata.core.datamap.dev.DataMapWriter; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.datastore.page.ColumnPage; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.hash.BloomFilter; +import com.google.common.hash.Funnels; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +@InterfaceAudience.Internal +public class BloomDataMapWriter extends DataMapWriter { + /** + * suppose one blocklet contains 20 page and all the indexed value is distinct. + * later we can make it configurable. 
+ */ + private static final int BLOOM_FILTER_SIZE = 32000 * 20; + private String dataMapName; + private List indexedColumns; + // map column name to ordinal in pages + private Mapcol2Ordianl; + private Map col2DataType; + private String currentBlockId; + private int currentBlockletId; + private List currentDMFiles; + private List currentDataOutStreams; + private List currentObjectOutStreams; + private List > indexBloomFilters; + + public BloomDataMapWriter(AbsoluteTableIdentifier identifier, DataMapMeta dataMapMeta, + Segment segment, String writeDirectoryPath) { +super(identifier, segment, writeDirectoryPath); +dataMapName = dataMapMeta.getDataMapName(); +indexedColumns = dataMapMeta.getIndexedColumns(); +col2Ordianl = new HashMap (indexedColumns.size()); +col2DataType = new HashMap (indexedColumns.size()); + +currentDMFiles = new ArrayList(indexedColumns.size()); +currentDataOutStreams = new ArrayList(indexedColumns.size()); +currentObjectOutStreams = new ArrayList(indexedColumns.size()); + +indexBloomFilters = new ArrayList >(indexedColumns.size()); + } + + @Override + public void onBlockStart(String blockId, long taskId) throws IOException { +this.currentBlockId = blockId; +this.currentBlockletId = 0; +currentDMFiles.clear(); +currentDataOutStreams.clear(); +currentObjectOutStreams.clear(); +initDataMapFile(); + } + + @Override + public void onBlockEnd(String blockId) throws IOException { +for (int indexColId = 0; indexColId < indexedColumns.size(); indexColId++) { + CarbonUtil.closeStreams(this.currentDataOutStreams.get(indexColId), + this.currentObjectOutStreams.get(indexColId)); + commitFile(this.currentDMFiles.get(indexColId)); +} + } + + @Override public void onBlockletStart(int blockletId) { +this.currentBlockletId = blockletId; +indexBloomFilters.clear(); +for (int i = 0; i < indexedColumns.size(); i++) { + indexBloomFilters.add(BloomFilter.create(Funnels.byteArrayFunnel(), + BLOOM_FILTER_SIZE, 0.1d)); +} + } + +
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183210634 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java --- @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.carbondata.datamap.bloom; + +import java.io.DataOutputStream; +import java.io.File; +import java.io.IOException; +import java.io.ObjectOutputStream; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datamap.DataMapMeta; +import org.apache.carbondata.core.datamap.Segment; +import org.apache.carbondata.core.datamap.dev.DataMapWriter; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.datastore.page.ColumnPage; +import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.hash.BloomFilter; +import com.google.common.hash.Funnels; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; + +@InterfaceAudience.Internal +public class BloomDataMapWriter extends DataMapWriter { + /** + * suppose one blocklet contains 20 page and all the indexed value is distinct. + * later we can make it configurable. 
+ */ + private static final int BLOOM_FILTER_SIZE = 32000 * 20; + private String dataMapName; + private List indexedColumns; + // map column name to ordinal in pages + private Mapcol2Ordianl; + private Map col2DataType; + private String currentBlockId; + private int currentBlockletId; + private List currentDMFiles; + private List currentDataOutStreams; + private List currentObjectOutStreams; + private List > indexBloomFilters; + + public BloomDataMapWriter(AbsoluteTableIdentifier identifier, DataMapMeta dataMapMeta, + Segment segment, String writeDirectoryPath) { +super(identifier, segment, writeDirectoryPath); +dataMapName = dataMapMeta.getDataMapName(); +indexedColumns = dataMapMeta.getIndexedColumns(); +col2Ordianl = new HashMap (indexedColumns.size()); +col2DataType = new HashMap (indexedColumns.size()); + +currentDMFiles = new ArrayList(indexedColumns.size()); +currentDataOutStreams = new ArrayList(indexedColumns.size()); +currentObjectOutStreams = new ArrayList(indexedColumns.size()); + +indexBloomFilters = new ArrayList >(indexedColumns.size()); + } + + @Override + public void onBlockStart(String blockId, long taskId) throws IOException { +this.currentBlockId = blockId; +this.currentBlockletId = 0; +currentDMFiles.clear(); +currentDataOutStreams.clear(); +currentObjectOutStreams.clear(); +initDataMapFile(); + } + + @Override + public void onBlockEnd(String blockId) throws IOException { +for (int indexColId = 0; indexColId < indexedColumns.size(); indexColId++) { + CarbonUtil.closeStreams(this.currentDataOutStreams.get(indexColId), + this.currentObjectOutStreams.get(indexColId)); + commitFile(this.currentDMFiles.get(indexColId)); +} + } + + @Override public void onBlockletStart(int blockletId) { --- End diff -- move @Override to previous line ---
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183210621 --- Diff: datamap/bloom/src/test/scala/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapSuite.scala --- @@ -0,0 +1,123 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.datamap.bloom + +import java.io.{File, PrintWriter} +import java.util.UUID + +import scala.util.Random + +import org.apache.spark.sql.Row +import org.apache.spark.sql.test.util.QueryTest +import org.scalatest.BeforeAndAfterAll + +class BloomCoarseGrainDataMapSuite extends QueryTest with BeforeAndAfterAll { + val inputFile = s"$resourcesPath/bloom_datamap_input.csv" + val normalTable = "carbon_normal" + val bloomDMSampleTable = "carbon_bloom" + val dataMapName = "bloom_dm" + val lineNum = 50 + + override protected def beforeAll(): Unit = { +createFile(inputFile, line = lineNum, start = 0) +sql(s"DROP TABLE IF EXISTS $normalTable") +sql(s"DROP TABLE IF EXISTS $bloomDMSampleTable") + } + + test("test bloom datamap") { +sql( + s""" + | CREATE TABLE $normalTable(id INT, name STRING, city STRING, age INT, + | s1 STRING, s2 STRING, s3 STRING, s4 STRING, s5 STRING, s6 STRING, s7 STRING, s8 STRING) + | STORED BY 'carbondata' TBLPROPERTIES('table_blocksize'='128') + | """.stripMargin) +sql( + s""" + | CREATE TABLE $bloomDMSampleTable(id INT, name STRING, city STRING, age INT, + | s1 STRING, s2 STRING, s3 STRING, s4 STRING, s5 STRING, s6 STRING, s7 STRING, s8 STRING) + | STORED BY 'carbondata' TBLPROPERTIES('table_blocksize'='128') + | """.stripMargin) +sql( + s""" + | CREATE DATAMAP $dataMapName ON TABLE $bloomDMSampleTable + | USING '${classOf[BloomCoarseGrainDataMapFactory].getName}' + | DMProperties('BLOOM_COLUMNS'='city,id') + """.stripMargin) + +sql( + s""" + | LOAD DATA LOCAL INPATH '$inputFile' INTO TABLE $normalTable + | OPTIONS('header'='false') + """.stripMargin) +sql( + s""" + | LOAD DATA LOCAL INPATH '$inputFile' INTO TABLE $bloomDMSampleTable + | OPTIONS('header'='false') + """.stripMargin) + +sql(s"show datamap on table $bloomDMSampleTable").show(false) +sql(s"select * from $bloomDMSampleTable where city = 'city_5'").show(false) --- End diff -- can you also assert the bloom index file is created in the file system? ---
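The check requested in this comment could be sketched as below. This is only an illustration written against the local file system with `java.nio.file`: the class `IndexFileCheck` and the method `hasFileWithSuffix` are hypothetical names that do not appear in the pull request, and the directory to inspect (wherever the datamap writer places its output for the segment) is an assumption left to the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper, not part of the pull request.
final class IndexFileCheck {

  // Returns true if at least one regular file under dir (searched
  // recursively) has a name ending with the given suffix.
  static boolean hasFileWithSuffix(Path dir, String suffix) throws IOException {
    if (!Files.isDirectory(dir)) {
      return false;
    }
    try (Stream<Path> files = Files.walk(dir)) {
      return files
          .filter(Files::isRegularFile)
          .anyMatch(p -> p.getFileName().toString().endsWith(suffix));
    }
  }

  private IndexFileCheck() { }
}
```

In the Scala suite the same condition would presumably be wrapped in an `assert(...)` after the load into the bloom table, passing `".bloomindex"` as the suffix.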
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183210168 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java --- @@ -0,0 +1,243 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.datamap.bloom; + +import java.io.DataInputStream; +import java.io.EOFException; +import java.io.IOException; +import java.io.ObjectInputStream; +import java.io.UnsupportedEncodingException; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.datamap.dev.DataMapModel; +import org.apache.carbondata.core.datamap.dev.cgdatamap.CoarseGrainDataMap; +import org.apache.carbondata.core.datastore.block.SegmentProperties; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.indexstore.Blocklet; +import org.apache.carbondata.core.indexstore.PartitionSpec; +import org.apache.carbondata.core.memory.MemoryException; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.scan.expression.ColumnExpression; +import org.apache.carbondata.core.scan.expression.Expression; +import org.apache.carbondata.core.scan.expression.LiteralExpression; +import org.apache.carbondata.core.scan.expression.conditional.EqualToExpression; +import org.apache.carbondata.core.scan.filter.resolver.FilterResolverIntf; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.collect.ArrayListMultimap; +import com.google.common.collect.Multimap; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.PathFilter; + +public class BloomCoarseGrainDataMap extends CoarseGrainDataMap { + private static final LogService LOGGER = + LogServiceFactory.getLogService(BloomCoarseGrainDataMap.class.getName()); + private String[] indexFilePath; + private Set indexedColumn; + 
private List bloomIndexList; + private MultimapindexCol2BloomDMList; + + @Override + public void init(DataMapModel dataMapModel) throws MemoryException, IOException { +Path indexPath = FileFactory.getPath(dataMapModel.getFilePath()); +FileSystem fs = FileFactory.getFileSystem(indexPath); +if (!fs.exists(indexPath)) { + throw new IOException( + String.format("Path %s for Bloom index dataMap does not exist", indexPath)); +} +if (!fs.isDirectory(indexPath)) { + throw new IOException( + String.format("Path %s for Bloom index dataMap must be a directory", indexPath)); +} + +FileStatus[] indexFileStatus = fs.listStatus(indexPath, new PathFilter() { + @Override public boolean accept(Path path) { +return path.getName().endsWith(".bloomindex"); + } +}); +indexFilePath = new String[indexFileStatus.length]; +indexedColumn = new HashSet(); +bloomIndexList = new ArrayList(); +indexCol2BloomDMList = ArrayListMultimap.create(); +for (int i = 0; i < indexFileStatus.length; i++) { + indexFilePath[i] = indexFileStatus[i].getPath().toString(); + String indexCol = StringUtils.substringBetween(indexFilePath[i], ".carbondata.", + ".bloomindex"); + indexedColumn.add(indexCol); + bloomIndexList.addAll(readBloomIndex(indexFilePath[i])); + indexCol2BloomDMList.put(indexCol, readBloomIndex(indexFilePath[i])); +} +LOGGER.info("find bloom index datamap for column: "
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183203455 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java --- @@ -0,0 +1,243 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.datamap.bloom; + +import java.io.DataInputStream; +import java.io.EOFException; +import java.io.IOException; +import java.io.ObjectInputStream; +import java.io.UnsupportedEncodingException; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.datamap.dev.DataMapModel; +import org.apache.carbondata.core.datamap.dev.cgdatamap.CoarseGrainDataMap; +import org.apache.carbondata.core.datastore.block.SegmentProperties; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.indexstore.Blocklet; +import org.apache.carbondata.core.indexstore.PartitionSpec; +import org.apache.carbondata.core.memory.MemoryException; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.scan.expression.ColumnExpression; +import org.apache.carbondata.core.scan.expression.Expression; +import org.apache.carbondata.core.scan.expression.LiteralExpression; +import org.apache.carbondata.core.scan.expression.conditional.EqualToExpression; +import org.apache.carbondata.core.scan.filter.resolver.FilterResolverIntf; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.collect.ArrayListMultimap; +import com.google.common.collect.Multimap; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.PathFilter; + +public class BloomCoarseGrainDataMap extends CoarseGrainDataMap { + private static final LogService LOGGER = + LogServiceFactory.getLogService(BloomCoarseGrainDataMap.class.getName()); + private String[] indexFilePath; + private Set indexedColumn; + 
private List bloomIndexList; + private MultimapindexCol2BloomDMList; + + @Override + public void init(DataMapModel dataMapModel) throws MemoryException, IOException { +Path indexPath = FileFactory.getPath(dataMapModel.getFilePath()); +FileSystem fs = FileFactory.getFileSystem(indexPath); +if (!fs.exists(indexPath)) { + throw new IOException( + String.format("Path %s for Bloom index dataMap does not exist", indexPath)); +} +if (!fs.isDirectory(indexPath)) { + throw new IOException( + String.format("Path %s for Bloom index dataMap must be a directory", indexPath)); +} + +FileStatus[] indexFileStatus = fs.listStatus(indexPath, new PathFilter() { + @Override public boolean accept(Path path) { +return path.getName().endsWith(".bloomindex"); + } +}); +indexFilePath = new String[indexFileStatus.length]; +indexedColumn = new HashSet(); +bloomIndexList = new ArrayList(); +indexCol2BloomDMList = ArrayListMultimap.create(); +for (int i = 0; i < indexFileStatus.length; i++) { + indexFilePath[i] = indexFileStatus[i].getPath().toString(); + String indexCol = StringUtils.substringBetween(indexFilePath[i], ".carbondata.", + ".bloomindex"); + indexedColumn.add(indexCol); + bloomIndexList.addAll(readBloomIndex(indexFilePath[i])); + indexCol2BloomDMList.put(indexCol, readBloomIndex(indexFilePath[i])); +} +LOGGER.info("find bloom index datamap for column: " +
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183201359 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java --- @@ -0,0 +1,243 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.datamap.bloom; + +import java.io.DataInputStream; +import java.io.EOFException; +import java.io.IOException; +import java.io.ObjectInputStream; +import java.io.UnsupportedEncodingException; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.datamap.dev.DataMapModel; +import org.apache.carbondata.core.datamap.dev.cgdatamap.CoarseGrainDataMap; +import org.apache.carbondata.core.datastore.block.SegmentProperties; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.indexstore.Blocklet; +import org.apache.carbondata.core.indexstore.PartitionSpec; +import org.apache.carbondata.core.memory.MemoryException; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.scan.expression.ColumnExpression; +import org.apache.carbondata.core.scan.expression.Expression; +import org.apache.carbondata.core.scan.expression.LiteralExpression; +import org.apache.carbondata.core.scan.expression.conditional.EqualToExpression; +import org.apache.carbondata.core.scan.filter.resolver.FilterResolverIntf; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.collect.ArrayListMultimap; +import com.google.common.collect.Multimap; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.PathFilter; + +public class BloomCoarseGrainDataMap extends CoarseGrainDataMap { + private static final LogService LOGGER = + LogServiceFactory.getLogService(BloomCoarseGrainDataMap.class.getName()); + private String[] indexFilePath; + private Set indexedColumn; + 
private List bloomIndexList; + private MultimapindexCol2BloomDMList; + + @Override + public void init(DataMapModel dataMapModel) throws MemoryException, IOException { +Path indexPath = FileFactory.getPath(dataMapModel.getFilePath()); +FileSystem fs = FileFactory.getFileSystem(indexPath); +if (!fs.exists(indexPath)) { + throw new IOException( + String.format("Path %s for Bloom index dataMap does not exist", indexPath)); +} +if (!fs.isDirectory(indexPath)) { + throw new IOException( + String.format("Path %s for Bloom index dataMap must be a directory", indexPath)); +} + +FileStatus[] indexFileStatus = fs.listStatus(indexPath, new PathFilter() { + @Override public boolean accept(Path path) { +return path.getName().endsWith(".bloomindex"); + } +}); +indexFilePath = new String[indexFileStatus.length]; +indexedColumn = new HashSet(); +bloomIndexList = new ArrayList(); +indexCol2BloomDMList = ArrayListMultimap.create(); +for (int i = 0; i < indexFileStatus.length; i++) { + indexFilePath[i] = indexFileStatus[i].getPath().toString(); + String indexCol = StringUtils.substringBetween(indexFilePath[i], ".carbondata.", + ".bloomindex"); + indexedColumn.add(indexCol); + bloomIndexList.addAll(readBloomIndex(indexFilePath[i])); + indexCol2BloomDMList.put(indexCol, readBloomIndex(indexFilePath[i])); +} +LOGGER.info("find bloom index datamap for column: "
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183199149 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java --- @@ -0,0 +1,192 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.carbondata.datamap.bloom; + +import java.io.File; +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Objects; +import java.util.Set; + +import org.apache.carbondata.common.exceptions.sql.MalformedDataMapCommandException; +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.datamap.DataMapDistributable; +import org.apache.carbondata.core.datamap.DataMapLevel; +import org.apache.carbondata.core.datamap.DataMapMeta; +import org.apache.carbondata.core.datamap.Segment; +import org.apache.carbondata.core.datamap.dev.DataMapFactory; +import org.apache.carbondata.core.datamap.dev.DataMapModel; +import org.apache.carbondata.core.datamap.dev.DataMapWriter; +import org.apache.carbondata.core.datamap.dev.cgdatamap.CoarseGrainDataMap; +import org.apache.carbondata.core.datastore.filesystem.CarbonFile; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.memory.MemoryException; +import org.apache.carbondata.core.metadata.CarbonMetadata; +import org.apache.carbondata.core.metadata.schema.table.CarbonTable; +import org.apache.carbondata.core.metadata.schema.table.DataMapSchema; +import org.apache.carbondata.core.metadata.schema.table.column.CarbonColumn; +import org.apache.carbondata.core.readcommitter.ReadCommittedScope; +import org.apache.carbondata.core.scan.filter.intf.ExpressionType; +import org.apache.carbondata.core.statusmanager.SegmentStatusManager; +import org.apache.carbondata.core.util.CarbonUtil; +import org.apache.carbondata.core.util.path.CarbonTablePath; +import org.apache.carbondata.events.Event; + +import org.apache.commons.lang3.StringUtils; + +public class BloomCoarseGrainDataMapFactory implements DataMapFactory { --- End diff -- add @InterfaceAudience.Internal ---
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183199053 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java --- @@ -0,0 +1,243 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.datamap.bloom; + +import java.io.DataInputStream; +import java.io.EOFException; +import java.io.IOException; +import java.io.ObjectInputStream; +import java.io.UnsupportedEncodingException; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.datamap.dev.DataMapModel; +import org.apache.carbondata.core.datamap.dev.cgdatamap.CoarseGrainDataMap; +import org.apache.carbondata.core.datastore.block.SegmentProperties; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.indexstore.Blocklet; +import org.apache.carbondata.core.indexstore.PartitionSpec; +import org.apache.carbondata.core.memory.MemoryException; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.scan.expression.ColumnExpression; +import org.apache.carbondata.core.scan.expression.Expression; +import org.apache.carbondata.core.scan.expression.LiteralExpression; +import org.apache.carbondata.core.scan.expression.conditional.EqualToExpression; +import org.apache.carbondata.core.scan.filter.resolver.FilterResolverIntf; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.collect.ArrayListMultimap; +import com.google.common.collect.Multimap; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.PathFilter; + +public class BloomCoarseGrainDataMap extends CoarseGrainDataMap { + private static final LogService LOGGER = + LogServiceFactory.getLogService(BloomCoarseGrainDataMap.class.getName()); + private String[] indexFilePath; + private Set indexedColumn; + 
private List bloomIndexList; + private MultimapindexCol2BloomDMList; + + @Override + public void init(DataMapModel dataMapModel) throws MemoryException, IOException { +Path indexPath = FileFactory.getPath(dataMapModel.getFilePath()); +FileSystem fs = FileFactory.getFileSystem(indexPath); +if (!fs.exists(indexPath)) { + throw new IOException( + String.format("Path %s for Bloom index dataMap does not exist", indexPath)); +} +if (!fs.isDirectory(indexPath)) { + throw new IOException( + String.format("Path %s for Bloom index dataMap must be a directory", indexPath)); +} + +FileStatus[] indexFileStatus = fs.listStatus(indexPath, new PathFilter() { + @Override public boolean accept(Path path) { +return path.getName().endsWith(".bloomindex"); + } +}); +indexFilePath = new String[indexFileStatus.length]; +indexedColumn = new HashSet(); +bloomIndexList = new ArrayList(); +indexCol2BloomDMList = ArrayListMultimap.create(); +for (int i = 0; i < indexFileStatus.length; i++) { + indexFilePath[i] = indexFileStatus[i].getPath().toString(); + String indexCol = StringUtils.substringBetween(indexFilePath[i], ".carbondata.", + ".bloomindex"); + indexedColumn.add(indexCol); + bloomIndexList.addAll(readBloomIndex(indexFilePath[i])); + indexCol2BloomDMList.put(indexCol, readBloomIndex(indexFilePath[i])); +} +LOGGER.info("find bloom index datamap for column: " +
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183199028 --- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java --- @@ -0,0 +1,243 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.datamap.bloom; + +import java.io.DataInputStream; +import java.io.EOFException; +import java.io.IOException; +import java.io.ObjectInputStream; +import java.io.UnsupportedEncodingException; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.datamap.dev.DataMapModel; +import org.apache.carbondata.core.datamap.dev.cgdatamap.CoarseGrainDataMap; +import org.apache.carbondata.core.datastore.block.SegmentProperties; +import org.apache.carbondata.core.datastore.impl.FileFactory; +import org.apache.carbondata.core.indexstore.Blocklet; +import org.apache.carbondata.core.indexstore.PartitionSpec; +import org.apache.carbondata.core.memory.MemoryException; +import org.apache.carbondata.core.metadata.datatype.DataType; +import org.apache.carbondata.core.metadata.datatype.DataTypes; +import org.apache.carbondata.core.scan.expression.ColumnExpression; +import org.apache.carbondata.core.scan.expression.Expression; +import org.apache.carbondata.core.scan.expression.LiteralExpression; +import org.apache.carbondata.core.scan.expression.conditional.EqualToExpression; +import org.apache.carbondata.core.scan.filter.resolver.FilterResolverIntf; +import org.apache.carbondata.core.util.CarbonUtil; + +import com.google.common.collect.ArrayListMultimap; +import com.google.common.collect.Multimap; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.PathFilter; + +public class BloomCoarseGrainDataMap extends CoarseGrainDataMap { + private static final LogService LOGGER = + LogServiceFactory.getLogService(BloomCoarseGrainDataMap.class.getName()); + private String[] indexFilePath; + private Set indexedColumn; + 
private List bloomIndexList; + private MultimapindexCol2BloomDMList; + + @Override + public void init(DataMapModel dataMapModel) throws MemoryException, IOException { +Path indexPath = FileFactory.getPath(dataMapModel.getFilePath()); +FileSystem fs = FileFactory.getFileSystem(indexPath); +if (!fs.exists(indexPath)) { + throw new IOException( + String.format("Path %s for Bloom index dataMap does not exist", indexPath)); +} +if (!fs.isDirectory(indexPath)) { + throw new IOException( + String.format("Path %s for Bloom index dataMap must be a directory", indexPath)); +} + +FileStatus[] indexFileStatus = fs.listStatus(indexPath, new PathFilter() { + @Override public boolean accept(Path path) { +return path.getName().endsWith(".bloomindex"); --- End diff -- make a constant string for `.bloomindex` ---
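A minimal sketch of what this suggestion might look like, using only the JDK so that it stands alone. The class name `BloomIndexFiles` and the constant name `BLOOM_INDEX_SUFFIX` are assumptions chosen for illustration, and `java.io.FilenameFilter` stands in for the Hadoop `PathFilter` used in the diff.

```java
import java.io.File;
import java.io.FilenameFilter;

// Hypothetical holder class; the name is not taken from the pull request.
final class BloomIndexFiles {

  // Single definition of the suffix, so the writer and the reader
  // cannot drift apart on the file name convention.
  static final String BLOOM_INDEX_SUFFIX = ".bloomindex";

  // Accepts only file names that end with the bloom index suffix.
  static final FilenameFilter FILTER = new FilenameFilter() {
    @Override
    public boolean accept(File dir, String name) {
      return name.endsWith(BLOOM_INDEX_SUFFIX);
    }
  };

  private BloomIndexFiles() { }
}
```

The same constant could then also replace the literal passed to `StringUtils.substringBetween` further down in `init`.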
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183198993 --- Diff: datamap/bloom/pom.xml --- @@ -0,0 +1,88 @@ +http://maven.apache.org/POM/4.0.0; + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd;> + 4.0.0 + + +org.apache.carbondata +carbondata-parent +1.4.0-SNAPSHOT +../../pom.xml + + + carbondata-bloom + Apache CarbonData :: Bloom Index DataMap + + +${basedir}/../../dev +6.3.0 +6.3.0 + + + + + org.apache.carbondata + carbondata-spark2 + ${project.version} + + + org.apache.commons + commons-lang3 + 3.3.2 --- End diff -- can you move this version definition to parent pom ---
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2200#discussion_r183198962 --- Diff: datamap/bloom/pom.xml --- @@ -0,0 +1,88 @@ +http://maven.apache.org/POM/4.0.0; + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd;> + 4.0.0 + + +org.apache.carbondata +carbondata-parent +1.4.0-SNAPSHOT +../../pom.xml + + + carbondata-bloom + Apache CarbonData :: Bloom Index DataMap + + +${basedir}/../../dev +6.3.0 --- End diff -- can you move this definition in parent pom ---
[GitHub] carbondata pull request #2200: [CARBONDATA-2373][DataMap] Add bloom datamap ...
GitHub user xuchuanyin opened a pull request:

    https://github.com/apache/carbondata/pull/2200

    [CARBONDATA-2373][DataMap] Add bloom datamap to support precise equal query

    For each indexed column, add a bloom filter for each blocklet to indicate whether a queried value belongs to this blocklet.
    Currently the bloom filter uses the Guava implementation.

    Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

     - [x] Any interfaces changed?
      `Yes, added interface in DataMapMeta`
     - [x] Any backward compatibility impacted?
      `NO`
     - [x] Document update required?
      `NO`
     - [x] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            `Added tests`
            - How it is tested? Please attach test report.
            `Tested in local machine`
            - Is it a performance related change? Please attach the performance test report.
            `Bloom datamap can reduce blocklets in precise equal query scenario and enhance the query performance`
            - Any additional information to help reviewers in testing this change.
            `NO`
     - [x] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
      `Not related`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata 0421_bloom_datamap

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2200.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2200

commit 160b0f42248fe719f898c10cb84ab2d32eafdaac
Author: xuchuanyin
Date:   2018-04-21T02:59:04Z

    Add bloom datamap using bloom filter

    For each indexed column, adding a bloom filter for each blocklet to
    indicate whether it belongs to this blocklet.
    Currently bloom filter is using guava version.

---
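The idea in the description, one bloom filter per blocklet used to skip blocklets on an equality query, can be illustrated with the sketch below. It is not the code from the pull request: the PR uses Guava's `BloomFilter`, whereas this example substitutes a deliberately small `BitSet`-based filter so that it runs on the JDK alone. The names `BlockletBloomSketch`, `SimpleBloom`, and `candidateBlocklets`, as well as the hashing scheme, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration; not the implementation in the pull request.
final class BlockletBloomSketch {

  /** Tiny bloom filter: each value sets numHashes bit positions. */
  static final class SimpleBloom {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloom(int numBits, int numHashes) {
      this.bits = new BitSet(numBits);
      this.numBits = numBits;
      this.numHashes = numHashes;
    }

    // Double hashing: position_i = (h1 + i * h2) mod numBits.
    private int position(String value, int i) {
      int h1 = value.hashCode();
      int h2 = (h1 >>> 16) | 1;
      return Math.floorMod(h1 + i * h2, numBits);
    }

    void put(String value) {
      for (int i = 0; i < numHashes; i++) {
        bits.set(position(value, i));
      }
    }

    // May return true for a value that was never added (false positive),
    // but never returns false for a value that was added.
    boolean mightContain(String value) {
      for (int i = 0; i < numHashes; i++) {
        if (!bits.get(position(value, i))) {
          return false;
        }
      }
      return true;
    }
  }

  /** Returns the list positions of blocklets whose filter may contain value. */
  static List<Integer> candidateBlocklets(List<SimpleBloom> blockletFilters, String value) {
    List<Integer> hits = new ArrayList<>();
    for (int blockletId = 0; blockletId < blockletFilters.size(); blockletId++) {
      if (blockletFilters.get(blockletId).mightContain(value)) {
        hits.add(blockletId);
      }
    }
    return hits;
  }

  private BlockletBloomSketch() { }
}
```

Because a bloom filter has no false negatives, a blocklet missing from the returned list can be skipped safely for a query such as `city = 'city_5'`; a blocklet in the list still has to be scanned, since the match may be a false positive.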