[GitHub] carbondata issue #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2589 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6159/ ---
[GitHub] carbondata issue #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2589 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6501/ ---
[GitHub] carbondata issue #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2589 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1// ---
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207699726 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/store/BlockScanUnit.java --- @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.sdk.store; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.hadoop.CarbonInputSplit; + +/** + * It contains a block to scan, and a destination worker who should scan it + */ +@InterfaceAudience.Internal +public class BlockScanUnit implements ScanUnit { + + // the data block to scan + private CarbonInputSplit inputSplit; + + // the worker who should scan this unit + private Schedulable schedulable; --- End diff -- fixed ---
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207699730 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/store/ScanUnit.java --- @@ -15,26 +15,27 @@ * limitations under the License. */ -package org.apache.carbondata.store.impl.rpc; +package org.apache.carbondata.sdk.store; -import org.apache.carbondata.common.annotations.InterfaceAudience; -import org.apache.carbondata.store.impl.rpc.model.BaseResponse; -import org.apache.carbondata.store.impl.rpc.model.LoadDataRequest; -import org.apache.carbondata.store.impl.rpc.model.QueryResponse; -import org.apache.carbondata.store.impl.rpc.model.Scan; -import org.apache.carbondata.store.impl.rpc.model.ShutdownRequest; -import org.apache.carbondata.store.impl.rpc.model.ShutdownResponse; - -import org.apache.hadoop.ipc.VersionedProtocol; - -@InterfaceAudience.Internal -public interface StoreService extends VersionedProtocol { - - long versionID = 1L; +import java.io.Serializable; - BaseResponse loadData(LoadDataRequest request); - - QueryResponse query(Scan scan); +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.common.annotations.InterfaceStability; +import org.apache.carbondata.core.metadata.schema.table.Writable; - ShutdownResponse shutdown(ShutdownRequest request); +/** + * An unit for the scanner in Carbon Store + */ +@InterfaceAudience.User +@InterfaceStability.Unstable +public interface ScanUnit extends Serializable, Writable { --- End diff -- fixed ---
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207699719 --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/CarbonInputSplit.java --- @@ -444,4 +444,16 @@ public void setFormat(FileFormat fileFormat) { public Blocklet makeBlocklet() { return new Blocklet(getPath().getName(), blockletId); } + + public String[] preferredLocations() { --- End diff -- fixed ---
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ajithme commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207699358 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/store/ScanUnit.java --- @@ -15,26 +15,27 @@ * limitations under the License. */ -package org.apache.carbondata.store.impl.rpc; +package org.apache.carbondata.sdk.store; -import org.apache.carbondata.common.annotations.InterfaceAudience; -import org.apache.carbondata.store.impl.rpc.model.BaseResponse; -import org.apache.carbondata.store.impl.rpc.model.LoadDataRequest; -import org.apache.carbondata.store.impl.rpc.model.QueryResponse; -import org.apache.carbondata.store.impl.rpc.model.Scan; -import org.apache.carbondata.store.impl.rpc.model.ShutdownRequest; -import org.apache.carbondata.store.impl.rpc.model.ShutdownResponse; - -import org.apache.hadoop.ipc.VersionedProtocol; - -@InterfaceAudience.Internal -public interface StoreService extends VersionedProtocol { - - long versionID = 1L; +import java.io.Serializable; - BaseResponse loadData(LoadDataRequest request); - - QueryResponse query(Scan scan); +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.common.annotations.InterfaceStability; +import org.apache.carbondata.core.metadata.schema.table.Writable; - ShutdownResponse shutdown(ShutdownRequest request); +/** + * An unit for the scanner in Carbon Store + */ +@InterfaceAudience.User +@InterfaceStability.Unstable +public interface ScanUnit extends Serializable, Writable { --- End diff -- can remove Generics ---
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ajithme commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207699345 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/store/BlockScanUnit.java --- @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.sdk.store; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.hadoop.CarbonInputSplit; + +/** + * It contains a block to scan, and a destination worker who should scan it + */ +@InterfaceAudience.Internal +public class BlockScanUnit implements ScanUnit { + + // the data block to scan + private CarbonInputSplit inputSplit; + + // the worker who should scan this unit + private Schedulable schedulable; --- End diff -- Add this in Writable interface else it will be null after deserialization ---
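The reviewer's point above is that any field of a Writable that is not written in write() and read back in readFields() silently comes back null after deserialization. A minimal stdlib-only sketch of that contract, using simplified stand-in classes (String fields in place of CarbonInputSplit and Schedulable, and a hypothetical roundTrip helper — not the actual CarbonData sources):

```java
import java.io.*;

// Simplified stand-in for BlockScanUnit: only fields serialized in write()
// and mirrored in readFields() survive the round trip.
public class BlockScanUnitSketch {
    String split;   // stands in for CarbonInputSplit
    String worker;  // stands in for Schedulable

    // Writable-style contract: serialize every field we want to keep.
    void write(DataOutput out) throws IOException {
        out.writeUTF(split);
        out.writeUTF(worker);   // omitting this line would lose the worker
    }

    void readFields(DataInput in) throws IOException {
        split = in.readUTF();
        worker = in.readUTF();  // must mirror write() field-for-field
    }

    // Serialize to a byte buffer and deserialize into a fresh instance,
    // mimicking what an RPC framework does on the receiving side.
    static BlockScanUnitSketch roundTrip(BlockScanUnitSketch u) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        u.write(new DataOutputStream(buf));
        BlockScanUnitSketch copy = new BlockScanUnitSketch();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        BlockScanUnitSketch unit = new BlockScanUnitSketch();
        unit.split = "part-0-0_batchno0-0-0.carbondata";
        unit.worker = "worker-1:10020";
        BlockScanUnitSketch copy = roundTrip(unit);
        // both fields survive because both are covered by write()/readFields()
        System.out.println(copy.split + " -> " + copy.worker);
    }
}
```

If the schedulable field were left out of write()/readFields(), the copy's worker would be null on the receiving side, which is exactly the bug being flagged.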
[jira] [Created] (CARBONDATA-2827) Refactor Segment Status Manager Interface
Ravindra Pesala created CARBONDATA-2827: --- Summary: Refactor Segment Status Manager Interface Key: CARBONDATA-2827 URL: https://issues.apache.org/jira/browse/CARBONDATA-2827 Project: CarbonData Issue Type: Improvement Reporter: Ravindra Pesala Attachments: Segment Status Management interface design_V1.docx Carbon uses a tablestatus file to record the status and details of each segment during every load. The tablestatus file enables Carbon to support concurrent loads and reads without data inconsistency or corruption, so it is a very important feature of CarbonData and we should have clean interfaces to maintain it. Currently, tablestatus updates are scattered across multiple places with no clean interface, so I propose refactoring the current SegmentStatusManager interface and bringing all tablestatus operations into a single interface. The new interface allows table status to be kept in any other storage, such as a DB. This is needed for S3-type object stores, as these are eventually consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
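The proposal above — one interface owning all tablestatus operations so the backing store can be swapped (file, DB, etc.) — could be sketched as follows. All names here are illustrative assumptions, not taken from the attached design document:

```java
import java.util.*;

// Hypothetical sketch of a unified segment-status interface: callers never
// touch the tablestatus file directly, so a DB-backed implementation (useful
// for eventually-consistent stores like S3) can be swapped in transparently.
public class SegmentStatusSketch {

    interface SegmentStore {
        String beginSegment();                 // allocate a new segment id
        void commitSegment(String segmentId);  // mark the load as successful
        List<String> listCommittedSegments();  // what readers are allowed to see
    }

    // In-memory reference implementation; a file- or DB-backed variant would
    // implement the same interface without changing any caller.
    static class InMemorySegmentStore implements SegmentStore {
        private int next = 0;
        private final Set<String> committed = new LinkedHashSet<>();

        public String beginSegment() { return String.valueOf(next++); }
        public void commitSegment(String id) { committed.add(id); }
        public List<String> listCommittedSegments() { return new ArrayList<>(committed); }
    }

    public static void main(String[] args) {
        SegmentStore store = new InMemorySegmentStore();
        String first = store.beginSegment();
        store.beginSegment();          // second load in progress, never committed
        store.commitSegment(first);
        // readers only see committed segments, so concurrent loads stay invisible
        System.out.println(store.listCommittedSegments());
    }
}
```

The design value is in the seam: concurrent-load bookkeeping lives behind one interface instead of being scattered across call sites.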
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ajithme commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207699308 --- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/CarbonInputSplit.java --- @@ -444,4 +444,16 @@ public void setFormat(FileFormat fileFormat) { public Blocklet makeBlocklet() { return new Blocklet(getPath().getName(), blockletId); } + + public String[] preferredLocations() { --- End diff -- The superclass field FileSplit.file is not serializable (refer HADOOP-13519), so Java serialization may return an empty result here ---
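The failure mode behind HADOOP-13519 can be reproduced with plain JDK classes: when a Serializable subclass extends a non-Serializable base, Java serialization re-creates the base via its no-arg constructor, so base-class fields silently reset. A sketch with simplified stand-ins (not the real FileSplit/CarbonInputSplit):

```java
import java.io.*;

// FileSplitLike plays the role of FileSplit (Writable, not Serializable);
// CarbonSplitLike plays the role of a Serializable subclass. After a Java
// serialization round trip, the subclass field survives but the base field
// reverts to whatever the no-arg constructor sets -- here, null.
public class SplitSerializationSketch {

    static class FileSplitLike {             // not Serializable, like FileSplit
        String file;
        FileSplitLike() { }                   // invoked during deserialization
        FileSplitLike(String file) { this.file = file; }
    }

    static class CarbonSplitLike extends FileSplitLike implements Serializable {
        int blockletId;
        CarbonSplitLike(String file, int blockletId) {
            super(file);
            this.blockletId = blockletId;
        }
    }

    static CarbonSplitLike roundTrip(CarbonSplitLike s) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new ObjectOutputStream(buf).writeObject(s);
        return (CarbonSplitLike) new ObjectInputStream(
            new ByteArrayInputStream(buf.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        CarbonSplitLike copy = roundTrip(new CarbonSplitLike("/data/part-0", 3));
        // subclass field survives, superclass field is lost:
        System.out.println(copy.blockletId + " " + copy.file); // prints "3 null"
    }
}
```

This is why preferredLocations() computed from the superclass path may come back empty after Java serialization, and why Writable-based serialization is needed for split transport.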
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207699000 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/store/descriptor/ScanDescriptor.java --- @@ -15,23 +15,33 @@ * limitations under the License. */ -package org.apache.carbondata.store.api.descriptor; +package org.apache.carbondata.sdk.store.descriptor; +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; import java.util.Objects; +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.common.annotations.InterfaceStability; import org.apache.carbondata.core.scan.expression.Expression; +import org.apache.carbondata.core.util.ObjectSerializationUtil; -public class SelectDescriptor { +import org.apache.hadoop.io.Writable; + +@InterfaceAudience.User +@InterfaceStability.Evolving +public class ScanDescriptor implements Writable { private TableIdentifier table; private String[] projection; private Expression filter; private long limit; --- End diff -- ok ---
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user jackylk commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207698994 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/store/ScannerImpl.java --- @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.sdk.store; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.Random; +import java.util.stream.Collectors; + +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.datastore.row.CarbonRow; +import org.apache.carbondata.core.metadata.schema.table.TableInfo; +import org.apache.carbondata.core.scan.expression.Expression; +import org.apache.carbondata.hadoop.CarbonInputSplit; +import org.apache.carbondata.hadoop.CarbonMultiBlockSplit; +import org.apache.carbondata.hadoop.api.CarbonInputFormat; +import org.apache.carbondata.sdk.store.conf.StoreConf; +import org.apache.carbondata.sdk.store.descriptor.ScanDescriptor; +import org.apache.carbondata.sdk.store.descriptor.TableIdentifier; +import org.apache.carbondata.sdk.store.exception.CarbonException; +import org.apache.carbondata.sdk.store.service.DataService; +import org.apache.carbondata.sdk.store.service.PruneService; +import org.apache.carbondata.sdk.store.service.ServiceFactory; +import org.apache.carbondata.sdk.store.service.model.PruneRequest; +import org.apache.carbondata.sdk.store.service.model.PruneResponse; +import org.apache.carbondata.sdk.store.service.model.ScanRequest; +import org.apache.carbondata.sdk.store.service.model.ScanResponse; + +import org.apache.hadoop.conf.Configuration; + +class ScannerImpl implements Scanner { + private static final LogService LOGGER = + LogServiceFactory.getLogService(ScannerImpl.class.getCanonicalName()); + + private PruneService pruneService; + private TableInfo tableInfo; + + ScannerImpl(StoreConf conf, TableInfo tableInfo) throws IOException { +this.pruneService = ServiceFactory.createPruneService( +conf.masterHost(), conf.registryServicePort()); +this.tableInfo = tableInfo; + } + + /** + * Trigger a 
RPC to Carbon Master to do pruning + * @param table table identifier + * @param filterExpression expression of filter predicate given by user + * @return list of ScanUnit + * @throws CarbonException if any error occurs + */ + @Override + public List prune(TableIdentifier table, Expression filterExpression) + throws CarbonException { +try { + Configuration configuration = new Configuration(); + CarbonInputFormat.setTableName(configuration, table.getTableName()); --- End diff -- ok ---
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ajithme commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207501460 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/store/descriptor/ScanDescriptor.java --- @@ -15,23 +15,33 @@ * limitations under the License. */ -package org.apache.carbondata.store.api.descriptor; +package org.apache.carbondata.sdk.store.descriptor; +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; import java.util.Objects; +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.common.annotations.InterfaceStability; import org.apache.carbondata.core.scan.expression.Expression; +import org.apache.carbondata.core.util.ObjectSerializationUtil; -public class SelectDescriptor { +import org.apache.hadoop.io.Writable; + +@InterfaceAudience.User +@InterfaceStability.Evolving +public class ScanDescriptor implements Writable { private TableIdentifier table; private String[] projection; private Expression filter; private long limit; --- End diff -- Must be Long.MAX_VALUE ---
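The reviewer's point is that a Java long field defaults to 0, so a descriptor built without an explicit limit would behave as "return zero rows". A tiny illustrative sketch (hypothetical class, not the real ScanDescriptor):

```java
// Initializing the limit to Long.MAX_VALUE makes "no limit" the default,
// instead of the implicit 0 that would truncate every scan.
public class LimitDefaultSketch {
    static class ScanDescriptorLike {
        long limit = Long.MAX_VALUE;  // reviewer's fix; implicit default is 0
    }

    public static void main(String[] args) {
        ScanDescriptorLike scan = new ScanDescriptorLike();
        // a scanner would cap the rows it returns at the limit:
        long rows = Math.min(1_000_000L, scan.limit);
        System.out.println(rows); // prints "1000000", not 0
    }
}
```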
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ajithme commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207431095 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/store/ScannerImpl.java --- @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.sdk.store; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.Random; +import java.util.stream.Collectors; + +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.datastore.row.CarbonRow; +import org.apache.carbondata.core.metadata.schema.table.TableInfo; +import org.apache.carbondata.core.scan.expression.Expression; +import org.apache.carbondata.hadoop.CarbonInputSplit; +import org.apache.carbondata.hadoop.CarbonMultiBlockSplit; +import org.apache.carbondata.hadoop.api.CarbonInputFormat; +import org.apache.carbondata.sdk.store.conf.StoreConf; +import org.apache.carbondata.sdk.store.descriptor.ScanDescriptor; +import org.apache.carbondata.sdk.store.descriptor.TableIdentifier; +import org.apache.carbondata.sdk.store.exception.CarbonException; +import org.apache.carbondata.sdk.store.service.DataService; +import org.apache.carbondata.sdk.store.service.PruneService; +import org.apache.carbondata.sdk.store.service.ServiceFactory; +import org.apache.carbondata.sdk.store.service.model.PruneRequest; +import org.apache.carbondata.sdk.store.service.model.PruneResponse; +import org.apache.carbondata.sdk.store.service.model.ScanRequest; +import org.apache.carbondata.sdk.store.service.model.ScanResponse; + +import org.apache.hadoop.conf.Configuration; + +class ScannerImpl implements Scanner { + private static final LogService LOGGER = + LogServiceFactory.getLogService(ScannerImpl.class.getCanonicalName()); + + private PruneService pruneService; + private TableInfo tableInfo; + + ScannerImpl(StoreConf conf, TableInfo tableInfo) throws IOException { +this.pruneService = ServiceFactory.createPruneService( +conf.masterHost(), conf.registryServicePort()); --- End diff -- must be prune service port ---
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ajithme commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207431252 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/store/service/StoreService.java --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.sdk.store.service; + +import java.util.List; + +import org.apache.carbondata.common.annotations.InterfaceAudience; +import org.apache.carbondata.core.datastore.row.CarbonRow; +import org.apache.carbondata.core.metadata.schema.table.CarbonTable; +import org.apache.carbondata.sdk.store.descriptor.LoadDescriptor; +import org.apache.carbondata.sdk.store.descriptor.ScanDescriptor; +import org.apache.carbondata.sdk.store.descriptor.TableDescriptor; +import org.apache.carbondata.sdk.store.descriptor.TableIdentifier; +import org.apache.carbondata.sdk.store.exception.CarbonException; + +import org.apache.hadoop.ipc.VersionedProtocol; + +@InterfaceAudience.Internal +public interface StoreService extends VersionedProtocol { + long versionID = 1L; + + void createTable(TableDescriptor descriptor) throws CarbonException; + + void dropTable(TableIdentifier table) throws CarbonException; + + CarbonTable getTable(TableIdentifier table) throws CarbonException; --- End diff -- hadoop RPC need response object to be a org.apache.hadoop.io.serializer.WritableSerialization ---
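The constraint raised above — Hadoop RPC responses must be serializable via WritableSerialization — exists because the receiving side instantiates the response class reflectively through a no-arg constructor and then calls readFields(), so a plain domain object like CarbonTable cannot be reconstructed. A stdlib-only sketch of that mechanism (WritableSerialization itself is a Hadoop class; TableResponse and decode are illustrative assumptions):

```java
import java.io.*;

// Mimics the Writable deserialization contract used by Hadoop RPC:
// reflective no-arg construction followed by readFields().
public class RpcResponseSketch {

    public static class TableResponse {
        String tableName;
        public TableResponse() { }            // required for reflective creation
        void write(DataOutput out) throws IOException { out.writeUTF(tableName); }
        void readFields(DataInput in) throws IOException { tableName = in.readUTF(); }
    }

    // The deserializing side of the framework: it only knows the class,
    // never a populated instance.
    static TableResponse decode(byte[] bytes) throws Exception {
        TableResponse r = TableResponse.class.getDeclaredConstructor().newInstance();
        r.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return r;
    }

    public static void main(String[] args) throws Exception {
        TableResponse resp = new TableResponse();
        resp.tableName = "t1";
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        resp.write(new DataOutputStream(buf));
        System.out.println(decode(buf.toByteArray()).tableName); // prints "t1"
    }
}
```

Hence getTable() would need to return a Writable response model wrapping the table metadata rather than CarbonTable itself.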
[GitHub] carbondata pull request #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ajithme commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2589#discussion_r207433215 --- Diff: store/sdk/src/main/java/org/apache/carbondata/sdk/store/ScannerImpl.java --- @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.carbondata.sdk.store; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.Random; +import java.util.stream.Collectors; + +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; +import org.apache.carbondata.core.datastore.row.CarbonRow; +import org.apache.carbondata.core.metadata.schema.table.TableInfo; +import org.apache.carbondata.core.scan.expression.Expression; +import org.apache.carbondata.hadoop.CarbonInputSplit; +import org.apache.carbondata.hadoop.CarbonMultiBlockSplit; +import org.apache.carbondata.hadoop.api.CarbonInputFormat; +import org.apache.carbondata.sdk.store.conf.StoreConf; +import org.apache.carbondata.sdk.store.descriptor.ScanDescriptor; +import org.apache.carbondata.sdk.store.descriptor.TableIdentifier; +import org.apache.carbondata.sdk.store.exception.CarbonException; +import org.apache.carbondata.sdk.store.service.DataService; +import org.apache.carbondata.sdk.store.service.PruneService; +import org.apache.carbondata.sdk.store.service.ServiceFactory; +import org.apache.carbondata.sdk.store.service.model.PruneRequest; +import org.apache.carbondata.sdk.store.service.model.PruneResponse; +import org.apache.carbondata.sdk.store.service.model.ScanRequest; +import org.apache.carbondata.sdk.store.service.model.ScanResponse; + +import org.apache.hadoop.conf.Configuration; + +class ScannerImpl implements Scanner { + private static final LogService LOGGER = + LogServiceFactory.getLogService(ScannerImpl.class.getCanonicalName()); + + private PruneService pruneService; + private TableInfo tableInfo; + + ScannerImpl(StoreConf conf, TableInfo tableInfo) throws IOException { +this.pruneService = ServiceFactory.createPruneService( +conf.masterHost(), conf.registryServicePort()); +this.tableInfo = tableInfo; + } + + /** + * Trigger a 
RPC to Carbon Master to do pruning + * @param table table identifier + * @param filterExpression expression of filter predicate given by user + * @return list of ScanUnit + * @throws CarbonException if any error occurs + */ + @Override + public List prune(TableIdentifier table, Expression filterExpression) + throws CarbonException { +try { + Configuration configuration = new Configuration(); + CarbonInputFormat.setTableName(configuration, table.getTableName()); --- End diff -- can use CarbonInputFormat.setTableInfo(configuration, tableInfo); else org.apache.carbondata.hadoop.api.CarbonInputFormat#getAbsoluteTableIdentifier will have empty path ---
[jira] [Created] (CARBONDATA-2826) SELECT support using distributed carbon store
Ajith S created CARBONDATA-2826: --- Summary: SELECT support using distributed carbon store Key: CARBONDATA-2826 URL: https://issues.apache.org/jira/browse/CARBONDATA-2826 Project: CarbonData Issue Type: Sub-task Reporter: Ajith S Assignee: Ajith S Change the Carbon code to support scanning (table SELECT using Spark) through the distributed CarbonStore API
[jira] [Created] (CARBONDATA-2825) Store Service Interface
Ajith S created CARBONDATA-2825: --- Summary: Store Service Interface Key: CARBONDATA-2825 URL: https://issues.apache.org/jira/browse/CARBONDATA-2825 Project: CarbonData Issue Type: Sub-task Reporter: Ajith S Assignee: Jacky Li This JIRA targets providing the interfaces from the distributed CarbonStore perspective
[jira] [Created] (CARBONDATA-2824) Distributed CarbonStore
Ajith S created CARBONDATA-2824: --- Summary: Distributed CarbonStore Key: CARBONDATA-2824 URL: https://issues.apache.org/jira/browse/CARBONDATA-2824 Project: CarbonData Issue Type: New Feature Reporter: Ajith S Assignee: Ajith S Currently the CarbonStore is tightly coupled with the FileSystem interface and runs inside the application's JVM, as in Spark. We can instead make CarbonStore run as a separate service that can be accessed via network/RPC. So, as a follow-up of CARBONDATA-2688 (CarbonStore Java API and REST API), we can make the CarbonStore distributed. This has several advantages:
1. A distributed CarbonStore can support parallel scanning, i.e. multiple tasks can scan data in parallel, which may have a higher parallelism factor than the compute layer
2. A distributed CarbonStore can provide an index service to multiple apps (Spark/Flink/Presto), so that the index is shared to save resources
3. A distributed CarbonStore's resource consumption is isolated from the application and is easily scalable to support higher workloads
4. As a future improvement, a distributed CarbonStore can implement a query cache, since it has independent resources
Distributed CarbonStore will have 2 main deployment parts: a cluster of remote CarbonStore services, and an SDK which acts as a client for communication with the store.
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2576 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6500/ ---
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2576 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7776/ ---
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user chenliang613 commented on the issue: https://github.com/apache/carbondata/pull/2576 retest this please ---
[jira] [Resolved] (CARBONDATA-2815) Add documentation for memory spill and rebuild datamap
[ https://issues.apache.org/jira/browse/CARBONDATA-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang Chen resolved CARBONDATA-2815. Resolution: Fixed Fix Version/s: 1.4.1 1.5.0 > Add documentation for memory spill and rebuild datamap > -- > > Key: CARBONDATA-2815 > URL: https://issues.apache.org/jira/browse/CARBONDATA-2815 > Project: CarbonData > Issue Type: Improvement >Reporter: xuchuanyin >Assignee: xuchuanyin >Priority: Major > Fix For: 1.5.0, 1.4.1 > > Time Spent: 2.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] carbondata pull request #2604: [CARBONDATA-2815][Doc] Add documentation for ...
Github user asfgit closed the pull request at: https://github.com/apache/carbondata/pull/2604 ---
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2576 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6158/ ---
[GitHub] carbondata issue #2607: [CARBONDATA-2818] Presto Upgrade to 0.206
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2607 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6157/ ---
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2576 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6156/ ---
[GitHub] carbondata issue #2606: [CARBONDATA-2817]Thread Leak in Update and in No sor...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2606 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6155/ ---
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2576 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6154/ ---
[GitHub] carbondata issue #2606: [CARBONDATA-2817]Thread Leak in Update and in No sor...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2606 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6153/ ---
[GitHub] carbondata issue #2603: [Documentation] Editorial review comment fixed
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2603 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6152/ ---
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2576 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6151/ ---
[GitHub] carbondata issue #2590: [CARBONDATA-2750] Updated documentation on Local Dic...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2590 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6150/ ---
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2576 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6149/ ---
[GitHub] carbondata issue #2594: [CARBONDATA-2809][DataMap] Block rebuilding for bloo...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2594 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6148/ ---
[GitHub] carbondata issue #2594: [CARBONDATA-2809][DataMap] Block rebuilding for bloo...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2594 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6147/ ---
[GitHub] carbondata issue #2603: [Documentation] Editorial review comment fixed
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2603 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6146/ ---
[GitHub] carbondata issue #2537: [CARBONDATA-2768][CarbonStore] Fix error in tests fo...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2537 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6145/ ---
[GitHub] carbondata issue #2537: [CARBONDATA-2768][CarbonStore] Fix error in tests fo...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2537 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6144/ ---
[GitHub] carbondata issue #2590: [CARBONDATA-2750] Updated documentation on Local Dic...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2590 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6143/ ---
[GitHub] carbondata issue #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2589 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6142/ ---
[GitHub] carbondata issue #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2589 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6141/ ---
[GitHub] carbondata issue #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2589 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6140/ ---
[GitHub] carbondata issue #2601: [CARBONDATA-2804][DataMap] fix the bug when bloom fi...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2601 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6139/ ---
[GitHub] carbondata issue #2589: [WIP][CARBONSTORE] Refactor CarbonStore API
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2589 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6138/ ---
[jira] [Created] (CARBONDATA-2823) Alter table set local dictionary include after bloom creation and merge index on old V3 store fails throwing incorrect error
Chetan Bhat created CARBONDATA-2823:
---
Summary: Alter table set local dictionary include after bloom creation and merge index on old V3 store fails throwing incorrect error
Key: CARBONDATA-2823
URL: https://issues.apache.org/jira/browse/CARBONDATA-2823
Project: CarbonData
Issue Type: Bug
Components: data-query
Affects Versions: 1.4.1
Environment: Spark 2.1
Reporter: Chetan Bhat

Steps:

In old version V3 store create table and load data.

CREATE TABLE uniqdata_load (CUST_ID int,CUST_NAME String,ACTIVE_EMUI_VERSION string, DOB timestamp, DOJ timestamp, BIGINT_COLUMN1 bigint,BIGINT_COLUMN2 bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2 decimal(36,36),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1 int) STORED BY 'org.apache.carbondata.format';

LOAD DATA INPATH 'hdfs://hacluster/chetan/2000_UniqData.csv' into table uniqdata_load OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION,DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2,Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1');

In 1.4.1 version refresh the table of old V3 store.

refresh table uniqdata_load;

Create bloom filter and merge index.

CREATE DATAMAP dm_uniqdata1_tmstmp ON TABLE uniqdata_load USING 'bloomfilter' DMPROPERTIES ('INDEX_COLUMNS' = 'DOJ', 'BLOOM_SIZE'='64', 'BLOOM_FPP'='0.1');

Alter table set local dictionary include.

alter table uniqdata_load set tblproperties('local_dictionary_include'='CUST_NAME');

Issue: Alter table set local dictionary include fails with incorrect error.

0: jdbc:hive2://10.18.98.101:22550/default> alter table uniqdata_load set tblproperties('local_dictionary_include'='CUST_NAME');
*Error: org.apache.carbondata.common.exceptions.sql.MalformedCarbonCommandException: streaming is not supported for index datamap (state=,code=0)*

Expected: Operation should be success. If the operation is unsupported it should throw correct error message.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] carbondata issue #2606: [CARBONDATA-2817]Thread Leak in Update and in No sor...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2606 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6137/ ---
[GitHub] carbondata issue #2605: [CARBONDATA-2585] Fix local dictionary for both tabl...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2605 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6136/ ---
[GitHub] carbondata issue #2604: [CARBONDATA-2815][Doc] Add documentation for spillin...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2604 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6135/ ---
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2576 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6499/ ---
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2576 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7775/ ---
[GitHub] carbondata issue #2605: [CARBONDATA-2585] Fix local dictionary for both tabl...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2605 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6134/ ---
[GitHub] carbondata issue #2607: [CARBONDATA-2818] Presto Upgrade to 0.206
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2607 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7774/ ---
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2576 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7773/ ---
[GitHub] carbondata pull request #2603: [Documentation] Editorial review comment fixe...
Github user asfgit closed the pull request at: https://github.com/apache/carbondata/pull/2603 ---
[GitHub] carbondata issue #2603: [Documentation] Editorial review comment fixed
Github user kunal642 commented on the issue: https://github.com/apache/carbondata/pull/2603 LGTM ---
[GitHub] carbondata issue #2606: [CARBONDATA-2817]Thread Leak in Update and in No sor...
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2606 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6496/ ---
[GitHub] carbondata issue #2606: [CARBONDATA-2817]Thread Leak in Update and in No sor...
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2606 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7772/ ---
[GitHub] carbondata issue #2603: [Documentation] Editorial review comment fixed
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2603 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6493/ ---
[GitHub] carbondata issue #2603: [Documentation] Editorial review comment fixed
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2603 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7769/ ---
[GitHub] carbondata issue #2604: [CARBONDATA-2815][Doc] Add documentation for spillin...
Github user QiangCai commented on the issue: https://github.com/apache/carbondata/pull/2604 LGTM ---
[GitHub] carbondata issue #2590: [CARBONDATA-2750] Updated documentation on Local Dic...
Github user sraghunandan commented on the issue: https://github.com/apache/carbondata/pull/2590 Lgtm ---
[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...
Github user vandana7 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2568#discussion_r207519570
--- Diff: integration/presto/presto-integration-technical-note.md ---
@@ -0,0 +1,253 @@
+# Presto Integration Technical Note
+Presto Integration with Carbon data include the below steps:
+
+* Setting up Presto Cluster
+
+* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto.
+
+In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto.
+
+## **Let us begin with the first step of Presto Cluster Setup:**
+
+* ### Installing Presto
+
+  1. Download the 0.187 version of Presto using:
+     `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
+
+  2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`.
+
+  3. Download the Presto CLI for the coordinator and name it presto.
+
+  ```
+  wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar
+  mv presto-cli-0.187-executable.jar presto
+  chmod +x presto
+  ```
+
+### Create Configuration Files
+
+  1. Create `etc` folder in presto-server-0.187 directory.
+  2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files.
+  3. Install uuid to generate a node.id.
+
+  ```
+  sudo apt-get install uuid
+  uuid
+  ```
+
+# Contents of your node.properties file
+  ```
+  node.environment=production
+  node.id=
+  node.data-dir=/home/ubuntu/data
+  ```
+
+# Contents of your jvm.config file
+  ```
+  -server
+  -Xmx16G
+  -XX:+UseG1GC
+  -XX:G1HeapRegionSize=32M
+  -XX:+UseGCOverheadLimit
+  -XX:+ExplicitGCInvokesConcurrent
+  -XX:+HeapDumpOnOutOfMemoryError
+  -XX:OnOutOfMemoryError=kill -9 %p
+  ```
+
+# Contents of your log.properties file
+  ```
+  com.facebook.presto=INFO
+  ```
+
+The default minimum level is `INFO`. There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`.
+
+### Coordinator Configurations
+
+# Contents of your config.properties
+  ```
+  coordinator=true
+  node-scheduler.include-coordinator=false
+  http-server.http.port=8086
+  query.max-memory=50GB
+  query.max-memory-per-node=2GB
+  discovery-server.enabled=true
+  discovery.uri=:8086
+  ```
+The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers.
+
+**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`.
+
+Also relation between below two configuration-properties should be like:
+If, `query.max-memory-per-node=30GB`
+Then, `query.max-memory=<30GB * number of nodes>`.
+
+### Worker Configurations
+
+# Contents of your config.properties
+  ```
+  coordinator=false
+  http-server.http.port=8086
+  query.max-memory=50GB
+  query.max-memory-per-node=2GB
+  discovery.uri=:8086
+  ```
+
+**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id` (generated by uuid command).
+
+### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:**
+
+### Catalog Configurations
+
+1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator.
+
+# Configuring Carbondata in Presto
+1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes.
+
+### Add Plugins
+
+1. Create a directory named `carbondata` in plugin directory of presto.
+2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes.
+
+### Start Presto Server on all nodes
+
+```
+./presto-server-0.187/bin/launcher start
+```
+To run it as a background process.
+
+```
+./presto-server-0.187/bin/launcher run
+```
+To run it in foreground.
+
+### Start Presto CLI
+```
+./presto
+```
+To connect to carbondata catalog use the following command:
+
+```
+./presto --server :8086 --catalog carbondata --schema
+```
---
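The catalog step in the quoted note says to create `etc/catalog/carbondata.properties` and "set the required properties" without listing them. A minimal sketch of that file, assuming the property names used by the CarbonData Presto connector of this generation (`connector.name` and `carbondata-store`; the store path is a placeholder — verify both against your connector version):

```properties
# Loads the plugin from the plugin/carbondata directory created above
connector.name=carbondata
# HDFS location of the CarbonData store (assumed property name and example path)
carbondata-store=hdfs://namenode:8020/user/hive/warehouse/carbon.store
```

The same file must be present on every node (coordinator and workers), since each Presto node loads its catalogs independently at startup.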
[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...
Github user vandana7 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2568#discussion_r207517977
--- Diff: integration/presto/presto-integration-technical-note.md --- @@ -0,0 +1,253 @@ (diff context identical to the first such comment above; omitted as a verbatim duplicate)
---
[GitHub] carbondata issue #2603: [Documentation] Editorial review comment fixed
Github user chetandb commented on the issue: https://github.com/apache/carbondata/pull/2603 LGTM ---
[GitHub] carbondata pull request #2603: [Documentation] Editorial review comment fixe...
Github user sgururajshetty commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2603#discussion_r207516087
--- Diff: docs/configuration-parameters.md ---
@@ -140,7 +140,7 @@ This section provides the details of all the configurations required for CarbonD
 | carbon.enableMinMax | true | Min max is feature added to enhance query performance. To disable this feature, set it false. |
 | carbon.dynamicallocation.schedulertimeout | 5 | Specifies the maximum time (unit in seconds) the scheduler can wait for executor to be active. Minimum value is 5 sec and maximum value is 15 sec. |
 | carbon.scheduler.minregisteredresourcesratio | 0.8 | Specifies the minimum resource (executor) ratio needed for starting the block distribution. The default value is 0.8, which indicates 80% of the requested resource is allocated for starting block distribution. The minimum value is 0.1 min and the maximum value is 1.0. |
-| carbon.search.enabled | false | If set to true, it will use CarbonReader to do distributed scan directly instead of using compute framework like spark, thus avoiding limitation of compute framework like SQL optimizer and task scheduling overhead. |
+| carbon.search.enabled (Alpha Feature) | false | If set to true, it will use CarbonReader to do distributed scan directly instead of using compute framework like spark, thus avoiding limitation of compute framework like SQL optimizer and task scheduling overhead. |
 * **Global Dictionary Configurations**
--- End diff --
This issue is handled in a different PR #2576
---
[GitHub] carbondata issue #2590: [CARBONDATA-2750] Updated documentation on Local Dic...
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2590 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7767/ ---
[GitHub] carbondata pull request #2603: [Documentation] Editorial review comment fixe...
Github user sgururajshetty commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2603#discussion_r207516006
--- Diff: docs/configuration-parameters.md ---
@@ -140,7 +140,7 @@ This section provides the details of all the configurations required for CarbonD
 | carbon.enableMinMax | true | Min max is feature added to enhance query performance. To disable this feature, set it false. |
 | carbon.dynamicallocation.schedulertimeout | 5 | Specifies the maximum time (unit in seconds) the scheduler can wait for executor to be active. Minimum value is 5 sec and maximum value is 15 sec. |
 | carbon.scheduler.minregisteredresourcesratio | 0.8 | Specifies the minimum resource (executor) ratio needed for starting the block distribution. The default value is 0.8, which indicates 80% of the requested resource is allocated for starting block distribution. The minimum value is 0.1 min and the maximum value is 1.0. |
-| carbon.search.enabled | false | If set to true, it will use CarbonReader to do distributed scan directly instead of using compute framework like spark, thus avoiding limitation of compute framework like SQL optimizer and task scheduling overhead. |
+| carbon.search.enabled (Alpha Feature) | false | If set to true, it will use CarbonReader to do distributed scan directly instead of using compute framework like spark, thus avoiding limitation of compute framework like SQL optimizer and task scheduling overhead. |
 * **Global Dictionary Configurations**
--- End diff --
The minimum value need not be mentioned now
---
[GitHub] carbondata issue #2590: [CARBONDATA-2750] Updated documentation on Local Dic...
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2590 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6491/ ---
[jira] [Created] (CARBONDATA-2821) For non-lazy index datamap (index datamap that not specified as deferred rebuild), rebuilding is not skipped
Chetan Bhat created CARBONDATA-2821:
---
Summary: For non-lazy index datamap (index datamap that not specified as deferred rebuild), rebuilding is not skipped
Key: CARBONDATA-2821
URL: https://issues.apache.org/jira/browse/CARBONDATA-2821
Project: CarbonData
Issue Type: Bug
Components: data-query
Affects Versions: 1.4.1
Environment: Spark 2.1, Spark 2.2
Reporter: Chetan Bhat
Assignee: xuchuanyin

Steps: User creates a datamap on a table. User loads the data. User tries to rebuild the datamap.

Actual Issue: For non-lazy index datamap (index datamap that not specified as deferred rebuild), rebuilding is not skipped. As a result the rebuild datamap fails and throws error.

Expected: For non-lazy index datamap (index datamap that not specified as deferred rebuild), rebuilding can be skipped.
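The lazy vs. non-lazy distinction the issue describes shows up directly in the DDL: only datamaps created with `WITH DEFERRED REBUILD` are meant to be indexed via an explicit `REBUILD DATAMAP`. A sketch of both cases, with made-up table, column, and datamap names for illustration:

```sql
-- Lazy index datamap: created with deferred rebuild, so data is indexed
-- only when REBUILD DATAMAP is issued explicitly.
CREATE DATAMAP dm_lazy ON TABLE sales
  USING 'bloomfilter'
  WITH DEFERRED REBUILD
  DMPROPERTIES ('INDEX_COLUMNS' = 'city');

REBUILD DATAMAP dm_lazy ON TABLE sales;  -- valid for lazy datamaps

-- Non-lazy index datamap: kept up to date automatically on each load.
-- Per this issue, an explicit REBUILD on it should be skipped or blocked
-- with a clear message rather than failing with an error.
CREATE DATAMAP dm_auto ON TABLE sales
  USING 'bloomfilter'
  DMPROPERTIES ('INDEX_COLUMNS' = 'city');
```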
[GitHub] carbondata issue #2594: [CARBONDATA-2809][DataMap] Block rebuilding for bloo...
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2594 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7765/ ---
[GitHub] carbondata issue #2590: [CARBONDATA-2750] Updated documentation on Local Dic...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2590 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6132/ ---
[GitHub] carbondata issue #2594: [CARBONDATA-2809][DataMap] Block rebuilding for bloo...
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2594 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6489/ ---
[GitHub] carbondata issue #2576: [CARBONDATA-2795] Add documentation for S3
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2576 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6131/ ---
[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...
Github user vandana7 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2568#discussion_r207481959
--- Diff: integration/presto/presto-integration-technical-note.md --- @@ -0,0 +1,253 @@ (diff context identical to the first such comment above; omitted as a verbatim duplicate)
---
[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...
Github user vandana7 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2568#discussion_r207479703 --- Diff: integration/presto/presto-integration-technical-note.md --- @@ -0,0 +1,253 @@ + + +# Presto Integration Technical Note +Presto Integration with Carbon data include the below steps: + +* Setting up Presto Cluster + +* Setting up cluster to use carbondata as a catalog along with other catalogs provided by presto. + +In this technical note we will first learn about the above two points and after that we will see how we can do performance tuning with Presto. + +## **Let us begin with the first step of Presto Cluster Setup:** + + +* ### Installing Presto + + 1. Download the 0.187 version of Presto using: + `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz` + + 2. Extract Presto tar file: `tar zxvf presto-server-0.187.tar.gz`. + + 3. Download the Presto CLI for the coordinator and name it presto. + + ``` +wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.187/presto-cli-0.187-executable.jar + +mv presto-cli-0.187-executable.jar presto + +chmod +x presto + ``` + +### Create Configuration Files + + 1. Create `etc` folder in presto-server-0.187 directory. + 2. Create `config.properties`, `jvm.config`, `log.properties`, and `node.properties` files. + 3. Install uuid to generate a node.id. + + ``` + sudo apt-get install uuid + + uuid + ``` + + +# Contents of your node.properties file + + ``` + node.environment=production + node.id= + node.data-dir=/home/ubuntu/data + ``` + +# Contents of your jvm.config file + + ``` + -server + -Xmx16G + -XX:+UseG1GC + -XX:G1HeapRegionSize=32M + -XX:+UseGCOverheadLimit + -XX:+ExplicitGCInvokesConcurrent + -XX:+HeapDumpOnOutOfMemoryError + -XX:OnOutOfMemoryError=kill -9 %p + ``` + +# Contents of your log.properties file + ``` + com.facebook.presto=INFO + ``` + + The default minimum level is `INFO`. 
There are four levels: `DEBUG`, `INFO`, `WARN` and `ERROR`. + +### Coordinator Configurations + +# Contents of your config.properties + ``` + coordinator=true + node-scheduler.include-coordinator=false + http-server.http.port=8086 + query.max-memory=50GB + query.max-memory-per-node=2GB + discovery-server.enabled=true + discovery.uri=:8086 + ``` +The options `node-scheduler.include-coordinator=false` and `coordinator=true` indicate that the node is the coordinator and tells the coordinator not to do any of the computation work itself and to use the workers. + +**Note**: We recommend setting `query.max-memory-per-node` to half of the JVM config max memory, though if your workload is highly concurrent, you may want to use a lower value for `query.max-memory-per-node`. + +Also relation between below two configuration-properties should be like: +If, `query.max-memory-per-node=30GB` +Then, `query.max-memory=<30GB * number of nodes>`. + +### Worker Configurations + +# Contents of your config.properties + + ``` + coordinator=false + http-server.http.port=8086 + query.max-memory=50GB + query.max-memory-per-node=2GB + discovery.uri=:8086 + ``` + +**Note**: `jvm.config` and `node.properties` files are same for all the nodes (worker + coordinator). All the nodes should have different `node.id`.(generated by uuid command). + +### **With this we are ready with the Presto Cluster setup but to integrate with carbon data further steps are required which are as follows:** + +### Catalog Configurations + +1. Create a folder named `catalog` in etc directory of presto on all the nodes of the cluster including the coordinator. + +# Configuring Carbondata in Presto +1. Create a file named `carbondata.properties` in the `catalog` folder and set the required properties on all the nodes. + +### Add Plugins + +1. Create a directory named `carbondata` in plugin directory of presto. +2. Copy `carbondata` jars to `plugin/carbondata` directory on all nodes. 
+ +### Start Presto Server on all nodes + +To run it as a background process:

```
./presto-server-0.187/bin/launcher start
```

To run it in the foreground:

```
./presto-server-0.187/bin/launcher run
```

### Start Presto CLI
```
./presto
```
To connect to the carbondata catalog, use the following command:

```
./presto --server :8086 --catalog carbondata --schema 
```
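The catalog step above says to "set the required properties" in `carbondata.properties` without listing them. A minimal sketch of that file follows; the property names reflect the 0.187-era integration, and the store path is an assumption to be replaced with your actual CarbonData store location — verify both against the integration guide for your release:

```
connector.name=carbondata
carbondata-store=hdfs://<namenode-host>:<port>/user/hive/carbon.store
```

The same file must be present in `etc/catalog` on every node, coordinator included.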
[jira] [Created] (CARBONDATA-2820) Block rebuilding for preagg, bloom and lucene datamap
xuchuanyin created CARBONDATA-2820: -- Summary: Block rebuilding for preagg, bloom and lucene datamap Key: CARBONDATA-2820 URL: https://issues.apache.org/jira/browse/CARBONDATA-2820 Project: CarbonData Issue Type: Improvement Reporter: xuchuanyin Assignee: xuchuanyin Currently we will block rebuilding these datamaps. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (CARBONDATA-2819) cannot drop preagg datamap on table if the table has other index datamaps
lianganping created CARBONDATA-2819: --- Summary: cannot drop preagg datamap on table if the table has other index datamaps Key: CARBONDATA-2819 URL: https://issues.apache.org/jira/browse/CARBONDATA-2819 Project: CarbonData Issue Type: Improvement Affects Versions: 1.4.1 Reporter: lianganping 1. create table student_test(id int,name string,class_number int,male int,female int) stored by 'carbondata'; 2. create datamap dm1_preaggr_student_test ON TABLE student_test USING 'preaggregate' as select class_number,sum(male) from student_test group by class_number; 3. create datamap dm_lucene_student_test on table student_test using 'lucene' dmproperties('index_columns' = 'name'); 4. drop datamap dm1_preaggr_student_test on table student_test; and you will get this error: Error: org.apache.carbondata.common.exceptions.sql.NoSuchDataMapException: Datamap with name dm1_preaggr_student_test does not exist (state=,code=0) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...
Github user vandana7 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2568#discussion_r207475132 --- Diff: integration/presto/presto-integration-in-carbondata.md --- @@ -0,0 +1,134 @@ + + +# PRESTO INTEGRATION IN CARBONDATA + +1. [Document Purpose](#document-purpose) +1. [Purpose](#purpose) +1. [Scope](#scope) +1. [Definitions and Acronyms](#definitions-and-acronyms) +1. [Requirements addressed](#requirements-addressed) +1. [Design Considerations](#design-considerations) +1. [Row Iterator Implementation](#row-iterator-implementation) +1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach) +1. [Module Structure](#module-structure) +1. [Detailed design](#detailed-design) +1. [Modules](#modules) +1. [Functions Developed](#functions-developed) +1. [Integration Tests](#integration-tests) +1. [Tools and languages used](#tools-and-languages-used) +1. [References](#references) + +## Document Purpose + + * _Purpose_ + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData. + + Its main purpose is to - + * Provide the link between the Functional Requirement and the detailed Technical Design documents. + * Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design. + + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides on the CarbonData wiki page. As is true with any high-level design, this document will be updated and refined based on changing requirements. + * _Scope_ + Presto integration with CarbonData will allow execution of CarbonData queries on the Presto CLI. CarbonData can easily be added as a Data Source among the multiple heterogeneous data sources for Presto. + * _Definitions and Acronyms_ + **CarbonData:** CarbonData is a fully indexed columnar and Hadoop-native data store for processing heavy analytical workloads and detailed queries on big data. In customer benchmarks, CarbonData has proven to manage petabytes of data running on extraordinarily low-cost hardware, and answers queries around 10 times faster than the current open-source solutions (column-oriented SQL-on-Hadoop data stores). + + **Presto:** Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. + +## Requirements addressed +This integration of Presto mainly serves two purposes: + * Support of Apache CarbonData as a Data Source in Presto. + * Execution of Apache CarbonData queries on Presto. + +## Design Considerations +The following are the design considerations for the Presto integration with CarbonData. + + Row Iterator Implementation + + Presto provides a way to iterate over records through a RecordSetProvider, which creates a RecordCursor; so we have to extend these classes to create a CarbondataRecordSetProvider and CarbondataRecordCursor to read data from the Carbondata core module. The CarbondataRecordCursor utilizes the DictionaryBasedResultCollector class of the Core module to read data row by row. This approach has two drawbacks. + * Presto converts this row data into columnar data again. Since carbondata itself stores data in columnar format, we are adding an extra column-to-row-to-column conversion instead of using the columns directly. + * The cursor reads the data row by row instead of in batches, which is costly; as we already store the data in pages (batches), we could read those batches directly. + + ColumnarReaders or StreamReaders approach + + In this design we create StreamReaders that can read data from a Carbondata column based on its DataType and directly convert it into a Presto Block. This approach saves us the row-by-row processing and reduces the transformation and conversion of data. With this approach we can achieve the fastest read from Presto, creating a Presto Page by extending the PageSourceProvider and PageSource classes. This design is discussed in detail in the next sections of this document. + +## Module Structure + + +![module structure](../presto/images/module-structure.jpg?raw=true) + + + +## Detailed design + Modules + +Based on the above functionality, the Presto integration is implemented as the following module: + +1. **Presto** + +Integration of Presto with CarbonData includes an implementation of Presto's connector API. --- End diff -- done ---
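The row-versus-columnar trade-off discussed in the design note above can be illustrated with a self-contained sketch. Plain Java arrays stand in for CarbonData column pages and Presto Blocks, and the class and method names are illustrative only, not CarbonData or Presto APIs:

```java
import java.util.Arrays;

// Contrasts the two read paths: a row cursor that must re-extract each
// column value per row, versus handing over the stored column page directly.
public class ColumnarVsRowRead {

    // Approach 1 (RecordCursor style): data arrives row by row and the
    // engine rebuilds the column afterwards -- one extraction per row.
    static int[] readViaRowCursor(int[][] rows, int columnIndex) {
        int[] column = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            column[i] = rows[i][columnIndex];
        }
        return column;
    }

    // Approach 2 (StreamReader style): the columnar store already holds
    // the batch we want, so no per-row work is needed.
    static int[] readViaColumnPage(int[] columnPage) {
        return columnPage;
    }

    public static void main(String[] args) {
        int[][] rows = {{1, 10}, {2, 20}, {3, 30}};   // row-oriented view
        int[] columnPage = {10, 20, 30};              // same data, columnar
        int[] viaCursor = readViaRowCursor(rows, 1);
        int[] viaPage = readViaColumnPage(columnPage);
        System.out.println(Arrays.equals(viaCursor, viaPage)); // prints "true"
    }
}
```

Both paths yield the same column values; the second simply skips the per-row extraction and the later row-to-column reassembly, which is the saving the StreamReaders approach exploits.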
[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...
Github user vandana7 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2568#discussion_r207474334 --- Diff: integration/presto/presto-integration-in-carbondata.md --- @@ -0,0 +1,134 @@ + + +# PRESTO INTEGRATION IN CARBONDATA + +1. [Document Purpose](#document-purpose) +1. [Purpose](#purpose) +1. [Scope](#scope) +1. [Definitions and Acronyms](#definitions-and-acronyms) +1. [Requirements addressed](#requirements-addressed) +1. [Design Considerations](#design-considerations) +1. [Row Iterator Implementation](#row-iterator-implementation) +1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach) +1. [Module Structure](#module-structure) +1. [Detailed design](#detailed-design) +1. [Modules](#modules) +1. [Functions Developed](#functions-developed) +1. [Integration Tests](#integration-tests) +1. [Tools and languages used](#tools-and-languages-used) +1. [References](#references) + +## Document Purpose + + * _Purpose_ + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData. + + Its main purpose is to - + * Provide the link between the Functional Requirement and the detailed Technical Design documents. + * Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design. + + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page. As is true with any high level design, this document will be updated and refined based on changing requirements. + * _Scope_ + Presto Integration with CarbonData will allow execution of CarbonData queries on the Presto CLI. CarbonData can be added easily as a Data Source among the multiple heterogeneous data sources for Presto. --- End diff -- done. ---
[GitHub] carbondata issue #2603: [Documentation] Editorial review comment fixed
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2603 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7763/ ---
[jira] [Created] (CARBONDATA-2818) Migrate Presto Integration from 0.187 to 0.206
Bhavya Aggarwal created CARBONDATA-2818: --- Summary: Migrate Presto Integration from 0.187 to 0.206 Key: CARBONDATA-2818 URL: https://issues.apache.org/jira/browse/CARBONDATA-2818 Project: CarbonData Issue Type: Improvement Affects Versions: 1.4.2 Reporter: Bhavya Aggarwal Assignee: Bhavya Aggarwal Presto Integration Module migration to 0.206 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] carbondata issue #2603: [Documentation] Editorial review comment fixed
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2603 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6487/ ---
[GitHub] carbondata issue #2537: [CARBONDATA-2768][CarbonStore] Fix error in tests fo...
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2537 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7762/ ---
[GitHub] carbondata issue #2590: [CARBONDATA-2750] Updated documentation on Local Dic...
Github user ravipesala commented on the issue: https://github.com/apache/carbondata/pull/2590 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6130/ ---
[GitHub] carbondata issue #2537: [CARBONDATA-2768][CarbonStore] Fix error in tests fo...
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2537 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6486/ ---
[GitHub] carbondata issue #2603: [Documentation] Editorial review comment fixed
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2603 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7760/ ---
[GitHub] carbondata pull request #2568: [Presto-integration-Technical-note] created d...
Github user vandana7 commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2568#discussion_r207462575 --- Diff: integration/presto/presto-integration-in-carbondata.md --- @@ -0,0 +1,134 @@ + + +# PRESTO INTEGRATION IN CARBONDATA + +1. [Document Purpose](#document-purpose) +1. [Purpose](#purpose) +1. [Scope](#scope) +1. [Definitions and Acronyms](#definitions-and-acronyms) +1. [Requirements addressed](#requirements-addressed) +1. [Design Considerations](#design-considerations) +1. [Row Iterator Implementation](#row-iterator-implementation) +1. [ColumnarReaders or StreamReaders approach](#columnarreaders-or-streamreaders-approach) +1. [Module Structure](#module-structure) +1. [Detailed design](#detailed-design) +1. [Modules](#modules) +1. [Functions Developed](#functions-developed) +1. [Integration Tests](#integration-tests) +1. [Tools and languages used](#tools-and-languages-used) +1. [References](#references) + +## Document Purpose + + * _Purpose_ + The purpose of this document is to outline the technical design of the Presto Integration in CarbonData. + + Its main purpose is to - + * Provide the link between the Functional Requirement and the detailed Technical Design documents. + * Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design. + + This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in technology guides provided on CarbonData wiki page. As is true with any high level design, this document will be updated and refined based on changing requirements. --- End diff -- To make it clearer, I have linked the installation and configuration guide for integrating CarbonData with Presto from this document. Anyone who wants to know about installation and configuration can easily visit that document page. ---
[GitHub] carbondata pull request #2594: [CARBONDATA-2809][DataMap] Skip rebuilding fo...
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2594#discussion_r207462265 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/datamap/CarbonDataMapRebuildCommand.scala --- @@ -48,7 +50,17 @@ case class CarbonDataMapRebuildCommand( )(sparkSession) } val provider = DataMapManager.get().getDataMapProvider(table, schema, sparkSession) -provider.rebuild() +// for non-lazy index datamap, the data of datamap will be generated immediately after +// the datamap is created or the main table is loaded, so there is no need to +// rebuild this datamap. +if (!schema.isLazy && provider.isInstanceOf[IndexDataMapProvider]) { --- End diff -- OK. ---
[jira] [Resolved] (CARBONDATA-2804) Incorrect error message when bloom filter or preaggregate datamap tried to be created on older V1-V2 version stores
[ https://issues.apache.org/jira/browse/CARBONDATA-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuchuanyin resolved CARBONDATA-2804. Resolution: Fixed Assignee: wangsen Fix Version/s: 1.4.1 > Incorrect error message when bloom filter or preaggregate datamap tried to be > created on older V1-V2 version stores > --- > > Key: CARBONDATA-2804 > URL: https://issues.apache.org/jira/browse/CARBONDATA-2804 > Project: CarbonData > Issue Type: Bug > Components: data-query >Affects Versions: 1.4.1 > Environment: Spark 2.1 >Reporter: Chetan Bhat >Assignee: wangsen >Priority: Minor > Fix For: 1.4.1 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > Steps : > User creates a table with V1 version store and loads data to the table. > create table brinjal (imei string,AMSize string,channelsId > string,ActiveCountry string, Activecity string,gamePointId > double,deviceInformationId double,productionDate Timestamp,deliveryDate > timestamp,deliverycharge double) STORED BY 'org.apache.carbondata.format' > TBLPROPERTIES('table_blocksize'='1'); > LOAD DATA INPATH 'hdfs://hacluster/chetan/vardhandaterestruct.csv' INTO > TABLE brinjal OPTIONS('DELIMITER'=',', 'QUOTECHAR'= > '"','BAD_RECORDS_ACTION'='FORCE','FILEHEADER'= > 'imei,deviceInformationId,AMSize,channelsId,ActiveCountry,Activecity,gamePointId,productionDate,deliveryDate,deliverycharge'); > In 1.4.1 version user refreshes the table with V1 store and tries to create a > bloom filter datamap. > CREATE DATAMAP dm_brinjal ON TABLE brinjal2 USING 'bloomfilter' DMPROPERTIES > ('INDEX_COLUMNS' = 'AMSize', 'BLOOM_SIZE'='64', 'BLOOM_FPP'='0.1'); > create datamap brinjal_agg on table brinjal2 using 'preaggregate' as select > AMSize, avg(gamePointId) from brinjal group by gamePointId, AMSize; > Issue : Bloom filter or preaggregate datamap fails with incorrect error > message. 
> 0: jdbc:hive2://10.18.98.101:22550/default> CREATE DATAMAP dm_brinjal ON > TABLE brinjal2 USING 'bloomfilter' DMPROPERTIES ('INDEX_COLUMNS' = 'AMSize', > 'BLOOM_SIZE'='64', 'BLOOM_FPP'='0.1'); > Error: java.io.IOException: org.apache.thrift.protocol.TProtocolException: > Required field 'version' was not found in serialized data! Struct: > org.apache.carbondata.format.FileHeader$FileHeaderStandardScheme@4d5aa8b2 > (state=,code=0) > 0: jdbc:hive2://10.18.98.101:22550/default> create datamap brinjal_agg on > table brinjal2 using 'preaggregate' as select AMSize, avg(gamePointId) from > brinjal group by gamePointId, AMSize; > Error: java.io.IOException: org.apache.thrift.protocol.TProtocolException: > Required field 'version' was not found in serialized data! Struct: > org.apache.carbondata.format.FileHeader$FileHeaderStandardScheme@55d8323c > (state=,code=0) > Expected : Correct error message should be displayed when bloom filter or > preaggregate datamap creation is blocked/fails. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] carbondata pull request #2603: [Documentation] Editorial review comment fixe...
Github user chetandb commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2603#discussion_r207462001 --- Diff: docs/configuration-parameters.md --- @@ -140,7 +140,7 @@ This section provides the details of all the configurations required for CarbonD | carbon.enableMinMax | true | Min max is feature added to enhance query performance. To disable this feature, set it false. | | carbon.dynamicallocation.schedulertimeout | 5 | Specifies the maximum time (unit in seconds) the scheduler can wait for executor to be active. Minimum value is 5 sec and maximum value is 15 sec. | | carbon.scheduler.minregisteredresourcesratio | 0.8 | Specifies the minimum resource (executor) ratio needed for starting the block distribution. The default value is 0.8, which indicates 80% of the requested resource is allocated for starting block distribution. The minimum value is 0.1 min and the maximum value is 1.0. | -| carbon.search.enabled | false | If set to true, it will use CarbonReader to do distributed scan directly instead of using compute framework like spark, thus avoiding limitation of compute framework like SQL optimizer and task scheduling overhead. | +| carbon.search.enabled (Alpha Feature) | false | If set to true, it will use CarbonReader to do distributed scan directly instead of using compute framework like spark, thus avoiding limitation of compute framework like SQL optimizer and task scheduling overhead. | * **Global Dictionary Configurations** --- End diff -- In the Local Dictionary section the following updates need to be done. 1) Remove the line "44ad8fb40… Updated documentation on Local Dictionary Supoort |" on page 7 in the Local Dictionary Configuration section of the open-source PDF. 2) Change the description for "Local dictionary threshold" from "The maximum cardinality for local dictionary generation (maximum - 10)" to "The maximum cardinality for local dictionary generation (maximum value is 100000 and minimum value is 1000. If the 'local_dictionary_threshold' value is set below 1000 or above 100000, then it takes the default value, 10000)". ---
[GitHub] carbondata pull request #2601: [CARBONDATA-2804][DataMap] fix the bug when b...
Github user asfgit closed the pull request at: https://github.com/apache/carbondata/pull/2601 ---
[GitHub] carbondata pull request #2603: [Documentation] Editorial review comment fixe...
Github user chetandb commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2603#discussion_r207460915 --- Diff: docs/configuration-parameters.md --- @@ -140,7 +140,7 @@ This section provides the details of all the configurations required for CarbonD | carbon.enableMinMax | true | Min max is feature added to enhance query performance. To disable this feature, set it false. | | carbon.dynamicallocation.schedulertimeout | 5 | Specifies the maximum time (unit in seconds) the scheduler can wait for executor to be active. Minimum value is 5 sec and maximum value is 15 sec. | | carbon.scheduler.minregisteredresourcesratio | 0.8 | Specifies the minimum resource (executor) ratio needed for starting the block distribution. The default value is 0.8, which indicates 80% of the requested resource is allocated for starting block distribution. The minimum value is 0.1 min and the maximum value is 1.0. | -| carbon.search.enabled | false | If set to true, it will use CarbonReader to do distributed scan directly instead of using compute framework like spark, thus avoiding limitation of compute framework like SQL optimizer and task scheduling overhead. | +| carbon.search.enabled (Alpha Feature) | false | If set to true, it will use CarbonReader to do distributed scan directly instead of using compute framework like spark, thus avoiding limitation of compute framework like SQL optimizer and task scheduling overhead. | * **Global Dictionary Configurations** --- End diff -- In the S3 section: 1. There should not be any space in the parameter; it should be carbon.storelocation. 2. "Concurrent queries are not supported" should be changed to "Only concurrent put operations (data management operations like load, insert, update) are supported." 3. The line "Another way of setting the authentication parameters is as follows" should be removed. ---
[GitHub] carbondata pull request #2594: [CARBONDATA-2809][DataMap] Skip rebuilding fo...
Github user KanakaKumar commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2594#discussion_r207460748 --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/datamap/CarbonDataMapRebuildCommand.scala --- @@ -48,7 +50,17 @@ case class CarbonDataMapRebuildCommand( )(sparkSession) } val provider = DataMapManager.get().getDataMapProvider(table, schema, sparkSession) -provider.rebuild() +// for non-lazy index datamap, the data of datamap will be generated immediately after +// the datamap is created or the main table is loaded, so there is no need to +// rebuild this datamap. +if (!schema.isLazy && provider.isInstanceOf[IndexDataMapProvider]) { --- End diff -- Right now rebuild call on pre-aggregate DM ithrows "NoSuchDataMapException". Please handle to give correct message as pre-aggregate also rebuild is not required. ---
[GitHub] carbondata issue #2601: [CARBONDATA-2804][DataMap] fix the bug when bloom fi...
Github user xuchuanyin commented on the issue: https://github.com/apache/carbondata/pull/2601 LGTM ---
[GitHub] carbondata pull request #2603: [Documentation] Editorial review comment fixe...
Github user chetandb commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2603#discussion_r207459790 --- Diff: docs/sdk-guide.md --- @@ -351,7 +351,7 @@ public CarbonWriter buildWriterForCSVInput() throws IOException, InvalidLoadOpti * @throws IOException * @throws InvalidLoadOptionException */ -public CarbonWriter buildWriterForAvroInput() throws IOException, InvalidLoadOptionException; +public CarbonWriter buildWriterForAvroInput(org.apache.avro.Schema schema) throws IOException, InvalidLoadOptionException; ``` --- End diff -- The TestSdkJson example code needs to be corrected: testJsonSdkWriter should be static and IOException should be handled, as below.

import java.io.IOException;

import org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException;
import org.apache.carbondata.core.metadata.datatype.DataTypes;
import org.apache.carbondata.sdk.file.CarbonWriter;
import org.apache.carbondata.sdk.file.CarbonWriterBuilder;
import org.apache.carbondata.sdk.file.Field;
import org.apache.carbondata.sdk.file.Schema;

public class TestSdkJson {

    public static void main(String[] args) throws InvalidLoadOptionException, IOException {
        testJsonSdkWriter();
    }

    // static, and declares IOException so callers must handle it
    public static void testJsonSdkWriter() throws InvalidLoadOptionException, IOException {
        String path = "./target/testJsonSdkWriter";

        Field[] fields = new Field[2];
        fields[0] = new Field("name", DataTypes.STRING);
        fields[1] = new Field("age", DataTypes.INT);
        Schema carbonSchema = new Schema(fields);

        CarbonWriterBuilder builder = CarbonWriter.builder().outputPath(path);

        // initialize json writer with carbon schema
        CarbonWriter writer = builder.buildWriterForJsonInput(carbonSchema);
        // one row of json data as a String
        String jsonRow = "{\"name\":\"abcd\", \"age\":10}";
        int rows = 5;
        for (int i = 0; i < rows; i++) {
            writer.write(jsonRow);
        }
        writer.close();
    }
} ---
[GitHub] carbondata pull request #2598: [CARBONDATA-2811][BloomDataMap] Add query tes...
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2598#discussion_r207458031 --- Diff: integration/spark2/src/test/scala/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapSuite.scala --- @@ -219,6 +220,62 @@ class BloomCoarseGrainDataMapSuite extends QueryTest with BeforeAndAfterAll with sql(s"DROP TABLE IF EXISTS $bloomDMSampleTable") } + test("test using search mode to query tabel with bloom datamap") { +sql( + s""" + | CREATE TABLE $normalTable(id INT, name STRING, city STRING, age INT, + | s1 STRING, s2 STRING, s3 STRING, s4 STRING, s5 STRING, s6 STRING, s7 STRING, s8 STRING) + | STORED BY 'carbondata' TBLPROPERTIES('table_blocksize'='128') + | """.stripMargin) +sql( + s""" + | CREATE TABLE $bloomDMSampleTable(id INT, name STRING, city STRING, age INT, + | s1 STRING, s2 STRING, s3 STRING, s4 STRING, s5 STRING, s6 STRING, s7 STRING, s8 STRING) + | STORED BY 'carbondata' TBLPROPERTIES('table_blocksize'='128') + | """.stripMargin) + +// load two segments +(1 to 2).foreach { i => + sql( +s""" + | LOAD DATA LOCAL INPATH '$bigFile' INTO TABLE $normalTable + | OPTIONS('header'='false') + """.stripMargin) + sql( +s""" + | LOAD DATA LOCAL INPATH '$bigFile' INTO TABLE $bloomDMSampleTable + | OPTIONS('header'='false') + """.stripMargin) +} + +sql( + s""" + | CREATE DATAMAP $dataMapName ON TABLE $bloomDMSampleTable + | USING 'bloomfilter' + | DMProperties('INDEX_COLUMNS'='city,id', 'BLOOM_SIZE'='64') + """.stripMargin) + +checkExistence(sql(s"SHOW DATAMAP ON TABLE $bloomDMSampleTable"), true, dataMapName) + +// get answer before search mode is enable +val expectedAnswer1 = sql(s"select * from $normalTable where id = 1").collect() +val expectedAnswer2 = sql(s"select * from $normalTable where city in ('city_999')").collect() + +carbonSession.startSearchMode() +assert(carbonSession.isSearchModeEnabled) + +checkAnswer( --- End diff -- Question also for `LuceneFineGrainDataMapWithSearchModeSuite` If we use EXPLAIN 
command, it won't run in Search Mode. When we debug this test case, we can see that the query is pruned on the Master side of search mode using the `getSplit` method of CarbonTableInputFormat, which finally uses the datamap to prune. So that should be confirmed in another test case with the same table schema and data, and this test case should be taken as an extended test only for the Search Mode feature. This test case also does not care about whether the datamap is created before or after data load. ---
[GitHub] carbondata issue #2601: [CARBONDATA-2804][DataMap] fix the bug when bloom fi...
Github user manishgupta88 commented on the issue: https://github.com/apache/carbondata/pull/2601 LGTM ---
[GitHub] carbondata issue #2590: [CARBONDATA-2750] Updated documentation on Local Dic...
Github user CarbonDataQA commented on the issue: https://github.com/apache/carbondata/pull/2590 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6484/ ---
[jira] [Updated] (CARBONDATA-2816) MV Datamap - With the hive metastore disabled, MV is not working as expected.
[ https://issues.apache.org/jira/browse/CARBONDATA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasanna Ravichandran updated CARBONDATA-2816: -- Description: When the hive metastore is disabled(spark.carbon.hive.schema.store=false), then the below issues are seen. CARBONDATA-2534 CARBONDATA-2539 CARBONDATA-2576 was: When the hive metastore is disabled(spark.carbon.hive.schema.store=false), then the below issues are seen. CARBONDATA-2540 CARBONDATA-2539 CARBONDATA-2576 > MV Datamap - With the hive metastore disabled, MV is not working as expected. > - > > Key: CARBONDATA-2816 > URL: https://issues.apache.org/jira/browse/CARBONDATA-2816 > Project: CarbonData > Issue Type: Bug > Components: data-query >Reporter: Prasanna Ravichandran >Priority: Minor > Labels: MV > > When the hive metastore is disabled(spark.carbon.hive.schema.store=false), > then the below issues are seen. > CARBONDATA-2534 > CARBONDATA-2539 > CARBONDATA-2576 > -- This message was sent by Atlassian JIRA (v7.6.3#76005)