[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user asfgit closed the pull request at: https://github.com/apache/incubator-carbondata/pull/229 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83373827 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/AbstractDataLoadProcessorStep.java --- @@ -73,15 +72,15 @@ public AbstractDataLoadProcessorStep(CarbonDataLoadConfiguration configuration, * Create the iterator using child iterator. * * @param childIter - * @return + * @return new iterator with step specific processing. */ - protected Iterator getIterator(final Iterator childIter) { -return new CarbonIterator() { + protected Iterator getIterator(final Iterator childIter) { --- End diff -- ok. Added --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83354231 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/AbstractDataLoadProcessorStep.java --- @@ -73,15 +72,15 @@ public AbstractDataLoadProcessorStep(CarbonDataLoadConfiguration configuration, * Create the iterator using child iterator. * * @param childIter - * @return + * @return new iterator with step specific processing. */ - protected Iterator getIterator(final Iterator childIter) { -return new CarbonIterator() { + protected Iterator getIterator(final Iterator childIter) { --- End diff -- In order to support batch conversion, it is better to put `Iterator` instead --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83354325 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/AbstractDataLoadProcessorStep.java --- @@ -55,14 +54,14 @@ public AbstractDataLoadProcessorStep(CarbonDataLoadConfiguration configuration, public abstract void intialize() throws CarbonDataLoadingException; /** - * Tranform the data as per the implemetation. + * Tranform the data as per the implementation. * * @return Array of Iterator with data. It can be processed parallel if implementation class wants * @throws CarbonDataLoadingException */ - public Iterator[] execute() throws CarbonDataLoadingException { -Iterator[] childIters = child.execute(); -Iterator[] iterators = new Iterator[childIters.length]; + public Iterator[] execute() throws CarbonDataLoadingException { +Iterator[] childIters = child.execute(); +Iterator[] iterators = new Iterator[childIters.length]; --- End diff -- In order to support batch conversion, it is better to put `Iterator[]` instead --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83208986 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow.iterators; + +import java.io.IOException; + +import org.apache.carbondata.common.CarbonIterator; +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; + +import org.apache.hadoop.mapred.RecordReader; + +/** + * This iterator iterates RecordReader. + */ +public class RecordReaderIterator extends CarbonIterator { --- End diff -- Ok. I will remove it from this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83208961 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException; + +/** + * This base interface for data loading. It can do transformation jobs as per the implementation. + * + */ +public interface DataLoadProcessorStep { + + /** + * The output meta for this step. The data returns from this step is as per this meta. + * @return + */ + DataField[] getOutput(); + + /** + * Intialization process for this step. + * @param configuration + * @param child + * @throws CarbonDataLoadingException + */ + void intialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child) throws + CarbonDataLoadingException; + + /** + * Tranform the data as per the implemetation. + * @return Iterator of data + * @throws CarbonDataLoadingException + */ + Iterator execute() throws CarbonDataLoadingException; --- End diff -- I thought the SortStep is a singleton object within the executor, and if there are only one executor in one datanode, then the SortStep is sorting the data within datanode-scope, which is what we want. Synchronization means SortStep is thread-safe, so that multiple task can insert row into it. Does your desing look like this? Otherwise how you ensure data is sorting within datanode? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83147351 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow.iterators; + +import java.io.IOException; + +import org.apache.carbondata.common.CarbonIterator; +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; + +import org.apache.hadoop.mapred.RecordReader; + +/** + * This iterator iterates RecordReader. + */ +public class RecordReaderIterator extends CarbonIterator { --- End diff -- It is used for iterating RecordReader. I can move it to carbon-hadoop module but processing module need to be dependent on it. Already processing module is dependent on hadoop module so it becomes dependent on each other. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83147018 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException; + +/** + * This base interface for data loading. It can do transformation jobs as per the implementation. + * + */ +public interface DataLoadProcessorStep { + + /** + * The output meta for this step. The data returns from this step is as per this meta. + * @return + */ + DataField[] getOutput(); + + /** + * Intialization process for this step. + * @param configuration + * @param child + * @throws CarbonDataLoadingException + */ + void intialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child) throws + CarbonDataLoadingException; + + /** + * Tranform the data as per the implemetation. + * @return Iterator of data + * @throws CarbonDataLoadingException + */ + Iterator execute() throws CarbonDataLoadingException; + + /** + * Any closing of resources after step execution can be done here. + */ + void finish(); --- End diff -- Ok. I will add close method along with finish method --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83130319 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException; + +/** + * This base interface for data loading. It can do transformation jobs as per the implementation. + * + */ +public interface DataLoadProcessorStep { + + /** + * The output meta for this step. The data returns from this step is as per this meta. + * @return + */ + DataField[] getOutput(); + + /** + * Intialization process for this step. + * @param configuration + * @param child + * @throws CarbonDataLoadingException + */ + void intialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child) throws + CarbonDataLoadingException; + + /** + * Tranform the data as per the implemetation. + * @return Iterator of data + * @throws CarbonDataLoadingException + */ + Iterator execute() throws CarbonDataLoadingException; + + /** + * Any closing of resources after step execution can be done here. + */ + void finish(); --- End diff -- It should be called in both failure and success cases. So i will rename it to `close` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83130123 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException; + +/** + * This base interface for data loading. It can do transformation jobs as per the implementation. + * + */ +public interface DataLoadProcessorStep { --- End diff -- ok --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83129418 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java --- @@ -0,0 +1,185 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.core.carbon.AbsoluteTableIdentifier; + +public class CarbonDataLoadConfiguration { --- End diff -- Ok. I will keep the configuration but only keep important and global configurations I will keep in it, remaining configurations we can move to `Map` and keep it inside configuartion it self. what do you say? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83129008 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java --- @@ -0,0 +1,185 @@ +package org.apache.carbondata.processing.newflow; --- End diff -- ok --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83049479 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException; + +/** + * This base interface for data loading. It can do transformation jobs as per the implementation. + * + */ +public interface DataLoadProcessorStep { + + /** + * The output meta for this step. The data returns from this step is as per this meta. + * @return + */ + DataField[] getOutput(); + + /** + * Intialization process for this step. + * @param configuration + * @param child + * @throws CarbonDataLoadingException + */ + void intialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child) throws + CarbonDataLoadingException; + + /** + * Tranform the data as per the implemetation. + * @return Iterator of data + * @throws CarbonDataLoadingException + */ + Iterator execute() throws CarbonDataLoadingException; --- End diff -- For suppose if we are loading 50GB of csv files and each HDFS block size is 256MB then total number of partitions are 200. If we allow one task per partition then it would be 200 tasks. In carbondata one btree is created for each task. So if we allow all 200 tasks then it would be massively 200 btrees and it is not effective both in performance and memory wise. That is the reason why we pool multiple blocks per task in the current kettle implementation. And these blocks are processed parallely. We can take the same way and use iterator for each thread and returns array of iterator. What do you mean by datanode-scope sorting? how to synchronize between multiple tasks? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83032371 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException; + +/** + * This base interface for data loading. It can do transformation jobs as per the implementation. + * + */ +public interface DataLoadProcessorStep { + + /** + * The output meta for this step. The data returns from this step is as per this meta. + * @return + */ + DataField[] getOutput(); + + /** + * Intialization process for this step. + * @param configuration + * @param child + * @throws CarbonDataLoadingException + */ + void intialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child) throws --- End diff -- If there is a abstract class, it can have the child as its member variable, then this `initialize` function takes no parameter as input --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83031958 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException; + +/** + * This base interface for data loading. It can do transformation jobs as per the implementation. + * + */ +public interface DataLoadProcessorStep { --- End diff -- I think each implementation of this interface have similar logic in the execute function, can we create a abstract class to implement the common logic? The common logic like: ``` Iterator execute() throws CarbonDataLoadingException { Iterator childIter = child.execute(); return new Iterator { public boolean hasNext() { return childIter.hasNext(); } public Object[] next() { // processInput is the abstract func in this class return processInput(childItor.next()); } } } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83028336 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java --- @@ -0,0 +1,185 @@ +package org.apache.carbondata.processing.newflow; --- End diff -- add license header --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83033524 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow.iterators; + +import java.io.IOException; + +import org.apache.carbondata.common.CarbonIterator; +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; + +import org.apache.hadoop.mapred.RecordReader; + +/** + * This iterator iterates RecordReader. + */ +public class RecordReaderIterator extends CarbonIterator { --- End diff -- what is it used for? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83033746 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/CarbonArrayWritable.java --- @@ -0,0 +1,51 @@ +package org.apache.carbondata.processing.newflow.iterators; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; +import java.nio.charset.Charset; +import java.util.Arrays; + +import org.apache.hadoop.io.Writable; + +/** + * It is hadoop's writable value wrapper. + */ +public class CarbonArrayWritable implements Writable { --- End diff -- Why this is carbon-processing but not carbon-hadoop module? All hadoop related class should be in carbon-hadoop module --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83030022 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java --- @@ -0,0 +1,185 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.core.carbon.AbsoluteTableIdentifier; + +public class CarbonDataLoadConfiguration { --- End diff -- It seems this configuration is quite complex, I think it is because it contains configuration for all steps. Can we just have a simple `Map` as the configuration and let the `Step` decide what to keep in it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83033298 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow.iterators; + +import java.io.IOException; + +import org.apache.carbondata.common.CarbonIterator; +import org.apache.carbondata.common.logging.LogService; +import org.apache.carbondata.common.logging.LogServiceFactory; + +import org.apache.hadoop.mapred.RecordReader; + +/** + * This iterator iterates RecordReader. + */ +public class RecordReaderIterator extends CarbonIterator { --- End diff -- why this is carbon-processing but not carbon-hadoop module? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83032703 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException; + +/** + * This base interface for data loading. It can do transformation jobs as per the implementation. + * + */ +public interface DataLoadProcessorStep { + + /** + * The output meta for this step. The data returns from this step is as per this meta. + * @return + */ + DataField[] getOutput(); + + /** + * Intialization process for this step. + * @param configuration + * @param child + * @throws CarbonDataLoadingException + */ + void intialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child) throws + CarbonDataLoadingException; + + /** + * Tranform the data as per the implemetation. + * @return Iterator of data + * @throws CarbonDataLoadingException + */ + Iterator execute() throws CarbonDataLoadingException; + + /** + * Any closing of resources after step execution can be done here. + */ + void finish(); --- End diff -- This is called when the step successfully finished. But what about failure case, should there be a `void close();` interface for failure case? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user jackylk commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r83018479 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException; + +/** + * This base interface for data loading. It can do transformation jobs as per the implementation. + * + */ +public interface DataLoadProcessorStep { + + /** + * The output meta for this step. The data returns from this step is as per this meta. + * @return + */ + DataField[] getOutput(); + + /** + * Intialization process for this step. + * @param configuration + * @param child + * @throws CarbonDataLoadingException + */ + void intialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child) throws + CarbonDataLoadingException; + + /** + * Tranform the data as per the implemetation. + * @return Iterator of data + * @throws CarbonDataLoadingException + */ + Iterator execute() throws CarbonDataLoadingException; --- End diff -- I think `execute()` is called for every parallel unit of the input, right? For example, when using spark to load from dataframe, `execute()` is called for every spark partition (execute one task for one partition). When loading from CSV HDFS file, `execute()` is called for every HDFS block. So I do not think returning array of iterator is required. The loading process of carbon in every executor, some of the step can be parallelized, but sort step need to be synchronized (potential bottle net), since we need datanode-scope sorting. Am I correct? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/incubator-carbondata/pull/229#discussion_r82830846 --- Diff: processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java --- @@ -0,0 +1,40 @@ +package org.apache.carbondata.processing.newflow; + +import java.util.Iterator; + +import org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException; + +/** + * This base interface for data loading. It can do transformation jobs as per the implementation. + * + */ +public interface DataLoadProcessorStep { + + /** + * The output meta for this step. The data returns from this step is as per this meta. + * @return + */ + DataField[] getOutput(); + + /** + * Intialization process for this step. + * @param configuration + * @param child + * @throws CarbonDataLoadingException + */ + void intialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child) throws + CarbonDataLoadingException; + + /** + * Tranform the data as per the implemetation. + * @return Iterator of data + * @throws CarbonDataLoadingException + */ + Iterator execute() throws CarbonDataLoadingException; --- End diff -- I think it is better to return array of Iterator. Like `Iterator[] execute()` So that multiple blocks can be parallely processed to improve the performance. Other wise lot of btrees need to be created at the end of data loading. @jackylk what do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...
GitHub user ravipesala opened a pull request: https://github.com/apache/incubator-carbondata/pull/229 [CARBONDATA-297]Added interface for carbon data loading Interface DataLoadProcessorStep and related supported classes are added in this PR. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ravipesala/incubator-carbondata interfaces-dataloading Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-carbondata/pull/229.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #229 commit 5c018a9749d4d0565598bb9cd0beb2b4d9f203b8 Author: ravipesala Date: 2016-10-11T16:03:48Z Added interface for carbon data loading --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---