[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-carbondata/pull/229


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-14 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83373827
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/AbstractDataLoadProcessorStep.java
 ---
@@ -73,15 +72,15 @@ public 
AbstractDataLoadProcessorStep(CarbonDataLoadConfiguration configuration,
* Create the iterator using child iterator.
*
* @param childIter
-   * @return
+   * @return new iterator with step specific processing.
*/
-  protected Iterator getIterator(final Iterator 
childIter) {
-return new CarbonIterator() {
+  protected Iterator getIterator(final Iterator 
childIter) {
--- End diff --

ok. Added


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-13 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83354231
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/AbstractDataLoadProcessorStep.java
 ---
@@ -73,15 +72,15 @@ public 
AbstractDataLoadProcessorStep(CarbonDataLoadConfiguration configuration,
* Create the iterator using child iterator.
*
* @param childIter
-   * @return
+   * @return new iterator with step specific processing.
*/
-  protected Iterator getIterator(final Iterator 
childIter) {
-return new CarbonIterator() {
+  protected Iterator getIterator(final Iterator 
childIter) {
--- End diff --

In order to support batch conversion, it is better to put 
`Iterator` instead


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-13 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83354325
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/AbstractDataLoadProcessorStep.java
 ---
@@ -55,14 +54,14 @@ public 
AbstractDataLoadProcessorStep(CarbonDataLoadConfiguration configuration,
   public abstract void intialize() throws CarbonDataLoadingException;
 
   /**
-   * Tranform the data as per the implemetation.
+   * Tranform the data as per the implementation.
*
* @return Array of Iterator with data. It can be processed parallel if 
implementation class wants
* @throws CarbonDataLoadingException
*/
-  public Iterator[] execute() throws CarbonDataLoadingException {
-Iterator[] childIters = child.execute();
-Iterator[] iterators = new Iterator[childIters.length];
+  public Iterator[] execute() throws CarbonDataLoadingException 
{
+Iterator[] childIters = child.execute();
+Iterator[] iterators = new Iterator[childIters.length];
--- End diff --

In order to support batch conversion, it is better to put 
`Iterator[]` instead


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-13 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83208986
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow.iterators;
+
+import java.io.IOException;
+
+import org.apache.carbondata.common.CarbonIterator;
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+
+import org.apache.hadoop.mapred.RecordReader;
+
+/**
+ * This iterator iterates RecordReader.
+ */
+public class RecordReaderIterator extends CarbonIterator {
--- End diff --

Ok. I will remove it from this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-13 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83208961
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
+  CarbonDataLoadingException;
+
+  /**
+   * Tranform the data as per the implemetation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
--- End diff --

I thought the SortStep is a singleton object within the executor, and if 
there are only one executor in one datanode, then the SortStep is sorting the 
data within datanode-scope, which is what we want. Synchronization means 
SortStep is thread-safe, so that multiple task can insert row into it. 
Does your desing look like this? Otherwise how you ensure data is sorting 
within datanode?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83147351
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow.iterators;
+
+import java.io.IOException;
+
+import org.apache.carbondata.common.CarbonIterator;
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+
+import org.apache.hadoop.mapred.RecordReader;
+
+/**
+ * This iterator iterates RecordReader.
+ */
+public class RecordReaderIterator extends CarbonIterator {
--- End diff --

It is used for iterating RecordReader. I can move it to carbon-hadoop 
module but processing module need to be dependent on it. Already processing 
module is dependent on hadoop module so it becomes dependent on each other.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83147018
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
+  CarbonDataLoadingException;
+
+  /**
+   * Tranform the data as per the implemetation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
+
+  /**
+   * Any closing of resources after step execution can be done here.
+   */
+  void finish();
--- End diff --

Ok. I will add close method along with finish method


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83130319
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
+  CarbonDataLoadingException;
+
+  /**
+   * Tranform the data as per the implemetation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
+
+  /**
+   * Any closing of resources after step execution can be done here.
+   */
+  void finish();
--- End diff --

It should be called in both failure and success cases. So i will rename it 
to `close`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83130123
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
--- End diff --

ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83129418
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java
 ---
@@ -0,0 +1,185 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import org.apache.carbondata.core.carbon.AbsoluteTableIdentifier;
+
+public class CarbonDataLoadConfiguration {
--- End diff --

Ok. I will keep the configuration but only keep important and global 
configurations I will keep in it, remaining configurations we can move to `Map` 
and keep it inside configuartion it self.  what do you say?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83129008
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java
 ---
@@ -0,0 +1,185 @@
+package org.apache.carbondata.processing.newflow;
--- End diff --

ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83049479
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
+  CarbonDataLoadingException;
+
+  /**
+   * Tranform the data as per the implemetation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
--- End diff --

For suppose if we are loading 50GB of csv files and each HDFS block size is 
256MB then total number of partitions are 200. If we allow one task per 
partition then it would be 200 tasks. In carbondata one btree is created for 
each task. So if we allow all 200 tasks then it would be massively 200 btrees 
and it is not effective both in performance and memory wise. 
That is the reason why we pool multiple blocks per task in the current 
kettle implementation. And these blocks are processed parallely. We can take 
the same way and use iterator for each thread and returns array of iterator.

What do you mean by datanode-scope sorting? how to synchronize between 
multiple tasks?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83032371
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
--- End diff --

If there is a abstract class, it can have the child as its member variable, 
then this `initialize` function takes no parameter as input


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83031958
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
--- End diff --

I think each implementation of this interface have similar logic in the 
execute function, can we create a abstract class to implement the common logic? 
The common logic like:
```
Iterator execute() throws CarbonDataLoadingException {
Iterator childIter = child.execute();
return new Iterator {
  public boolean hasNext() { 
return childIter.hasNext();
  }
  public Object[] next() {
// processInput is the abstract func in this class
return processInput(childItor.next());
  }
 }
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83028336
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java
 ---
@@ -0,0 +1,185 @@
+package org.apache.carbondata.processing.newflow;
--- End diff --

add license header


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83033524
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow.iterators;
+
+import java.io.IOException;
+
+import org.apache.carbondata.common.CarbonIterator;
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+
+import org.apache.hadoop.mapred.RecordReader;
+
+/**
+ * This iterator iterates RecordReader.
+ */
+public class RecordReaderIterator extends CarbonIterator {
--- End diff --

what is it used for?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83033746
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/CarbonArrayWritable.java
 ---
@@ -0,0 +1,51 @@
+package org.apache.carbondata.processing.newflow.iterators;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.charset.Charset;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Writable;
+
+/**
+ * It is hadoop's writable value wrapper.
+ */
+public class CarbonArrayWritable implements Writable {
--- End diff --

Why this is carbon-processing but not carbon-hadoop module? All hadoop 
related class should be in carbon-hadoop module


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83030022
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/CarbonDataLoadConfiguration.java
 ---
@@ -0,0 +1,185 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import org.apache.carbondata.core.carbon.AbsoluteTableIdentifier;
+
+public class CarbonDataLoadConfiguration {
--- End diff --

It seems this configuration is quite complex, I think it is because it 
contains configuration for all steps. 
Can we just have a simple `Map` as the configuration and let the `Step` 
decide what to keep in it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83033298
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/iterators/RecordReaderIterator.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow.iterators;
+
+import java.io.IOException;
+
+import org.apache.carbondata.common.CarbonIterator;
+import org.apache.carbondata.common.logging.LogService;
+import org.apache.carbondata.common.logging.LogServiceFactory;
+
+import org.apache.hadoop.mapred.RecordReader;
+
+/**
+ * This iterator iterates RecordReader.
+ */
+public class RecordReaderIterator extends CarbonIterator {
--- End diff --

why this is carbon-processing but not carbon-hadoop module?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83032703
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
+  CarbonDataLoadingException;
+
+  /**
+   * Tranform the data as per the implemetation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
+
+  /**
+   * Any closing of resources after step execution can be done here.
+   */
+  void finish();
--- End diff --

This is called when the step successfully finished. But what about failure 
case, should there be a 
`void close();` 
interface for failure case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-12 Thread jackylk
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83018479
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
+  CarbonDataLoadingException;
+
+  /**
+   * Tranform the data as per the implemetation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
--- End diff --

I think `execute()` is called for every parallel unit of the input, right? 
For example, when using spark to load from dataframe, `execute()` is called for 
every spark partition (execute one task for one partition). When loading from 
CSV HDFS file, `execute()` is called for every HDFS block. So I do not think 
returning array of iterator is required.

The loading process of carbon in every executor, some of the step can be 
parallelized, but sort step need to be synchronized (potential bottle net), 
since we need datanode-scope sorting. Am I correct?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-11 Thread ravipesala
Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r82830846
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This base interface for data loading. It can do transformation jobs as 
per the implementation.
+ *
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returns from this step is as 
per this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Intialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, 
DataLoadProcessorStep child) throws
+  CarbonDataLoadingException;
+
+  /**
+   * Tranform the data as per the implemetation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
--- End diff --

I think it is better to return array of  Iterator. Like 
`Iterator[] execute()`
So that multiple blocks can be parallely processed to improve the 
performance. Other wise lot of btrees need to be created at the end of data 
loading.
@jackylk what do you think? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #229: [CARBONDATA-297]Added interface for ...

2016-10-11 Thread ravipesala
GitHub user ravipesala opened a pull request:

https://github.com/apache/incubator-carbondata/pull/229

[CARBONDATA-297]Added interface for carbon data loading

Interface DataLoadProcessorStep and related supported classes are added in 
this PR.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ravipesala/incubator-carbondata 
interfaces-dataloading

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/229.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #229


commit 5c018a9749d4d0565598bb9cd0beb2b4d9f203b8
Author: ravipesala 
Date:   2016-10-11T16:03:48Z

Added interface for carbon data loading




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---