[jira] [Updated] (CARBONDATA-4169) Secondary Index coarse grain datamap to work for existing Secondary Index tables

2021-04-27 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4169:
--
Attachment: Leverage Secondary Indexes with Presto Queries.pdf

> Secondary Index coarse grain datamap to work for existing Secondary Index 
> tables
> 
>
> Key: CARBONDATA-4169
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4169
> Project: CarbonData
>  Issue Type: Sub-task
>  Components: core, spark-integration
>Reporter: Venugopal Reddy K
>Priority: Minor
> Attachments: Leverage Secondary Indexes with Presto Queries.pdf
>
>
> With [PR 4110|https://github.com/apache/carbondata/pull/4110], the user has to 
> delete all segments from an existing secondary index table created with an older 
> version and reload/sync all segments again to use them in the secondary index 
> coarse grain datamap. This restriction exists because the PR expects the 
> CarbonCommonConstants.INDEX_STATUS property of the index to be set to 
> IndexStatus.ENABLED in the index metadata when the main table and the respective 
> secondary index table segments are in sync, and IndexChooser picks only 
> IndexStatus.ENABLED indexes during datamap pruning. The same applies to bloom and 
> lucene indexes.
> But this property was not set for secondary indexes in older versions, as they 
> were not part of datamap pruning.
>  
> To support secondary index coarse grain datamap pruning for existing secondary 
> indexes, we need a way for this to happen automatically.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-4169) Secondary Index coarse grain datamap to work for existing Secondary Index tables

2021-04-21 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4169:
--
Description: 
With [PR 4110|https://github.com/apache/carbondata/pull/4110], the user has to delete 
all segments from an existing secondary index table created with an older version and 
reload/sync all segments again to use them in the secondary index coarse grain 
datamap. This restriction exists because the PR expects the 
CarbonCommonConstants.INDEX_STATUS property of the index to be set to 
IndexStatus.ENABLED in the index metadata when the main table and the respective 
secondary index table segments are in sync, and IndexChooser picks only 
IndexStatus.ENABLED indexes during datamap pruning. The same applies to bloom and 
lucene indexes.

But this property was not set for secondary indexes in older versions, as they were 
not part of datamap pruning.

To support secondary index coarse grain datamap pruning for existing secondary 
indexes, we need a way for this to happen automatically.

 

  was:
With [[https://github.com/apache/carbondata/pull/4110],] user has to delete all 
segments from the older version of existing secondary index table and 
reload/sync all segments again to use them in the seconday index coarse grain 
datamap.  This restriction is due to the fact that PR expects 
CarbonCommonConstants.INDEX_STATUS property for the index to be set to 
IndexStatus.ENABLED in the index metadata when main table and respective 
secondary index table segments are in sync. And IndexChooser picks only 
IndexStatus.ENABLED indexes in the datamap pruning. It applies to bloom, lucene 
as well.

But this property was not set for the secondary indexes in the older versions 
as they were not part of datamap pruning.

 

To support the secondary index coarse grain datamap pruning for existing 
secondary indexes, need to support a way it happens automatically.

 


> Secondary Index coarse grain datamap to work for existing Secondary Index 
> tables
> 
>
> Key: CARBONDATA-4169
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4169
> Project: CarbonData
>  Issue Type: Sub-task
>  Components: core, spark-integration
>Reporter: Venugopal Reddy K
>Priority: Minor
>
> With [PR 4110|https://github.com/apache/carbondata/pull/4110], the user has to 
> delete all segments from an existing secondary index table created with an older 
> version and reload/sync all segments again to use them in the secondary index 
> coarse grain datamap. This restriction exists because the PR expects the 
> CarbonCommonConstants.INDEX_STATUS property of the index to be set to 
> IndexStatus.ENABLED in the index metadata when the main table and the respective 
> secondary index table segments are in sync, and IndexChooser picks only 
> IndexStatus.ENABLED indexes during datamap pruning. The same applies to bloom and 
> lucene indexes.
> But this property was not set for secondary indexes in older versions, as they 
> were not part of datamap pruning.
>  
> To support secondary index coarse grain datamap pruning for existing secondary 
> indexes, we need a way for this to happen automatically.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-4169) Secondary Index coarse grain datamap to work for existing Secondary Index tables

2021-04-21 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4169:
--
Description: 
With [PR 4110|https://github.com/apache/carbondata/pull/4110], the user has to delete 
all segments from an existing secondary index table created with an older version and 
reload/sync all segments again to use them in the secondary index coarse grain 
datamap. This restriction exists because the PR expects the 
CarbonCommonConstants.INDEX_STATUS property of the index to be set to 
IndexStatus.ENABLED in the index metadata when the main table and the respective 
secondary index table segments are in sync, and IndexChooser picks only 
IndexStatus.ENABLED indexes during datamap pruning. The same applies to bloom and 
lucene indexes.

But this property was not set for secondary indexes in older versions, as they were 
not part of datamap pruning.

To support secondary index coarse grain datamap pruning for existing secondary 
indexes, we need a way for this to happen automatically.

 

  was:
With [PR 4110|[https://github.com/apache/carbondata/pull/4110],] user has to 
delete all segments from the older version of existing secondary index table 
and reload/sync all segments again to use them in the seconday index coarse 
grain datamap.  This restriction is due to the fact that PR expects 
CarbonCommonConstants.INDEX_STATUS property for the index to be set to 
IndexStatus.ENABLED in the index metadata when main table and respective 
secondary index table segments are in sync. And IndexChooser picks only 
IndexStatus.ENABLED indexes in the datamap pruning. It applies to bloom, lucene 
as well.

But this property was not set for the secondary indexes in the older versions 
as they were not part of datamap pruning.

 

To support the secondary index coarse grain datamap pruning for existing 
secondary indexes, need to support a way it happens automatically.

 


> Secondary Index coarse grain datamap to work for existing Secondary Index 
> tables
> 
>
> Key: CARBONDATA-4169
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4169
> Project: CarbonData
>  Issue Type: Sub-task
>  Components: core, spark-integration
>Reporter: Venugopal Reddy K
>Priority: Minor
>
> With [PR 4110|https://github.com/apache/carbondata/pull/4110], the user has to 
> delete all segments from an existing secondary index table created with an older 
> version and reload/sync all segments again to use them in the secondary index 
> coarse grain datamap. This restriction exists because the PR expects the 
> CarbonCommonConstants.INDEX_STATUS property of the index to be set to 
> IndexStatus.ENABLED in the index metadata when the main table and the respective 
> secondary index table segments are in sync, and IndexChooser picks only 
> IndexStatus.ENABLED indexes during datamap pruning. The same applies to bloom and 
> lucene indexes.
> But this property was not set for secondary indexes in older versions, as they 
> were not part of datamap pruning.
>  
> To support secondary index coarse grain datamap pruning for existing secondary 
> indexes, we need a way for this to happen automatically.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4169) Secondary Index coarse grain datamap to work for existing Secondary Index tables

2021-04-21 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-4169:
-

 Summary: Secondary Index coarse grain datamap to work for existing 
Secondary Index tables
 Key: CARBONDATA-4169
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4169
 Project: CarbonData
  Issue Type: Sub-task
  Components: core, spark-integration
Reporter: Venugopal Reddy K


With [PR 4110|https://github.com/apache/carbondata/pull/4110], the user has to delete 
all segments from an existing secondary index table created with an older version and 
reload/sync all segments again to use them in the secondary index coarse grain 
datamap. This restriction exists because the PR expects the 
CarbonCommonConstants.INDEX_STATUS property of the index to be set to 
IndexStatus.ENABLED in the index metadata when the main table and the respective 
secondary index table segments are in sync, and IndexChooser picks only 
IndexStatus.ENABLED indexes during datamap pruning. The same applies to bloom and 
lucene indexes.

But this property was not set for secondary indexes in older versions, as they were 
not part of datamap pruning.

To support secondary index coarse grain datamap pruning for existing secondary 
indexes, we need a way for this to happen automatically.
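As a rough illustration of what "happen automatically" could look like, the sketch 
below derives the missing status the first time an older SI table is considered for 
pruning: if the main table and secondary index table segments are in sync, 
INDEX_STATUS is set to ENABLED so that IndexChooser can pick the index. Only 
CarbonCommonConstants.INDEX_STATUS and IndexStatus.ENABLED come from this issue; the 
class, method and parameter names below are hypothetical placeholders, not the actual 
CarbonData API.
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: plain string keys/values stand in for
// CarbonCommonConstants.INDEX_STATUS and the IndexStatus enum.
public final class LegacyIndexStatusUpgrader {

  public static void upgradeLegacyIndexStatus(Map<String, String> indexProperties,
      List<String> mainTableSegments, List<String> indexTableSegments) {
    if (indexProperties.containsKey("INDEX_STATUS")) {
      return; // already written by a newer version, nothing to do
    }
    // Older SI tables never got the property; derive it from the segment state:
    // enable the index only when every main table segment exists in the SI table.
    boolean inSync = new HashSet<>(indexTableSegments).containsAll(mainTableSegments);
    indexProperties.put("INDEX_STATUS", inSync ? "ENABLED" : "DISABLED");
  }
}
{code}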

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-4158) Make Secondary Index as a coarse grain datamap and use secondary indexes for Presto queries

2021-03-30 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4158:
--
Description: 
*Background:*

Secondary Indexes are created as carbon tables and are managed as child tables of the 
main table. These indexes are leveraged for query pruning via Spark plan modification 
during the optimizer/execution phases of query execution. To make use of Secondary 
Indexes for queries from engines other than Spark, such as Presto, it is not feasible 
to modify the engine-specific query execution plans as the current approach requires. 
This makes Secondary Indexes unusable for Presto query pruning. Hence, an 
engine-agnostic approach is needed to use Secondary Indexes for Presto queries.

*Description:*

Current Secondary Index pruning is tightly coupled with Spark because the query plan 
modification is specific to the Spark engine, which makes the solution hard to reuse 
for Presto queries. A new solution is needed to use secondary indexes with Presto 
queries, and it shouldn't affect existing customers using secondary indexes with 
Spark.

  was:
*Background:*

Secondary Indexes are created as carbon tables and are managed as child tables 
to the main table. And these indexes are leveraged for query pruning via spark 
plan modification during optimizer/execution phases of query execution. In 
order to make use of Secondary Indexes for queries from engines other than 
spark like presto etc, it is not feasible to modify the engine specific query 
execution plans as we desire in the current approach. It makes Secondary 
Indexes not usable for presto query pruning. Thus need arises for an engine 
agnostic approach to use Secondary Indexes for presto queries.

*Description:*

                          Current Secondary Index pruning is tightly coupled 
with spark because the query plan modification is specific to the spark engine. 
It is hard to reuse the solution for presto queries. Need a new solution to use 
secondary indexes with Presto queries. And it  shouldn’t affect the existing 
customer using secondary index with spark.


> Make Secondary Index as a coarse grain datamap and use secondary indexes for 
> Presto queries
> ---
>
> Key: CARBONDATA-4158
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4158
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Minor
>
> *Background:*
> Secondary Indexes are created as carbon tables and are managed as child tables of 
> the main table. These indexes are leveraged for query pruning via Spark plan 
> modification during the optimizer/execution phases of query execution. To make use 
> of Secondary Indexes for queries from engines other than Spark, such as Presto, it 
> is not feasible to modify the engine-specific query execution plans as the current 
> approach requires. This makes Secondary Indexes unusable for Presto query pruning. 
> Hence, an engine-agnostic approach is needed to use Secondary Indexes for Presto 
> queries.
> *Description:*
> Current Secondary Index pruning is tightly coupled with Spark because the query 
> plan modification is specific to the Spark engine, which makes the solution hard 
> to reuse for Presto queries. A new solution is needed to use secondary indexes 
> with Presto queries, and it shouldn't affect existing customers using secondary 
> indexes with Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4158) Make Secondary Index as a coarse grain datamap and use secondary indexes for Presto queries

2021-03-30 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-4158:
-

 Summary: Make Secondary Index as a coarse grain datamap and use 
secondary indexes for Presto queries
 Key: CARBONDATA-4158
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4158
 Project: CarbonData
  Issue Type: New Feature
Reporter: Venugopal Reddy K


*Background:*

Secondary Indexes are created as carbon tables and are managed as child tables of the 
main table. These indexes are leveraged for query pruning via Spark plan modification 
during the optimizer/execution phases of query execution. To make use of Secondary 
Indexes for queries from engines other than Spark, such as Presto, it is not feasible 
to modify the engine-specific query execution plans as the current approach requires. 
This makes Secondary Indexes unusable for Presto query pruning. Hence, an 
engine-agnostic approach is needed to use Secondary Indexes for Presto queries.

*Description:*

Current Secondary Index pruning is tightly coupled with Spark because the query plan 
modification is specific to the Spark engine, which makes the solution hard to reuse 
for Presto queries. A new solution is needed to use secondary indexes with Presto 
queries, and it shouldn't affect existing customers using secondary indexes with 
Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (CARBONDATA-4008) IN filter on date column is returning 0 results when 'carbon.push.rowfilters.for.vector' is true

2020-12-28 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K resolved CARBONDATA-4008.
---
Resolution: Fixed

It is fixed with [PR 3953|https://github.com/apache/carbondata/pull/3953]

> IN filter on date column is returning 0 results when 
> 'carbon.push.rowfilters.for.vector' is true
> 
>
> Key: CARBONDATA-4008
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4008
> Project: CarbonData
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
> Fix For: 2.1.1
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> *Issue:*
> An IN filter with a date column in the condition returns 0 results when 
> 'carbon.push.rowfilters.for.vector' is set to true.
>  
> *Steps to reproduce:*
> sql("set carbon.push.rowfilters.for.vector=true")
> sql("create table test_table(i int, dt date, ts timestamp) stored as 
> carbondata")
> sql("insert into test_table select 1, '2020-03-30', '2020-03-30 10:00:00'")
> sql("insert into test_table select 2, '2020-07-04', '2020-07-04 14:12:15'")
> sql("insert into test_table select 3, '2020-09-23', '2020-09-23 12:30:45'")
> sql("select * from test_table where dt IN ('2020-03-30', 
> '2020-09-23')").show()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-4050) TPC-DS queries performance degraded when compared to older versions due to redundant getFileStatus() invocations

2020-11-18 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4050:
--
Description: 
*Issue:*

In the createCarbonDataFileBlockMetaInfoMapping method, we get the list of carbondata 
files in the segment, loop through all the carbon files and build the 
fileNameToMetaInfoMapping map.

In that loop, if the file is of AbstractDFSCarbonFile type, we fetch the 
org.apache.hadoop.fs.FileStatus three times for each file. The method that gets the 
file status is an RPC call (fileSystem.getFileStatus(path)) and takes ~2ms per call in 
the cluster, incurring an overhead of ~6ms per file. So the overall driver-side query 
processing time increases significantly when there are many carbon files, which caused 
the TPC-DS query performance degradation.

The methods/calls that fetch the file status for each carbon file in the loop are 
shown below:
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
    PathFilter pathFilter = new PathFilter() {
      @Override
      public boolean accept(Path path) {
        return CarbonTablePath.isCarbonDataFile(path.getName());
      }
    };
    CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
    for (CarbonFile file : carbonFiles) {
      String[] location = file.getLocations(); // RPC call - 1
      long len = file.getSize(); // RPC call - 2
      BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
      fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo); // RPC call - 3 in file.getPath()
    }
  }
  return fileNameToMetaInfoMapping;
}
{code}

*Suggestion:*

I think we currently make an RPC call to fetch the file status on each invocation 
because the file status may change over time, so we shouldn't cache the file status in 
AbstractDFSCarbonFile.

In the current case, just before the loop over the carbon files, we already get the 
file status of all the carbon files in the segment with the RPC call shown below. 
LocatedFileStatus is a child class of FileStatus and carries the BlockLocation along 
with the file status.
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);{code}
The intention of getting all the file statuses here is to create the BlockMetaInfo 
instances and maintain the fileNameToMetaInfoMapping map.

So it is safe to avoid these unnecessary RPC calls that fetch the file status again in 
the getLocations(), getSize() and getPath() methods.
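A minimal sketch of the suggestion, written directly against the Hadoop FileSystem API 
rather than the CarbonFile wrappers: the map is built from the LocatedFileStatus 
objects that listLocatedStatus() already returned, so no per-file getFileStatus() RPC 
is needed for the locations, size or path. BlockMetaInfo is assumed to be the 
CarbonData class with the (String[] locations, long size) constructor used above; the 
helper class/method names and the ".carbondata" suffix check are placeholders for the 
existing CarbonFile/PathFilter plumbing.
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public final class BlockMetaInfoMapBuilder {

  // Sketch: a single listLocatedStatus() RPC per segment directory; each returned
  // LocatedFileStatus already carries the path, length and block locations, so the
  // loop needs no further getFileStatus() round trips.
  public static Map<String, BlockMetaInfo> buildMapping(FileSystem fileSystem, Path segmentPath)
      throws IOException {
    Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
    RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(segmentPath);
    while (iter.hasNext()) {
      LocatedFileStatus status = iter.next();
      if (!status.getPath().getName().endsWith(".carbondata")) {
        continue; // keep only carbondata files, mirroring the PathFilter above
      }
      // Flatten the hosts of all block locations, as a stand-in for what
      // file.getLocations() returns today.
      List<String> hosts = new ArrayList<>();
      for (BlockLocation block : status.getBlockLocations()) {
        Collections.addAll(hosts, block.getHosts());
      }
      BlockMetaInfo blockMetaInfo =
          new BlockMetaInfo(hosts.toArray(new String[0]), status.getLen());
      fileNameToMetaInfoMapping.put(status.getPath().toString(), blockMetaInfo);
    }
    return fileNameToMetaInfoMapping;
  }
}
{code}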

 

  was:
*Issue:*

In createCarbonDataFileBlockMetaInfoMapping method, we get list of carbondata 
files in the segment, loop through all the carbon files and make a map of 
fileNameToMetaInfoMapping

      In that carbon files loop, if the file is of AbstractDFSCarbonFile type, 
we get the org.apache.hadoop.fs.FileStatus thrice for each file. And the method 
to get file status is an RPC call(fileSystem.getFileStatus(path)). It takes 
~2ms in the cluster for each call. Thus, incurs an overhead of ~6ms per file. 
So overall driver side query processing time has increased significantly when 
there are more carbon files. Hence caused TPC-DS queries performance 
degradation.

Have shown the methods/calls which get the file status for the carbon file in 
loop:
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, 
configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof 
S3CarbonFile)) {
PathFilter pathFilter = new PathFilter() {
  @Override
  public boolean accept(Path path) {
return CarbonTablePath.isCarbonDataFile(path.getName());
  }
};
CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
for (CarbonFile file : carbonFiles) {
  String[] location = file.getLocations(); // RPC call - 1
  long len = file.getSize(); // RPC call - 2
  BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
  fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo); // RPC call 
- 3 in file.getpath() method
}
  }
  return fileNameToMetaInfoMapping;
}
{code}
 

*Suggestion:*

I think, currently we make RPC call to get the file status upon each invocation 
because file status may change over a period of time. And we shouldn't cache 
the file status in AbstractDFSCarbonFile.

     In the current case, just before the loop of carbon files, we get the file 
status of all the carbon files in the segment 

[jira] [Updated] (CARBONDATA-4050) TPC-DS queries performance degraded when compared to older versions due to redundant getFileStatus() invocations

2020-11-18 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4050:
--
Description: 
*Issue:*

In the createCarbonDataFileBlockMetaInfoMapping method, we get the list of carbondata 
files in the segment, loop through all the carbon files and build the 
fileNameToMetaInfoMapping map.

In that loop, if the file is of AbstractDFSCarbonFile type, we fetch the 
org.apache.hadoop.fs.FileStatus three times for each file. The method that gets the 
file status is an RPC call (fileSystem.getFileStatus(path)) and takes ~2ms per call in 
the cluster, incurring an overhead of ~6ms per file. So the overall driver-side query 
processing time increases significantly when there are many carbon files, which caused 
the TPC-DS query performance degradation.

The methods/calls that fetch the file status for each carbon file in the loop are 
shown below:
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
    PathFilter pathFilter = new PathFilter() {
      @Override
      public boolean accept(Path path) {
        return CarbonTablePath.isCarbonDataFile(path.getName());
      }
    };
    CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
    for (CarbonFile file : carbonFiles) {
      String[] location = file.getLocations(); // RPC call - 1
      long len = file.getSize(); // RPC call - 2
      BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
      fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo); // RPC call - 3 in file.getPath()
    }
  }
  return fileNameToMetaInfoMapping;
}
{code}

*Suggestion:*

I think we currently make an RPC call to fetch the file status on each invocation 
because the file status may change over time, so we shouldn't cache the file status in 
AbstractDFSCarbonFile.

In the current case, just before the loop over the carbon files, we already get the 
file status of all the carbon files in the segment with the RPC call shown below. 
LocatedFileStatus is a child class of FileStatus and carries the BlockLocation along 
with the file status.
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);{code}
The intention of getting all the file statuses here is to create the BlockMetaInfo 
instances and maintain the fileNameToMetaInfoMapping map.

So it is safe to avoid these unnecessary RPC calls that fetch the file status again in 
the getLocations(), getSize() and getPath() methods.

 

  was:
*Issue:*

In createCarbonDataFileBlockMetaInfoMapping method, we get list of carbondata 
files in the segment, loop through all the carbon files and make a map of 
fileNameToMetaInfoMapping

      In that carbon files loop, if the file is of AbstractDFSCarbonFile type, 
we get the org.apache.hadoop.fs.FileStatus thrice for each file. And the method 
to get file status is an RPC call(fileSystem.getFileStatus(path)). It takes 
~2ms in the cluster for each call. Thus, incurs an overhead of ~6ms per file. 
So overall driver side query processing time has increased significantly. Hence 
caused TPC-DS queries performance degradation.

Have highlighted the methods/calls which get the file status for the carbon 
file in loop

 
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, 
configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof 
S3CarbonFile)) {
PathFilter pathFilter = new PathFilter() {
  @Override
  public boolean accept(Path path) {
return CarbonTablePath.isCarbonDataFile(path.getName());
  }
};
CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
for (CarbonFile file : carbonFiles) {
  String[] location = file.getLocations(); // RPC call - 1
  long len = file.getSize(); // RPC call - 2
  BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
  fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo); // RPC call 
- 3 in file.getpath() method
}
  }
  return fileNameToMetaInfoMapping;
}
{code}
 

*Suggestion:*

I think, currently we make RPC call to get the file status upon each invocation 
because file status may change over a period of time. And we shouldn't cache 
the file status in AbstractDFSCarbonFile.

     In the current case, just before the loop of carbon files, we get the file 
status of all the carbon files in the segment with RPC call shown below. 

[jira] [Updated] (CARBONDATA-4050) TPC-DS queries performance degraded when compared to older versions due to redundant getFileStatus() invocations

2020-11-18 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4050:
--
Description: 
*Issue:*

In the createCarbonDataFileBlockMetaInfoMapping method, we get the list of carbondata 
files in the segment, loop through all the carbon files and build the 
fileNameToMetaInfoMapping map.

In that loop, if the file is of AbstractDFSCarbonFile type, we fetch the 
org.apache.hadoop.fs.FileStatus three times for each file. The method that gets the 
file status is an RPC call (fileSystem.getFileStatus(path)) and takes ~2ms per call in 
the cluster, incurring an overhead of ~6ms per file. So the overall driver-side query 
processing time increases significantly when there are many carbon files, which caused 
the TPC-DS query performance degradation.

The methods/calls that fetch the file status for each carbon file in the loop are 
shown below:
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
    PathFilter pathFilter = new PathFilter() {
      @Override
      public boolean accept(Path path) {
        return CarbonTablePath.isCarbonDataFile(path.getName());
      }
    };
    CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
    for (CarbonFile file : carbonFiles) {
      String[] location = file.getLocations(); // RPC call - 1
      long len = file.getSize(); // RPC call - 2
      BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
      fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo); // RPC call - 3 in file.getPath()
    }
  }
  return fileNameToMetaInfoMapping;
}
{code}

*Suggestion:*

I think we currently make an RPC call to fetch the file status on each invocation 
because the file status may change over time, so we shouldn't cache the file status in 
AbstractDFSCarbonFile.

In the current case, just before the loop over the carbon files, we already get the 
file status of all the carbon files in the segment with the RPC call shown below. 
LocatedFileStatus is a child class of FileStatus and carries the BlockLocation along 
with the file status.
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);{code}
The intention of getting all the file statuses here is to create the BlockMetaInfo 
instances and maintain the fileNameToMetaInfoMapping map.

So it is safe to avoid these unnecessary RPC calls that fetch the file status again in 
the getLocations(), getSize() and getPath() methods.

 

  was:
*Issue:*

In createCarbonDataFileBlockMetaInfoMapping method, we get list of carbondata 
files in the segment, loop through all the carbon files and make a map of 
fileNameToMetaInfoMapping

      In that carbon files loop, if the file is of AbstractDFSCarbonFile type, 
we get the org.apache.hadoop.fs.FileStatus thrice for each file. And the method 
to get file status is an RPC call(fileSystem.getFileStatus(path)). It takes 
~2ms in the cluster for each call. Thus, incurs an overhead of ~6ms per file. 
So overall driver side query processing time has increased significantly when 
there are more carbon files. Hence caused TPC-DS queries performance 
degradation.

Have shown the methods/calls which get the file status for the carbon file in 
loop
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, 
configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof 
S3CarbonFile)) {
PathFilter pathFilter = new PathFilter() {
  @Override
  public boolean accept(Path path) {
return CarbonTablePath.isCarbonDataFile(path.getName());
  }
};
CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
for (CarbonFile file : carbonFiles) {
  String[] location = file.getLocations(); // RPC call - 1
  long len = file.getSize(); // RPC call - 2
  BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
  fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo); // RPC call 
- 3 in file.getpath() method
}
  }
  return fileNameToMetaInfoMapping;
}
{code}
 

*Suggestion:*

I think, currently we make RPC call to get the file status upon each invocation 
because file status may change over a period of time. And we shouldn't cache 
the file status in AbstractDFSCarbonFile.

     In the current case, just before the loop of carbon files, we get the file 
status of all the carbon files in the segment 

[jira] [Updated] (CARBONDATA-4050) TPC-DS queries performance degraded when compared to older versions due to redundant getFileStatus() invocations

2020-11-18 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4050:
--
Description: 
*Issue:*

In the createCarbonDataFileBlockMetaInfoMapping method, we get the list of carbondata 
files in the segment, loop through all the carbon files and build the 
fileNameToMetaInfoMapping map.

In that loop, if the file is of AbstractDFSCarbonFile type, we fetch the 
org.apache.hadoop.fs.FileStatus three times for each file. The method that gets the 
file status is an RPC call (fileSystem.getFileStatus(path)) and takes ~2ms per call in 
the cluster, incurring an overhead of ~6ms per file. So the overall driver-side query 
processing time increases significantly, which caused the TPC-DS query performance 
degradation.

The methods/calls that fetch the file status for each carbon file in the loop are 
highlighted below:
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
    PathFilter pathFilter = new PathFilter() {
      @Override
      public boolean accept(Path path) {
        return CarbonTablePath.isCarbonDataFile(path.getName());
      }
    };
    CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
    for (CarbonFile file : carbonFiles) {
      String[] location = file.getLocations(); // RPC call - 1
      long len = file.getSize(); // RPC call - 2
      BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
      fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo); // RPC call - 3 in file.getPath()
    }
  }
  return fileNameToMetaInfoMapping;
}
{code}

*Suggestion:*

I think we currently make an RPC call to fetch the file status on each invocation 
because the file status may change over time, so we shouldn't cache the file status in 
AbstractDFSCarbonFile.

In the current case, just before the loop over the carbon files, we already get the 
file status of all the carbon files in the segment with the RPC call shown below. 
LocatedFileStatus is a child class of FileStatus and carries the BlockLocation along 
with the file status.
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);{code}
The intention of getting all the file statuses here is to create the BlockMetaInfo 
instances and maintain the fileNameToMetaInfoMapping map.

So it is safe to avoid these unnecessary RPC calls that fetch the file status again in 
the getLocations(), getSize() and getPath() methods.

 

  was:
*Issue:*

In createCarbonDataFileBlockMetaInfoMapping method, we get list of carbondata 
files in the segment, loop through all the carbon files and make a map of 
fileNameToMetaInfoMapping

      In that carbon files loop, if the file is of AbstractDFSCarbonFile type, 
we get the org.apache.hadoop.fs.FileStatus thrice for each file. And the method 
to get file status is an RPC call(fileSystem.getFileStatus(path)). It takes 
~2ms in the cluster for each call. Thus, incurs an overhead of ~6ms per file. 
So overall driver side query processing time has increased significantly.

Have highlighted the methods/calls which get the file status for the carbon 
file in loop

 
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, 
configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof 
S3CarbonFile)) {
PathFilter pathFilter = new PathFilter() {
  @Override
  public boolean accept(Path path) {
return CarbonTablePath.isCarbonDataFile(path.getName());
  }
};
CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
for (CarbonFile file : carbonFiles) {
  String[] location = file.getLocations();
  long len = file.getSize();
  BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
  fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo);
}
  }
  return fileNameToMetaInfoMapping;
}
{code}
 

*Suggestion:*

I think, currently we make RPC call to get the file status upon each invocation 
because file status may change over a period of time. And we shouldn't cache 
the file status in AbstractDFSCarbonFile.

     In the current case, just before the loop of carbon files, we get the file 
status of all the carbon files in the segment with RPC call shown below. 
LocatedFileStatus is a child class of FileStatus. It has BlockLocation along 
with file status. 
{code:java}
RemoteIterator iter = 

[jira] [Updated] (CARBONDATA-4050) TPC-DS queries performance degraded when compared to older versions due to redundant getFileStatus() invocations

2020-11-18 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4050:
--
Description: 
*Issue:*

In the createCarbonDataFileBlockMetaInfoMapping method, we get the list of carbondata 
files in the segment, loop through all the carbon files and build the 
fileNameToMetaInfoMapping map.

In that loop, if the file is of AbstractDFSCarbonFile type, we fetch the 
org.apache.hadoop.fs.FileStatus three times for each file. The method that gets the 
file status is an RPC call (fileSystem.getFileStatus(path)) and takes ~2ms per call in 
the cluster, incurring an overhead of ~6ms per file. So the overall driver-side query 
processing time increases significantly.

The methods/calls that fetch the file status for each carbon file in the loop are 
highlighted below:
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
    PathFilter pathFilter = new PathFilter() {
      @Override
      public boolean accept(Path path) {
        return CarbonTablePath.isCarbonDataFile(path.getName());
      }
    };
    CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
    for (CarbonFile file : carbonFiles) {
      String[] location = file.getLocations();
      long len = file.getSize();
      BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
      fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo);
    }
  }
  return fileNameToMetaInfoMapping;
}
{code}

*Suggestion:*

I think we currently make an RPC call to fetch the file status on each invocation 
because the file status may change over time, so we shouldn't cache the file status in 
AbstractDFSCarbonFile.

In the current case, just before the loop over the carbon files, we already get the 
file status of all the carbon files in the segment with the RPC call shown below. 
LocatedFileStatus is a child class of FileStatus and carries the BlockLocation along 
with the file status.
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);{code}
The intention of getting all the file statuses here is to create the BlockMetaInfo 
instances and maintain the fileNameToMetaInfoMapping map.

So it is safe to avoid these unnecessary RPC calls that fetch the file status again in 
the getLocations(), getSize() and getPath() methods.

 

  was:
*Issue:*

In createCarbonDataFileBlockMetaInfoMapping method, we get list of carbondata 
files in the segment, loop through all the carbon files and make a map of 
fileNameToMetaInfoMapping

      In that carbon files loop, if the file is of AbstractDFSCarbonFile type, 
we get the org.apache.hadoop.fs.FileStatus thrice for each file. And the method 
to get file status is an RPC call(fileSystem.getFileStatus(path)). It takes 
~2ms in the cluster for each call. Thus, incurs an overhead of ~6ms per file. 
So overall driver side query processing time has increased significantly.

Have highlighted the methods/calls which get the file status for the carbon 
file in loop

 
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, 
configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof 
S3CarbonFile)) {
PathFilter pathFilter = new PathFilter() {
  @Override
  public boolean accept(Path path) {
return CarbonTablePath.isCarbonDataFile(path.getName());
  }
};
CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
for (CarbonFile file : carbonFiles) {
  String[] location = file.getLocations();
  long len = file.getSize();
  BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
  fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo);
}
  }
  return fileNameToMetaInfoMapping;
}
{code}
 

 

*Suggestion:*

I think, currently we make RPC call to get the file status upon each invocation 
because file status may change over a period of time. And we shouldn't cache 
the file status in AbstractDFSCarbonFile.

     In the current case, just before the loop of carbon files, we get the file 
status of all the carbon files in the segment with RPC call shown below. 
LocatedFileStatus is a child class of FileStatus. It has BlockLocation along 
with file status.

 
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);{code}
Intention of getting all the file status here is to create instance of 
{{BlockMetaInfo}} and maintain 

[jira] [Updated] (CARBONDATA-4050) TPC-DS queries performance degraded when compared to older versions due to redundant getFileStatus() invocations

2020-11-18 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4050:
--
Description: 
*Issue:*

In the createCarbonDataFileBlockMetaInfoMapping method, we get the list of carbondata 
files in the segment, loop through all the carbon files and build the 
fileNameToMetaInfoMapping map.

In that loop, if the file is of AbstractDFSCarbonFile type, we fetch the 
org.apache.hadoop.fs.FileStatus three times for each file. The method that gets the 
file status is an RPC call (fileSystem.getFileStatus(path)) and takes ~2ms per call in 
the cluster, incurring an overhead of ~6ms per file. So the overall driver-side query 
processing time increases significantly.

The methods/calls that fetch the file status for each carbon file in the loop are 
highlighted below:
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
    PathFilter pathFilter = new PathFilter() {
      @Override
      public boolean accept(Path path) {
        return CarbonTablePath.isCarbonDataFile(path.getName());
      }
    };
    CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
    for (CarbonFile file : carbonFiles) {
      String[] location = file.getLocations();
      long len = file.getSize();
      BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
      fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo);
    }
  }
  return fileNameToMetaInfoMapping;
}
{code}

*Suggestion:*

I think we currently make an RPC call to fetch the file status on each invocation 
because the file status may change over time, so we shouldn't cache the file status in 
AbstractDFSCarbonFile.

In the current case, just before the loop over the carbon files, we already get the 
file status of all the carbon files in the segment with the RPC call shown below. 
LocatedFileStatus is a child class of FileStatus and carries the BlockLocation along 
with the file status.
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);{code}
The intention of getting all the file statuses here is to create the {{BlockMetaInfo}} 
instances and maintain the {{fileNameToMetaInfoMapping}} map.

So it is safe to avoid these unnecessary RPC calls that fetch the file status again in 
the getLocations(), getSize() and getPath() methods.

 

  was:
*Issue:*

In {{createCarbonDataFileBlockMetaInfoMapping }}method, we get list of 
carbondata files in the segment, loop through all the carbon files and make a 
map of {{fileNameToMetaInfoMapping}}

      In that carbon files loop, if the file is of {{AbstractDFSCarbonFile}} 
type, we get the {{org.apache.hadoop.fs.FileStatus}} thrice for each file. And 
the method to get file status is an RPC 
call({{fileSystem.getFileStatus(path)}}). It takes ~2ms in the cluster for each 
call. Thus, incurs an overhead of ~6ms per file. So overall driver side query 
processing time has increased significantly.

Have highlighted the methods/calls which get the file status for the carbon 
file in loop

 
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, 
configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof 
S3CarbonFile)) {
PathFilter pathFilter = new PathFilter() {
  @Override
  public boolean accept(Path path) {
return CarbonTablePath.isCarbonDataFile(path.getName());
  }
};
CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
for (CarbonFile file : carbonFiles) {
  String[] location = file.getLocations();
  long len = file.getSize();
  BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
  fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo);
}
  }
  return fileNameToMetaInfoMapping;
}
{code}
 

 

*Suggestion:*

I think, currently we make RPC call to get the file status upon each invocation 
because file status may change over a period of time. And we shouldn't cache 
the file status in AbstractDFSCarbonFile.

     In the current case, just before the loop of carbon files, we get the file 
status of all the carbon files in the segment with RPC call shown below. 
LocatedFileStatus is a child class of FileStatus. It has BlockLocation along 
with file status.

 
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);{code}
Intention of getting all the file status here is to create instance of 

[jira] [Updated] (CARBONDATA-4050) TPC-DS queries performance degraded when compared to older versions due to redundant getFileStatus() invocations

2020-11-18 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4050:
--
Description: 
*Issue:*

In the {{createCarbonDataFileBlockMetaInfoMapping}} method, we get the list of 
carbondata files in the segment, loop through all the carbon files and build the 
{{fileNameToMetaInfoMapping}} map.

In that loop, if the file is of {{AbstractDFSCarbonFile}} type, we fetch the 
{{org.apache.hadoop.fs.FileStatus}} three times for each file. The method that gets 
the file status is an RPC call ({{fileSystem.getFileStatus(path)}}) and takes ~2ms per 
call in the cluster, incurring an overhead of ~6ms per file. So the overall 
driver-side query processing time increases significantly.

The methods/calls that fetch the file status for each carbon file in the loop are 
highlighted below:
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
    PathFilter pathFilter = new PathFilter() {
      @Override
      public boolean accept(Path path) {
        return CarbonTablePath.isCarbonDataFile(path.getName());
      }
    };
    CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
    for (CarbonFile file : carbonFiles) {
      String[] location = file.getLocations();
      long len = file.getSize();
      BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
      fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo);
    }
  }
  return fileNameToMetaInfoMapping;
}
{code}

*Suggestion:*

I think we currently make an RPC call to fetch the file status on each invocation 
because the file status may change over time, so we shouldn't cache the file status in 
AbstractDFSCarbonFile.

In the current case, just before the loop over the carbon files, we already get the 
file status of all the carbon files in the segment with the RPC call shown below. 
LocatedFileStatus is a child class of FileStatus and carries the BlockLocation along 
with the file status.
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);{code}
The intention of getting all the file statuses here is to create the {{BlockMetaInfo}} 
instances and maintain the {{fileNameToMetaInfoMapping}} map.

So it is safe to avoid these unnecessary RPC calls that fetch the file status again in 
the getLocations(), getSize() and getPath() methods.

 

  was:
*Issue:*

In ** {{createCarbonDataFileBlockMetaInfoMapping }}method, we get list 
carbondata files in the segment, loop through all the carbon files and make a 
map of {{fileNameToMetaInfoMapping}}

      In that carbon files loop, if the file is of {{AbstractDFSCarbonFile 
}}type, we get the{{ org.apache.hadoop.fs.FileStatus}} thrice for each file.{{ 
}}And the method to get file status is an RPC 
call({{fileSystem.getFileStatus(path)}}). It takes ~2ms in the cluster for each 
call. Thus, incurs an overhead of ~6ms per file. So overall driver side query 
processing time has increased significantly.

Have highlighted the methods/calls which get the file status for the carbon 
file in loop

 
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, 
configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof 
S3CarbonFile)) {
PathFilter pathFilter = new PathFilter() {
  @Override
  public boolean accept(Path path) {
return CarbonTablePath.isCarbonDataFile(path.getName());
  }
};
CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
for (CarbonFile file : carbonFiles) {
  String[] location = file.getLocations();
  long len = file.getSize();
  BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
  fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo);
}
  }
  return fileNameToMetaInfoMapping;
}
{code}
 

 

*Suggestion:*

I think, currently we make RPC call to get the file status upon each invocation 
because file status may change over a period of time. And we shouldn't cache 
the file status in AbstractDFSCarbonFile.

     In the current case, just before the loop of carbon files, we get the file 
status of all the carbon files in the segment with RPC call shown below. 
LocatedFileStatus is a child class of FileStatus. It has BlockLocation along 
with file status.

 
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);{code}
Intention of getting all the file status here is to 

[jira] [Created] (CARBONDATA-4050) TPC-DS queries performance degraded when compared to older versions due to redundant getFileStatus() invocations

2020-11-18 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-4050:
-

 Summary: TPC-DS queries performance degraded when compared to 
older versions due to redundant getFileStatus() invocations
 Key: CARBONDATA-4050
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4050
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 2.0.0
Reporter: Venugopal Reddy K
 Fix For: 2.1.0


*Issue:*

In the {{createCarbonDataFileBlockMetaInfoMapping}} method, we get the list of 
carbondata files in the segment, loop through all the carbon files and build the 
{{fileNameToMetaInfoMapping}} map.

In that loop, if the file is of {{AbstractDFSCarbonFile}} type, we fetch the 
{{org.apache.hadoop.fs.FileStatus}} three times for each file. The method that gets 
the file status is an RPC call ({{fileSystem.getFileStatus(path)}}) and takes ~2ms per 
call in the cluster, incurring an overhead of ~6ms per file. So the overall 
driver-side query processing time increases significantly.

The methods/calls that fetch the file status for each carbon file in the loop are 
highlighted below:
{code:java}
public static Map<String, BlockMetaInfo> createCarbonDataFileBlockMetaInfoMapping(
    String segmentFilePath, Configuration configuration) throws IOException {
  Map<String, BlockMetaInfo> fileNameToMetaInfoMapping = new TreeMap<>();
  CarbonFile carbonFile = FileFactory.getCarbonFile(segmentFilePath, configuration);
  if (carbonFile instanceof AbstractDFSCarbonFile && !(carbonFile instanceof S3CarbonFile)) {
    PathFilter pathFilter = new PathFilter() {
      @Override
      public boolean accept(Path path) {
        return CarbonTablePath.isCarbonDataFile(path.getName());
      }
    };
    CarbonFile[] carbonFiles = carbonFile.locationAwareListFiles(pathFilter);
    for (CarbonFile file : carbonFiles) {
      String[] location = file.getLocations();
      long len = file.getSize();
      BlockMetaInfo blockMetaInfo = new BlockMetaInfo(location, len);
      fileNameToMetaInfoMapping.put(file.getPath(), blockMetaInfo);
    }
  }
  return fileNameToMetaInfoMapping;
}
{code}

*Suggestion:*

I think we currently make an RPC call to fetch the file status on each invocation 
because the file status may change over time, so we shouldn't cache the file status in 
AbstractDFSCarbonFile.

In the current case, just before the loop over the carbon files, we already get the 
file status of all the carbon files in the segment with the RPC call shown below. 
LocatedFileStatus is a child class of FileStatus and carries the BlockLocation along 
with the file status.
{code:java}
RemoteIterator<LocatedFileStatus> iter = fileSystem.listLocatedStatus(path);{code}
The intention of getting all the file statuses here is to create the {{BlockMetaInfo}} 
instances and maintain the {{fileNameToMetaInfoMapping}} map.

So it is safe to avoid these unnecessary RPC calls that fetch the file status again in 
the getLocations(), getSize() and getPath() methods.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-4042) Insert into select and CTAS launches fewer tasks(task count limited to number of nodes in cluster) even when target table is of no_sort

2020-10-23 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4042:
--
Description: 
*Issue:*

At present, when we do an insert into table select from, or a create table as select 
from, we launch a single task per node. Whereas when we run a simple select * from 
table query, the tasks launched are equal to the number of carbondata files 
(CARBON_TASK_DISTRIBUTION defaults to CARBON_TASK_DISTRIBUTION_BLOCK).

This slows down the load performance of the insert into select and CTAS cases.

Refer to the [community discussion regarding the task launch 
mechanism|http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Query-Regarding-Task-launch-mechanism-for-data-load-operations-tt98711.html]

 

*Suggestion:*

Launch the same number of tasks as the select query does for the insert into select 
and CTAS cases when the target table is a no-sort table.
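A rough sketch of the suggested rule, with hypothetical names throughout (only the 
no-sort condition, the one-task-per-node behaviour and the block-based distribution 
of the select path come from this description); it is not the actual CarbonData load 
code:
{code:java}
// Hypothetical sketch: decide how many tasks to launch for insert-into-select / CTAS.
public final class LoadTaskCountSketch {

  enum SortScope { NO_SORT, LOCAL_SORT, GLOBAL_SORT }

  static int tasksToLaunch(SortScope targetSortScope, int carbondataBlockCount, int nodeCount) {
    if (targetSortScope == SortScope.NO_SORT) {
      // No sorting step, so scale out like a plain "select *": one task per block,
      // matching the CARBON_TASK_DISTRIBUTION_BLOCK behaviour on the read side.
      return carbondataBlockCount;
    }
    // Sorted loads keep the existing behaviour: one task per node.
    return nodeCount;
  }
}
{code}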

  was:
*Issue:*

At present, When we do insert into table select from or create table as select 
from, we lauch one single task per node. Whereas when we do a simple select * 
from table query, tasks launched are equal to number of carbondata 
files(CARBON_TASK_DISTRIBUTION default is CARBON_TASK_DISTRIBUTION_BLOCK). 

Thus, slows down the load performance of insert into select and ctas cases.

Refer [Community discussion regd. task 
lauch|http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Query-Regarding-Task-launch-mechanism-for-data-load-operations-tt98711.html]

 

*Suggestion:*

Lauch the same number of tasks as in select query for insert into select and 
ctas cases when the target table is of no-sort.

 

SI creation

1. DDL -> Parser -> CarbonCreateSecondaryIndexCommand
   - do all validations (list the important ones)
   - acquireLockForSecondaryIndexCreation(): acquire locks (compact, meta, dele_seg lock)
   - prepare tableInfo for the SI table (prepare column schema, set positionref as sort, 
     inherit local dict from the main table) & addIndexInfoToParentTable (create 
     indexinfo and add it to the main table)
   - CreateTablePreExecutionEvent (for acl work)
   - create the SI table (sparksession.sql(create ...))
   - addIndexTableInfo, refreshTable the index table, add indexInfo to the hive 
     metastore as Serde
   - addOrModifyTableProperty(indexTableExists -> true) and refresh the table (refresh 
     catalog table)
2. Try load, LoadDataForSecondaryIndex
   1. prepare the load model for the SI
   2. read the table status and set it into the load model
   3. if loadmeta is empty, just return, else start the load to the SI
   4. getValidSeg; if there are valid segments go ahead, else return
   5. prepare segmentIdToLoadStartTimeMapping and prepare secindeModel
   6. create exeSer based on the thread pool size for parallel load of segments to the SI
   7. LoadTableSIPreExecutionEvent (ACL load events)
   8. try to get the segment lock for all valid segments; if acquired for all, add them 
      to valid, else add them to skipped segments
   9. start the load for the valid segments, update the SI table status to in progress
   10. if the sort scope is not global sort:
       CarbonSecondaryIndexRDD: internalGetPartitions -> prepareInputFormat and 
       getSplits(); internalCompute -> sort blocks, prepareTaskBlockMap, prepare 
       CarbonSecondaryIndexExecutor, exec.processTableBlocks (prepare the query model, 
       execute the query and return an Iterator), SecondaryIndexQueryResultProcessor 
       (prepare seg prop from processing the query result), 
       SecondaryIndexQueryResultProcessor.processQueryResult (init tempLoc, sort data 
       rows, processResult (does sort on data in iterators), prepareRowObjectForSorting, 
       addRowForSorting and startSorting, initializeFinalThreadMergerForMergeSort(), 
       initDataHandler(), readAndLoadDataFromSortTempFiles()), write the carbon files to 
       the index table store path, writeSegmentFile, get the load result from the future 
       and build success and failed segment lists; if failedSegList is not empty: 
       if (isCompactionCall || !isLoadToFailedSISegments) { fail the SI load } else 
       { just mark it marked-for-delete and the next load takes care of it }
       else (global sort): create the projections list including PR and create a 
       dataframe from the MT, loadDataUsingGlobalSort, writeSegmentFile, get the load 
       result from the future and build success and failed segment lists; same handling 
       if failedSegList is not empty
   11. if (successSISegments.nonEmpty && !isCompactionCall):
       update status to in progress (can avoid this), mergeIndexFiles, writeSegmentFile 
       (can be avoided, shreelekya working on it), read the table status file and 
       prepare the load model for merging data files, mergeDataFilesSISegments -> 
       scanSegmentsAndSubmitJob -> triggerCompaction -> CarbonSIRebuildRDD: 
       internalGetPartitions -> prepareInputFormat and getSplits(); internalCompute -> 
       CarbonCompactionExecutor.processTableBlocks(), close (delete old data files), 
       deleteOldIndexOrMergeIndexFiles, writeSegmentFile for each mergedSegment, 
       updateTableStatusFile: readTableStatusFile, writeLoadDetailsIntoFile (updated new 
       index and data size into the table status file)

[jira] [Updated] (CARBONDATA-4042) Insert into select and CTAS launches fewer tasks(task count limited to number of nodes in cluster) even when target table is of no_sort

2020-10-23 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4042:
--
Description: 
*Issue:*

At present, when we do insert into table select from or create table as select 
from, we launch a single task per node. Whereas when we do a simple select * from 
table query, the tasks launched are equal to the number of carbondata files 
(CARBON_TASK_DISTRIBUTION default is CARBON_TASK_DISTRIBUTION_BLOCK).

This slows down the load performance of the insert into select and CTAS cases.

Refer to the [community discussion regarding task launch|http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Query-Regarding-Task-launch-mechanism-for-data-load-operations-tt98711.html]

 

*Suggestion:*

Launch the same number of tasks as in the select query for insert into select and 
CTAS cases when the target table is of no-sort.

 

SI creation

1. DDL -> Parser -> CarbonCreateSecondaryIndexCommand: do all validations (list the 
important ones); acquireLockForSecondaryIndexCreation() acquires the locks (compact, 
meta, dele_seg lock); prepare tableInfo for the SI table (prepare the column schema, 
set positionref as sort column, inherit local dictionary from the main table) and 
addIndexInfoToParentTable (create indexInfo and add it to the main table); 
CreateTablePreExecutionEvent (for ACL work); create the SI table 
(sparksession.sql(create ...)); addIndexTableInfo, refreshTable for the index table, 
add indexInfo to the hive metastore as Serde; addOrModifyTableProperty(indexTableExists 
-> true) and refresh the catalog table.
2. Try load, LoadDataForSecondaryIndex:
 2.1 prepare the load model for the SI
 2.2 read the table status and set it into the load model
 2.3 if loadmeta is empty, just return, else start the load to the SI
 2.4 get the valid segments; if present go ahead, else return
 2.5 prepare segmentIdToLoadStartTimeMapping and prepare the secondary index model
 2.6 create an executor service based on the thread pool size for parallel load of 
segments to the SI
 2.7 LoadTableSIPreExecutionEvent (ACL load events)
 2.8 try to get the segment lock for every valid segment; if acquired for all, add 
to the valid list, else add to the skipped segments
 2.9 start the load for the valid segments and update the SI table status to in progress
 2.10 if the sort scope is not global sort: CarbonSecondaryIndexRDD -> 
internalGetPartitions (prepareInputFormat and getSplits()) -> internalCompute (sort 
blocks, prepareTaskBlockMap, prepare CarbonSecondaryIndexExecutor; 
exec.processTableBlocks prepares the query model, executes the query and returns an 
iterator); SecondaryIndexQueryResultProcessor (prepare segment properties from the 
query result); SecondaryIndexQueryResultProcessor.processQueryResult (init temp 
locations, sort the data rows; processResult does the sort on the data in the 
iterators; prepareRowObjectForSorting, addRowForSorting and startSorting; 
initializeFinalThreadMergerForMergeSort(), initDataHandler(), 
readAndLoadDataFromSortTempFiles()); write the carbon files to the index table store 
path; write the segment file; get the load result from the future and build the 
success and failed segment lists; if the failed segment list is not empty: if 
(isCompactionCall || !isLoadToFailedSISegments) fail the SI load, else just mark the 
segment as marked-for-delete and let the next load take care of it. Else (global 
sort): create the projections list including the position reference, create a 
dataframe from the main table, loadDataUsingGlobalSort, writeSegmentFile, get the 
load result from the future and build the success and failed segment lists, with the 
same failure handling as above.
 2.11 if (successSISegments.nonEmpty && !isCompactionCall): update the status to in 
progress (can avoid this); mergeIndexFiles; writeSegmentFile (can be avoided, 
shreelekya working on it); read the table status file and prepare the load model for 
merging data files; mergeDataFilesSISegments -> scanSegmentsAndSubmitJob -> 
triggerCompaction -> CarbonSIRebuildRDD: internalGetPartitions (prepareInputFormat 
and getSplits()), internalCompute (CarbonCompactionExecutor.processTableBlocks(), 
close() deletes the old data files); deleteOldIndexOrMergeIndexFiles; 
writeSegmentFile for each merged segment; updateTableStatusFile (readTableStatusFile, 
writeLoadDetailsIntoFile writes the updated index and data size into the tablestatus 
file); mergeIndexFiles for the index files newly generated for the merged data files; 
if IndexServer is enabled clear its cache, else clear the driver cache
 2.12 update the table status to success
 2.13 if (!isCompactionCall): triggerPrepriming (trigger pre-priming for the SI)
 2.14 if (failedSISegments.nonEmpty && !isCompactionCall): update the table status to 
MFD (marked for delete)
 2.15 if (!isCompactionCall): LoadTableSIPostExecutionEvent
 2.16 if the skipped segments are not empty, set isSITableEnabled to false
 2.17 deleteLoadsAndUpdateMetadata
 2.18 release the segment locks
3. if checkMainTableSegEqualToSISeg: set isSITableEnabled to true
4. CreateTablePostExecutionEvent
5. releaseLocks (meta, dele_seg, compact)

A DDL-level sketch of the user-facing commands behind this flow is shown below.
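
For orientation, a minimal DDL-level sketch of the commands that drive the above 
flow; the table, column and index names are hypothetical, and the syntax assumed is 
the CarbonData CREATE INDEX ... AS 'carbondata' form.
{code:java}
// Main table plus a secondary index; SI segments are kept in sync with the
// main table segments by the load flow described above.
sql("create table sales(id int, city string, amount double) stored as carbondata")
sql("create index city_index on table sales(city) as 'carbondata'")

// Each load into the main table also triggers LoadDataForSecondaryIndex for city_index.
sql("insert into sales select 1, 'shenzhen', 100.0")

// A filter on the indexed column can then be served through the SI table.
sql("select * from sales where city = 'shenzhen'").show()
{code}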

 

 

 

 

 

 

 

Refresh issue:
1. refresh is called 3 times, avoid it
2. check whether the dummy is required or not
3. it inherits the same sort

[jira] [Updated] (CARBONDATA-4042) Insert into select and CTAS launches fewer tasks(task count limited to number of nodes in cluster) even when target table is of no_sort

2020-10-23 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4042:
--
Summary: Insert into select and CTAS launches fewer tasks(task count 
limited to number of nodes in cluster) even when target table is of no_sort  
(was: Insert into select and CTAS launches fewer tasks(limited to max nodes) 
even when target table is of no_sort)

> Insert into select and CTAS launches fewer tasks(task count limited to number 
> of nodes in cluster) even when target table is of no_sort
> ---
>
> Key: CARBONDATA-4042
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4042
> Project: CarbonData
>  Issue Type: Improvement
>  Components: data-load, spark-integration
>Reporter: Venugopal Reddy K
>Priority: Major
>
> *Issue:*
> At present, When we do insert into table select from or create table as 
> select from, we lauch one single task per node. Whereas when we do a simple 
> select * from table query, tasks launched are equal to number of carbondata 
> files(CARBON_TASK_DISTRIBUTION default is CARBON_TASK_DISTRIBUTION_BLOCK). 
> Thus, slows down the load performance of insert into select and ctas cases.
> Refer [Community discussion regd. task 
> lauch|http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Query-Regarding-Task-launch-mechanism-for-data-load-operations-tt98711.html]
>  
> *Suggestion:*
> Lauch the same number of tasks as in select query for insert into select and 
> ctas cases when the target table is of no-sort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4042) Insert into select and CTAS launches fewer tasks(limited to max nodes) even when target table is of no_sort

2020-10-23 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-4042:
-

 Summary: Insert into select and CTAS launches fewer tasks(limited 
to max nodes) even when target table is of no_sort
 Key: CARBONDATA-4042
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4042
 Project: CarbonData
  Issue Type: Improvement
  Components: data-load, spark-integration
Reporter: Venugopal Reddy K


*Issue:*

At present, when we do insert into table select from or create table as select 
from, we launch a single task per node. Whereas when we do a simple select * from 
table query, the tasks launched are equal to the number of carbondata files 
(CARBON_TASK_DISTRIBUTION default is CARBON_TASK_DISTRIBUTION_BLOCK).

This slows down the load performance of the insert into select and CTAS cases.

Refer to the [community discussion regarding task launch|http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Query-Regarding-Task-launch-mechanism-for-data-load-operations-tt98711.html]

 

*Suggestion:*

Launch the same number of tasks as in the select query for insert into select and 
CTAS cases when the target table is of no-sort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (CARBONDATA-3834) Segment directory and segment file in metadata are not created for partitioned table when 'carbon.merge.index.in.segment' property is set to false.

2020-10-06 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K resolved CARBONDATA-3834.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

> Segment directory and segment file in metadata are not created for 
> partitioned table when 'carbon.merge.index.in.segment' property is set to 
> false.
> ---
>
> Key: CARBONDATA-3834
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3834
> Project: CarbonData
>  Issue Type: Bug
>  Components: hadoop-integration, spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
> Fix For: 2.0.0
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> *[Issue]*
> With the latest version of Carbon, Segment directory and segment file in 
> metadata directory are not created for partitioned table when 
> 'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
> files which were present in respective partition's '.tmp' directory are also 
> deleted without moving them out to the respective partition directory where its 
> '.carbondata' file exists. Thus all the queries fail due to this problem.
> {color:#FF}Exception in thread "main" 
> java.lang.NullPointerExceptionException in thread "main" 
> java.lang.NullPointerException at 
> org.apache.carbondata.core.readcommitter.TableStatusReadCommittedScope.getCommittedIndexFile(TableStatusReadCommittedScope.java:90)
>  at 
> org.apache.carbondata.core.index.Segment.getCommittedIndexFile(Segment.java:183)
>  at 
> org.apache.carbondata.core.util.BlockletIndexUtil.getTableBlockUniqueIdentifiers(BlockletIndexUtil.java:204)
>  at {color}
>  
> This issue was introduced from the resolution of an older optimization issue 
> -CARBONDATA-3641-+[Should improve data loading performance for partition 
> table]+
> i.e., with [https://github.com/apache/carbondata/pull/3535]
>  
> *[Modification Suggestion]*
>  
> If 'carbon.merge.index.in.segment' property is false, we can create the 
> segment directory and segment file, and move the index file from respective 
> partition's temp directory to partition directory where the .carbondata file 
> exists.
> Note: This needs to be done before the respective partition's .tmp directory 
> is deleted. Otherwise, we lose the index files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-4008) IN filter on date column is returning 0 results when 'carbon.push.rowfilters.for.vector' is true

2020-09-23 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-4008:
--
Summary: IN filter on date column is returning 0 results when 
'carbon.push.rowfilters.for.vector' is true  (was: IN filter with date column 
in condition is returning 0 results when 'carbon.push.rowfilters.for.vector' is 
true)

> IN filter on date column is returning 0 results when 
> 'carbon.push.rowfilters.for.vector' is true
> 
>
> Key: CARBONDATA-4008
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4008
> Project: CarbonData
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
> Fix For: 2.1.0
>
>
> *Issue:*
> IN filter with date column in condition is returning 0 results when 
> 'carbon.push.rowfilters.for.vector' is set to true.
>  
> *Steps to reproduce:*
> sql("set carbon.push.rowfilters.for.vector=true")
> sql("create table test_table(i int, dt date, ts timestamp) stored as 
> carbondata")
> sql("insert into test_table select 1, '2020-03-30', '2020-03-30 10:00:00'")
> sql("insert into test_table select 2, '2020-07-04', '2020-07-04 14:12:15'")
> sql("insert into test_table select 3, '2020-09-23', '2020-09-23 12:30:45'")
> sql("select * from test_table where dt IN ('2020-03-30', 
> '2020-09-23')").show()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4008) IN filter with date column in condition is returning 0 results when 'carbon.push.rowfilters.for.vector' is true

2020-09-23 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-4008:
-

 Summary: IN filter with date column in condition is returning 0 
results when 'carbon.push.rowfilters.for.vector' is true
 Key: CARBONDATA-4008
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4008
 Project: CarbonData
  Issue Type: Bug
  Components: core
Affects Versions: 2.0.0
Reporter: Venugopal Reddy K
 Fix For: 2.1.0


*Issue:*

IN filter with date column in condition is returning 0 results when 
'carbon.push.rowfilters.for.vector' is set to true.

 

*Steps to reproduce:*

sql("set carbon.push.rowfilters.for.vector=true")

sql("create table test_table(i int, dt date, ts timestamp) stored as 
carbondata")
sql("insert into test_table select 1, '2020-03-30', '2020-03-30 10:00:00'")
sql("insert into test_table select 2, '2020-07-04', '2020-07-04 14:12:15'")
sql("insert into test_table select 3, '2020-09-23', '2020-09-23 12:30:45'")
sql("select * from test_table where dt IN ('2020-03-30', '2020-09-23')").show()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3996) Show table extended like command throws java.lang.ArrayIndexOutOfBoundsException

2020-09-18 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-3996:
-

 Summary: Show table extended like command throws 
java.lang.ArrayIndexOutOfBoundsException
 Key: CARBONDATA-3996
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3996
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 2.0.0
Reporter: Venugopal Reddy K
 Fix For: 2.1.0


*Issue:*

Show table extended like command throws java.lang.ArrayIndexOutOfBoundsException

*Steps to reproduce:*

spark.sql("create table employee(id string, name string) stored as carbondata")
spark.sql("show table extended like 'emp*'").show(100, false)

*Exception stack:*

 
{code:java}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3Exception 
in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3 at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:201)
 at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getAs(rows.scala:35)
 at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
 at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
 at 
org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
 at 
org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
 at 
org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:44)
 at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:389)
 at 
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:152)
 at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:92)
 at 
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$24$$anonfun$applyOrElse$23.apply(Optimizer.scala:1364)
 at 
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$24$$anonfun$applyOrElse$23.apply(Optimizer.scala:1364)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
scala.collection.AbstractTraversable.map(Traversable.scala:104) at 
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$24.applyOrElse(Optimizer.scala:1364)
 at 
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$24.applyOrElse(Optimizer.scala:1359)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:258)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:258)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:257) 
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:263)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:263)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:328)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:326) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:263) 
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:263)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:263)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:328)
 

[jira] [Updated] (CARBONDATA-3907) Reuse firePreLoadEvents and firePostLoadEvents methods from CommonLoadUtils to trigger LoadTablePreExecutionEvent and LoadTablePostExecutionEvent respectively in alt

2020-07-16 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3907:
--
Description: 
*[Issue]*

Currently we have 2 different ways of firing LoadTablePreExecutionEvent and 
LoadTablePostExecutionEvent. We can reuse firePreLoadEvents and 
firePostLoadEvents methods from CommonLoadUtils to trigger 
LoadTablePreExecutionEvent and LoadTablePostExecutionEvent respectively in 
alter table add segment flow as well. 

*[Suggestion]*

Reuse firePreLoadEvents and firePostLoadEvents methods from CommonLoadUtils to 
trigger LoadTablePreExecutionEvent and LoadTablePostExecutionEvent respectively 
in alter table add segment flow.

  was:
*[Issue]*

Currently we have 2 different ways of firing LoadTablePreExecutionEvent and 
LoadTablePostExecutionEvent. We can reuse firePreLoadEvents and 
firePostLoadEvents methods from CommonLoadUtils to trigger 
LoadTablePreExecutionEvent and LoadTablePostExecutionEvent respectively in 
alter table add segment flow as well. So that we can have single flow to fire 
these events

 

*[Suggestion]*

Reuse firePreLoadEvents and firePostLoadEvents methods from CommonLoadUtils to 
trigger LoadTablePreExecutionEvent and LoadTablePostExecutionEvent respectively 
in alter table add segment flow.


> Reuse firePreLoadEvents and firePostLoadEvents methods from CommonLoadUtils 
> to trigger LoadTablePreExecutionEvent and LoadTablePostExecutionEvent 
> respectively in alter table add segment flow
> --
>
> Key: CARBONDATA-3907
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3907
> Project: CarbonData
>  Issue Type: Improvement
>  Components: spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Minor
> Fix For: 2.1.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> *[Issue]*
> Currently we have 2 different ways of firing LoadTablePreExecutionEvent and 
> LoadTablePostExecutionEvent. We can reuse firePreLoadEvents and 
> firePostLoadEvents methods from CommonLoadUtils to trigger 
> LoadTablePreExecutionEvent and LoadTablePostExecutionEvent respectively in 
> alter table add segment flow as well. 
> *[Suggestion]*
> Reuse firePreLoadEvents and firePostLoadEvents methods from CommonLoadUtils 
> to trigger LoadTablePreExecutionEvent and LoadTablePostExecutionEvent 
> respectively in alter table add segment flow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3907) Reuse firePreLoadEvents and firePostLoadEvents methods from CommonLoadUtils to trigger LoadTablePreExecutionEvent and LoadTablePostExecutionEvent respectively in alt

2020-07-16 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-3907:
-

 Summary: Reuse firePreLoadEvents and firePostLoadEvents methods 
from CommonLoadUtils to trigger LoadTablePreExecutionEvent and 
LoadTablePostExecutionEvent respectively in alter table add segment flow
 Key: CARBONDATA-3907
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3907
 Project: CarbonData
  Issue Type: Improvement
  Components: spark-integration
Affects Versions: 2.0.0
Reporter: Venugopal Reddy K
 Fix For: 2.1.0


*[Issue]*

Currently we have 2 different ways of firing LoadTablePreExecutionEvent and 
LoadTablePostExecutionEvent. We can reuse firePreLoadEvents and 
firePostLoadEvents methods from CommonLoadUtils to trigger 
LoadTablePreExecutionEvent and LoadTablePostExecutionEvent respectively in the 
alter table add segment flow as well, so that we have a single flow to fire 
these events.

 

*[Suggestion]*

Reuse firePreLoadEvents and firePostLoadEvents methods from CommonLoadUtils to 
trigger LoadTablePreExecutionEvent and LoadTablePostExecutionEvent respectively 
in alter table add segment flow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3834) Segment directory and segment file in metadata are not created for partitioned table when 'carbon.merge.index.in.segment' property is set to false.

2020-05-27 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3834:
--
Description: 
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to the respective partition directory where its 
'.carbondata' file exists. Thus all the queries fail due to this problem.

{color:#FF}Exception in thread "main" 
java.lang.NullPointerExceptionException in thread "main" 
java.lang.NullPointerException at 
org.apache.carbondata.core.readcommitter.TableStatusReadCommittedScope.getCommittedIndexFile(TableStatusReadCommittedScope.java:90)
 at 
org.apache.carbondata.core.index.Segment.getCommittedIndexFile(Segment.java:183)
 at 
org.apache.carbondata.core.util.BlockletIndexUtil.getTableBlockUniqueIdentifiers(BlockletIndexUtil.java:204)
 at {color}

 

This issue was introduced from the resolution of an older optimization issue 
-CARBONDATA-3641-+[Should improve data loading performance for partition table]+

i.e., with [https://github.com/apache/carbondata/pull/3535]

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This needs to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we lose the index files.
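
For context, a minimal sketch of the setup in which the problem shows up; the table 
name and data are hypothetical, and only the 'carbon.merge.index.in.segment' 
property and standard partitioned-table DDL are assumed.
{code:java}
// Keep individual index files (no merge index) for each partition's segment.
CarbonProperties.getInstance()
    .addProperty("carbon.merge.index.in.segment", "false")

sql("create table part_table(id int, name string) partitioned by (dt string) " +
    "stored as carbondata")
sql("insert into part_table select 1, 'a', '2020-05-27'")

// With the bug, the segment file under Metadata is never written and the index
// files are removed together with the partition's '.tmp' directory, so reads fail.
sql("select * from part_table").show()
{code}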

  was:
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist. Thus all the queries fail due to this problem.

This issue was introduced from the resolution of an older optimization issue 
-CARBONDATA-3641-+[Should improve data loading performance for partition table]+

i.e., with [https://github.com/apache/carbondata/pull/3535]

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This need to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we loose the index files.


> Segment directory and segment file in metadata are not created for 
> partitioned table when 'carbon.merge.index.in.segment' property is set to 
> false.
> ---
>
> Key: CARBONDATA-3834
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3834
> Project: CarbonData
>  Issue Type: Bug
>  Components: hadoop-integration, spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
>
> *[Issue]*
> With the latest version of Carbon, Segment directory and segment file in 
> metadata directory are not created for partitioned table when 
> 'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
> files which were present in respective partition's '.tmp' directory are also 
> deleted without moving them out to respective partition directory where its 
> '.carbondata' file exist. Thus all the queries fails due to this problem.
> {color:#FF}Exception in thread "main" 
> java.lang.NullPointerExceptionException in thread "main" 
> java.lang.NullPointerException at 
> org.apache.carbondata.core.readcommitter.TableStatusReadCommittedScope.getCommittedIndexFile(TableStatusReadCommittedScope.java:90)
>  at 
> org.apache.carbondata.core.index.Segment.getCommittedIndexFile(Segment.java:183)
>  at 
> org.apache.carbondata.core.util.BlockletIndexUtil.getTableBlockUniqueIdentifiers(BlockletIndexUtil.java:204)
>  at {color}
>  
> This issue was introduced from the resolution of an older optimization issue 
> -CARBONDATA-3641-+[Should improve data loading performance for partition 
> table]+
> i.e., with [https://github.com/apache/carbondata/pull/3535]
>  
> *[Modification Suggestion]*
>  
> If 'carbon.merge.index.in.segment' property is false, we can create the 
> segment directory and segment file, and move the index file from respective 
> partition's temp directory to partition directory where the .carbondata file 
> 

[jira] [Updated] (CARBONDATA-3834) Segment directory and segment file in metadata are not created for partitioned table when 'carbon.merge.index.in.segment' property is set to false.

2020-05-27 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3834:
--
Description: 
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist. Thus all the queries fail due to this problem.

This issue was introduced from the resolution of an older optimization issue 
-CARBONDATA-3641-+[Should improve data loading performance for partition table]+

i.e., with [https://github.com/apache/carbondata/pull/3535]

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This needs to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we lose the index files.

  was:
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue 
-CARBONDATA-3641-+[Should improve data loading performance for partition table]+

i.e., with [https://github.com/apache/carbondata/pull/3535]

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This need to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we loose the index files.


> Segment directory and segment file in metadata are not created for 
> partitioned table when 'carbon.merge.index.in.segment' property is set to 
> false.
> ---
>
> Key: CARBONDATA-3834
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3834
> Project: CarbonData
>  Issue Type: Bug
>  Components: hadoop-integration, spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
>
> *[Issue]*
> With the latest version of Carbon, Segment directory and segment file in 
> metadata directory are not created for partitioned table when 
> 'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
> files which were present in respective partition's '.tmp' directory are also 
> deleted without moving them out to respective partition directory where its 
> '.carbondata' file exist. Thus all the queries fail due to this problem.
> This issue was introduced from the resolution of an older optimization issue 
> -CARBONDATA-3641-+[Should improve data loading performance for partition 
> table]+
> i.e., with [https://github.com/apache/carbondata/pull/3535]
>  
> *[Modification Suggestion]*
>  
> If 'carbon.merge.index.in.segment' property is false, we can create the 
> segment directory and segment file, and move the index file from respective 
> partition's temp directory to partition directory where the .carbondata file 
> exists.
> Note: This need to be done before the respective partition's .tmp directory 
> is deleted. Otherwise, we loose the index files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3834) Segment directory and segment file in metadata are not created for partitioned table when 'carbon.merge.index.in.segment' property is set to false.

2020-05-27 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3834:
--
Description: 
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue 
-CARBONDATA-3641-+[Should improve data loading performance for partition table]+

i.e., with [https://github.com/apache/carbondata/pull/3535]

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This needs to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we lose the index files.

  was:
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue 
-CARBONDATA-3641-+[Should improve data loading performance for partition table]+

i.e., with PR 3535

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This need to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we loose the index files.


> Segment directory and segment file in metadata are not created for 
> partitioned table when 'carbon.merge.index.in.segment' property is set to 
> false.
> ---
>
> Key: CARBONDATA-3834
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3834
> Project: CarbonData
>  Issue Type: Bug
>  Components: hadoop-integration, spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
>
> *[Issue]*
> With the latest version of Carbon, Segment directory and segment file in 
> metadata directory are not created for partitioned table when 
> 'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
> files which were present in respective partition's '.tmp' directory are also 
> deleted without moving them out to respective partition directory where its 
> '.carbondata' file exist.
> This issue was introduced from the resolution of an older optimization issue 
> -CARBONDATA-3641-+[Should improve data loading performance for partition 
> table]+
> i.e., with [https://github.com/apache/carbondata/pull/3535]
>  
> *[Modification Suggestion]*
>  
> If 'carbon.merge.index.in.segment' property is false, we can create the 
> segment directory and segment file, and move the index file from respective 
> partition's temp directory to partition directory where the .carbondata file 
> exists.
> Note: This need to be done before the respective partition's .tmp directory 
> is deleted. Otherwise, we loose the index files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3834) Segment directory and segment file in metadata are not created for partitioned table when 'carbon.merge.index.in.segment' property is set to false.

2020-05-27 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3834:
--
Description: 
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue?? 
-CARBONDATA-3641- ??+[Should improve data loading performance for partition 
table]+

i.e., with PR 3535

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This needs to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we lose the index files.

  was:
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue 
CARBONDATA-3641 +[Should improve data loading performance for partition table]+

i.e., with PR 3535

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This need to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we loose the index files.


> Segment directory and segment file in metadata are not created for 
> partitioned table when 'carbon.merge.index.in.segment' property is set to 
> false.
> ---
>
> Key: CARBONDATA-3834
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3834
> Project: CarbonData
>  Issue Type: Bug
>  Components: hadoop-integration, spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
>
> *[Issue]*
> With the latest version of Carbon, Segment directory and segment file in 
> metadata directory are not created for partitioned table when 
> 'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
> files which were present in respective partition's '.tmp' directory are also 
> deleted without moving them out to respective partition directory where its 
> '.carbondata' file exist.
> This issue was introduced from the resolution of an older optimization 
> issue?? -CARBONDATA-3641- ??+[Should improve data loading performance for 
> partition table]+
> i.e., with PR 3535
>  
> *[Modification Suggestion]*
>  
> If 'carbon.merge.index.in.segment' property is false, we can create the 
> segment directory and segment file, and move the index file from respective 
> partition's temp directory to partition directory where the .carbondata file 
> exists.
> Note: This need to be done before the respective partition's .tmp directory 
> is deleted. Otherwise, we loose the index files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3834) Segment directory and segment file in metadata are not created for partitioned table when 'carbon.merge.index.in.segment' property is set to false.

2020-05-27 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3834:
--
Description: 
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue?? 
-CARBONDATA-3641-+[Should improve data loading performance for partition table]+

i.e., with PR 3535

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This needs to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we lose the index files.

  was:
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue?? 
-CARBONDATA-3641- ??+[Should improve data loading performance for partition 
table]+

i.e., with PR 3535

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This need to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we loose the index files.


> Segment directory and segment file in metadata are not created for 
> partitioned table when 'carbon.merge.index.in.segment' property is set to 
> false.
> ---
>
> Key: CARBONDATA-3834
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3834
> Project: CarbonData
>  Issue Type: Bug
>  Components: hadoop-integration, spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
>
> *[Issue]*
> With the latest version of Carbon, Segment directory and segment file in 
> metadata directory are not created for partitioned table when 
> 'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
> files which were present in respective partition's '.tmp' directory are also 
> deleted without moving them out to respective partition directory where its 
> '.carbondata' file exist.
> This issue was introduced from the resolution of an older optimization 
> issue?? -CARBONDATA-3641-+[Should improve data loading performance for 
> partition table]+
> i.e., with PR 3535
>  
> *[Modification Suggestion]*
>  
> If 'carbon.merge.index.in.segment' property is false, we can create the 
> segment directory and segment file, and move the index file from respective 
> partition's temp directory to partition directory where the .carbondata file 
> exists.
> Note: This need to be done before the respective partition's .tmp directory 
> is deleted. Otherwise, we loose the index files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3834) Segment directory and segment file in metadata are not created for partitioned table when 'carbon.merge.index.in.segment' property is set to false.

2020-05-27 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3834:
--
Description: 
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue 
-CARBONDATA-3641-+[Should improve data loading performance for partition table]+

i.e., with PR 3535

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This needs to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we lose the index files.

  was:
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue?? 
-CARBONDATA-3641-+[Should improve data loading performance for partition table]+

i.e., with PR 3535

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This need to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we loose the index files.


> Segment directory and segment file in metadata are not created for 
> partitioned table when 'carbon.merge.index.in.segment' property is set to 
> false.
> ---
>
> Key: CARBONDATA-3834
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3834
> Project: CarbonData
>  Issue Type: Bug
>  Components: hadoop-integration, spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
>
> *[Issue]*
> With the latest version of Carbon, Segment directory and segment file in 
> metadata directory are not created for partitioned table when 
> 'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
> files which were present in respective partition's '.tmp' directory are also 
> deleted without moving them out to respective partition directory where its 
> '.carbondata' file exist.
> This issue was introduced from the resolution of an older optimization issue 
> -CARBONDATA-3641-+[Should improve data loading performance for partition 
> table]+
> i.e., with PR 3535
>  
> *[Modification Suggestion]*
>  
> If 'carbon.merge.index.in.segment' property is false, we can create the 
> segment directory and segment file, and move the index file from respective 
> partition's temp directory to partition directory where the .carbondata file 
> exists.
> Note: This need to be done before the respective partition's .tmp directory 
> is deleted. Otherwise, we loose the index files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3834) Segment directory and segment file in metadata are not created for partitioned table when 'carbon.merge.index.in.segment' property is set to false.

2020-05-27 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3834:
--
Description: 
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue 
CARBONDATA-3641 +[Should improve data loading performance for partition table]+

i.e., with PR 3535

 

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This needs to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we lose the index files.

  was:
*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue 
CARBONDATA-3641 +[++Should improve data loading performance for partition 
table]+

 

i.e., PR 3535

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This need to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we loose the index files.


> Segment directory and segment file in metadata are not created for 
> partitioned table when 'carbon.merge.index.in.segment' property is set to 
> false.
> ---
>
> Key: CARBONDATA-3834
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3834
> Project: CarbonData
>  Issue Type: Bug
>  Components: hadoop-integration, spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
>
> *[Issue]*
> With the latest version of Carbon, Segment directory and segment file in 
> metadata directory are not created for partitioned table when 
> 'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
> files which were present in respective partition's '.tmp' directory are also 
> deleted without moving them out to respective partition directory where its 
> '.carbondata' file exist.
> This issue was introduced from the resolution of an older optimization issue 
> CARBONDATA-3641 +[Should improve data loading performance for partition 
> table]+
> i.e., with PR 3535
>  
> *[Modification Suggestion]*
>  
> If 'carbon.merge.index.in.segment' property is false, we can create the 
> segment directory and segment file, and move the index file from respective 
> partition's temp directory to partition directory where the .carbondata file 
> exists.
> Note: This need to be done before the respective partition's .tmp directory 
> is deleted. Otherwise, we loose the index files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3834) Segment directory and segment file in metadata are not created for partitioned table when 'carbon.merge.index.in.segment' property is set to false.

2020-05-27 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-3834:
-

 Summary: Segment directory and segment file in metadata are not 
created for partitioned table when 'carbon.merge.index.in.segment' property is 
set to false.
 Key: CARBONDATA-3834
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3834
 Project: CarbonData
  Issue Type: Improvement
  Components: hadoop-integration, spark-integration
Affects Versions: 2.0.0
Reporter: Venugopal Reddy K


*[Issue]*

With the latest version of Carbon, Segment directory and segment file in 
metadata directory are not created for partitioned table when 
'carbon.merge.index.in.segment' property is set to 'false'. And actual index 
files which were present in respective partition's '.tmp' directory are also 
deleted without moving them out to respective partition directory where its 
'.carbondata' file exist.

This issue was introduced from the resolution of an older optimization issue 
CARBONDATA-3641 +[++Should improve data loading performance for partition 
table]+

 

i.e., PR 3535

*[Modification Suggestion]*

 

If 'carbon.merge.index.in.segment' property is false, we can create the segment 
directory and segment file, and move the index file from respective partition's 
temp directory to partition directory where the .carbondata file exists.

Note: This needs to be done before the respective partition's .tmp directory is 
deleted. Otherwise, we lose the index files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3834) Segment directory and segment file in metadata are not created for partitioned table when 'carbon.merge.index.in.segment' property is set to false.

2020-05-27 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3834:
--
Issue Type: Bug  (was: Improvement)

> Segment directory and segment file in metadata are not created for 
> partitioned table when 'carbon.merge.index.in.segment' property is set to 
> false.
> ---
>
> Key: CARBONDATA-3834
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3834
> Project: CarbonData
>  Issue Type: Bug
>  Components: hadoop-integration, spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
>
> *[Issue]*
> With the latest version of Carbon, the segment directory and the segment file 
> in the metadata directory are not created for a partitioned table when the 
> 'carbon.merge.index.in.segment' property is set to 'false'. The actual index 
> files that were present in the respective partition's '.tmp' directory are also 
> deleted without being moved to the partition directory where the 
> '.carbondata' file exists.
> This issue was introduced by the fix for an older optimization issue, 
> CARBONDATA-3641 +[Should improve data loading performance for partition 
> table]+
>  
> i.e., PR 3535
> *[Modification Suggestion]*
>  
> If the 'carbon.merge.index.in.segment' property is false, we can create the 
> segment directory and segment file, and move the index files from the respective 
> partition's temp directory to the partition directory where the .carbondata file 
> exists.
> Note: This needs to be done before the respective partition's .tmp directory 
> is deleted. Otherwise, we lose the index files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3832) Block Pruning for geospatial polygon expression

2020-05-21 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-3832:
-

 Summary: Block Pruning for geospatial polygon expression
 Key: CARBONDATA-3832
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3832
 Project: CarbonData
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Venugopal Reddy K


*[Issue]*

At present, Carbon doesn't do block/blocklet pruning for polygon filter 
queries. It does row-level filtering at the Carbon layer and returns the result. 
With this approach, all the carbon files are scanned irrespective of whether 
there are any matching rows in the block. It also incurs Spark overhead to 
launch many jobs and tasks to process them. This affects the overall 
performance of polygon queries.

 

*[Solution]*

We can leverage the existing block pruning mechanism in Carbon to skip the 
unwanted blocks, thereby reducing the number of splits. On the executor side, 
we can also use blocklet pruning to reduce the number of blocklets to be read 
and scanned.

This improves polygon query performance.
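
As an illustration, a polygon filter query of the kind this pruning would target, 
using the IN_POLYGON UDF syntax shown elsewhere in this thread; the table and 
column names are hypothetical:

// Hypothetical geo table; the IN_POLYGON syntax follows the example given in
// the CARBONDATA-3548 comment in this thread.
val result = spark.sql("""
  SELECT timevalue, longitude, latitude
  FROM geo_events
  WHERE IN_POLYGON('116.321011 40.123503, 116.137676 39.947911, 116.560993 39.935276, 116.321011 40.123503')
  """)
result.show()

// With block/blocklet pruning, only blocks whose min/max range can intersect
// the polygon's geohash IDs should be scanned, instead of every carbondata
// file in the table.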



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3815) Insert into table select from another table throws exception for spatial tables

2020-05-11 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3815:
--
Description: 
*Issue:*

Insert into a table selecting from another table throws an exception for spatial 
tables. A NoSuchElementException is thrown for the 'mygeohash' column.

{color:#ff}Exception in thread "main" java.util.NoSuchElementException: key 
not found: mygeohashException in thread "main" 
java.util.NoSuchElementException: key not found: mygeohash at 
scala.collection.MapLike$class.default(MapLike.scala:228) at 
scala.collection.AbstractMap.default(Map.scala:59) at 
scala.collection.mutable.HashMap.apply(HashMap.scala:65) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:504)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:497)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.getReArrangedIndexAndSelectedSchema(CarbonInsertIntoCommand.scala:496)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.processData(CarbonInsertIntoCommand.scala:164){color}

*Step to reproduce:*
 # Create source and target spatial tables.
 # Load data to source table.
 # Insert into target table select from source table.

*TestCase:*

spark.sql(s"""
CREATE TABLE source(
timevalue BIGINT,
longitude LONG,
latitude LONG) COMMENT "This is a GeoTable"
STORED AS carbondata
TBLPROPERTIES ('INDEX_HANDLER'='mygeohash',
'INDEX_HANDLER.mygeohash.type'='geohash',
'INDEX_HANDLER.mygeohash.sourcecolumns'='longitude, latitude',
'INDEX_HANDLER.mygeohash.originLatitude'='39.832277',
'INDEX_HANDLER.mygeohash.gridSize'='50',
'INDEX_HANDLER.mygeohash.minLongitude'='115.811865',
'INDEX_HANDLER.mygeohash.maxLongitude'='116.782233',
'INDEX_HANDLER.mygeohash.minLatitude'='39.832277',
'INDEX_HANDLER.mygeohash.maxLatitude'='40.225281',
'INDEX_HANDLER.mygeohash.conversionRatio'='100')
""".stripMargin)

 

val path = s"$rootPath/examples/spark/src/main/resources/geodata.csv"
spark.sql(s"""
LOAD DATA LOCAL INPATH '$path'
INTO TABLE source
OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='#')
""".stripMargin)

 

spark.sql(s"""
CREATE TABLE target(
timevalue BIGINT,
longitude LONG,
latitude LONG) COMMENT "This is a GeoTable"
STORED AS carbondata
TBLPROPERTIES ('INDEX_HANDLER'='mygeohash',
'INDEX_HANDLER.mygeohash.type'='geohash',
'INDEX_HANDLER.mygeohash.sourcecolumns'='longitude, latitude',
'INDEX_HANDLER.mygeohash.originLatitude'='39.832277',
'INDEX_HANDLER.mygeohash.gridSize'='50',
'INDEX_HANDLER.mygeohash.minLongitude'='115.811865',
'INDEX_HANDLER.mygeohash.maxLongitude'='116.782233',
'INDEX_HANDLER.mygeohash.minLatitude'='39.832277',
'INDEX_HANDLER.mygeohash.maxLatitude'='40.225281',
'INDEX_HANDLER.mygeohash.conversionRatio'='100')
""".stripMargin)
 
 
spark.sql("insert into target select * from source")

  was:
*Issue:*

Insert into table select from another table throws exception for spatial 
tables. NoSuchElementException exception is thrown with 'mygeohash' column.

{color:#ff}Exception in thread "main" java.util.NoSuchElementException: key 
not found: mygeohashException in thread "main" 
java.util.NoSuchElementException: key not found: mygeohash at 
scala.collection.MapLike$class.default(MapLike.scala:228) at 
scala.collection.AbstractMap.default(Map.scala:59) at 
scala.collection.mutable.HashMap.apply(HashMap.scala:65) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:504)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:497)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.getReArrangedIndexAndSelectedSchema(CarbonInsertIntoCommand.scala:496)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.processData(CarbonInsertIntoCommand.scala:164){color}

*Step to reproduce:*
 # Create source table and target table spatial tables.
 # Load data to source table.
 # Insert into target table select from source table.

*TestCase:*
spark.sql(s"""
 | CREATE TABLE source(
 | timevalue BIGINT,
 | longitude LONG,
 | latitude LONG) COMMENT "This is a GeoTable"
 | STORED AS carbondata
 | TBLPROPERTIES ('INDEX_HANDLER'='mygeohash',
 | 'INDEX_HANDLER.mygeohash.type'='geohash',
[jira] [Updated] (CARBONDATA-3815) Insert into table select from another table throws exception for spatial tables

2020-05-11 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3815:
--
Description: 
*Issue:*

Insert into table select from another table throws exception for spatial 
tables. NoSuchElementException exception is thrown with 'mygeohash' column.

{color:#ff}Exception in thread "main" java.util.NoSuchElementException: key 
not found: mygeohashException in thread "main" 
java.util.NoSuchElementException: key not found: mygeohash at 
scala.collection.MapLike$class.default(MapLike.scala:228) at 
scala.collection.AbstractMap.default(Map.scala:59) at 
scala.collection.mutable.HashMap.apply(HashMap.scala:65) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:504)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:497)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.getReArrangedIndexAndSelectedSchema(CarbonInsertIntoCommand.scala:496)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.processData(CarbonInsertIntoCommand.scala:164){color}

*Step to reproduce:*
 # Create source table and target table spatial tables.
 # Load data to source table.
 # Insert into target table select from source table.

*TestCase:*
spark.sql(s"""
 | CREATE TABLE source(
 | timevalue BIGINT,
 | longitude LONG,
 | latitude LONG) COMMENT "This is a GeoTable"
 | STORED AS carbondata
 | TBLPROPERTIES ('INDEX_HANDLER'='mygeohash',
 | 'INDEX_HANDLER.mygeohash.type'='geohash',
 | 'INDEX_HANDLER.mygeohash.sourcecolumns'='longitude, latitude',
 | 'INDEX_HANDLER.mygeohash.originLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.gridSize'='50',
 | 'INDEX_HANDLER.mygeohash.minLongitude'='115.811865',
 | 'INDEX_HANDLER.mygeohash.maxLongitude'='116.782233',
 | 'INDEX_HANDLER.mygeohash.minLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.maxLatitude'='40.225281',
 | 'INDEX_HANDLER.mygeohash.conversionRatio'='100')
 """.stripMargin)

val path = s"$rootPath/examples/spark/src/main/resources/geodata.csv"

spark.sql(s"""
 | LOAD DATA LOCAL INPATH '$path'
 | INTO TABLE source
 | OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='#')
 """.stripMargin)

spark.sql(s"""
 | CREATE TABLE target(
 | timevalue BIGINT,
 | longitude LONG,
 | latitude LONG) COMMENT "This is a GeoTable"
 | STORED AS carbondata
 | TBLPROPERTIES ('INDEX_HANDLER'='mygeohash',
 | 'INDEX_HANDLER.mygeohash.type'='geohash',
 | 'INDEX_HANDLER.mygeohash.sourcecolumns'='longitude, latitude',
 | 'INDEX_HANDLER.mygeohash.originLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.gridSize'='50',
 | 'INDEX_HANDLER.mygeohash.minLongitude'='115.811865',
 | 'INDEX_HANDLER.mygeohash.maxLongitude'='116.782233',
 | 'INDEX_HANDLER.mygeohash.minLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.maxLatitude'='40.225281',
 | 'INDEX_HANDLER.mygeohash.conversionRatio'='100')
 """.stripMargin)

spark.sql("insert into target select * from source")

  was:
*Issue:*

Insert into table select from another table throws exception for spatial 
tables. NoSuchElementException exception is thrown with 'mygeohash' column.

{color:#ff}Exception in thread "main" java.util.NoSuchElementException: key 
not found: mygeohashException in thread "main" 
java.util.NoSuchElementException: key not found: mygeohash at 
scala.collection.MapLike$class.default(MapLike.scala:228) at 
scala.collection.AbstractMap.default(Map.scala:59) at 
scala.collection.mutable.HashMap.apply(HashMap.scala:65) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:504)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:497)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.getReArrangedIndexAndSelectedSchema(CarbonInsertIntoCommand.scala:496)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.processData(CarbonInsertIntoCommand.scala:164){color}

*Step to reproduce:*
 # Create source table and target table spatial tables.
 # Load data to source table.
 # Insert into target table select from source table.

*TestCase:*

spark.sql(s"""
 | CREATE TABLE source(
 | timevalue BIGINT,
 | longitude LONG,
 | latitude LONG) COMMENT "This is a GeoTable"
 | STORED AS carbondata
 | TBLPROPERTIES 

[jira] [Updated] (CARBONDATA-3815) Insert into table select from another table throws exception for spatial tables

2020-05-11 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3815:
--
Description: 
*Issue:*

Insert into table select from another table throws exception for spatial 
tables. NoSuchElementException exception is thrown with 'mygeohash' column.

{color:#ff}Exception in thread "main" java.util.NoSuchElementException: key 
not found: mygeohashException in thread "main" 
java.util.NoSuchElementException: key not found: mygeohash at 
scala.collection.MapLike$class.default(MapLike.scala:228) at 
scala.collection.AbstractMap.default(Map.scala:59) at 
scala.collection.mutable.HashMap.apply(HashMap.scala:65) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:504)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:497)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.getReArrangedIndexAndSelectedSchema(CarbonInsertIntoCommand.scala:496)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.processData(CarbonInsertIntoCommand.scala:164){color}

*Step to reproduce:*
 # Create source table and target table spatial tables.
 # Load data to source table.
 # Insert into target table select from source table.

*TestCase:*

spark.sql(s"""
 | CREATE TABLE source(
 | timevalue BIGINT,
 | longitude LONG,
 | latitude LONG) COMMENT "This is a GeoTable"
 | STORED AS carbondata
 | TBLPROPERTIES ('INDEX_HANDLER'='mygeohash',
 | 'INDEX_HANDLER.mygeohash.type'='geohash',
 | 'INDEX_HANDLER.mygeohash.sourcecolumns'='longitude, latitude',
 | 'INDEX_HANDLER.mygeohash.originLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.gridSize'='50',
 | 'INDEX_HANDLER.mygeohash.minLongitude'='115.811865',
 | 'INDEX_HANDLER.mygeohash.maxLongitude'='116.782233',
 | 'INDEX_HANDLER.mygeohash.minLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.maxLatitude'='40.225281',
 | 'INDEX_HANDLER.mygeohash.conversionRatio'='100')
 """.stripMargin)

val path = s"$rootPath/examples/spark/src/main/resources/geodata.csv"

spark.sql(s"""
 | LOAD DATA LOCAL INPATH '$path'
 | INTO TABLE source
 | OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='#')
 """.stripMargin)

spark.sql(s"""
 | CREATE TABLE target(
 | timevalue BIGINT,
 | longitude LONG,
 | latitude LONG) COMMENT "This is a GeoTable"
 | STORED AS carbondata
 | TBLPROPERTIES ('INDEX_HANDLER'='mygeohash',
 | 'INDEX_HANDLER.mygeohash.type'='geohash',
 | 'INDEX_HANDLER.mygeohash.sourcecolumns'='longitude, latitude',
 | 'INDEX_HANDLER.mygeohash.originLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.gridSize'='50',
 | 'INDEX_HANDLER.mygeohash.minLongitude'='115.811865',
 | 'INDEX_HANDLER.mygeohash.maxLongitude'='116.782233',
 | 'INDEX_HANDLER.mygeohash.minLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.maxLatitude'='40.225281',
 | 'INDEX_HANDLER.mygeohash.conversionRatio'='100')
 """.stripMargin)

spark.sql("insert into target select * from source")

  was:
*Issue:*

Insert into table select from another table throws exception for spatial 
tables. NoSuchElementException exception is thrown with 'mygeohash' column.

{color:#FF}Exception in thread "main" java.util.NoSuchElementException: key 
not found: mygeohashException in thread "main" 
java.util.NoSuchElementException: key not found: mygeohash at 
scala.collection.MapLike$class.default(MapLike.scala:228) at 
scala.collection.AbstractMap.default(Map.scala:59) at 
scala.collection.mutable.HashMap.apply(HashMap.scala:65) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:504)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:497)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.getReArrangedIndexAndSelectedSchema(CarbonInsertIntoCommand.scala:496)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.processData(CarbonInsertIntoCommand.scala:164){color}

*Step to reproduce:*
 # Create source table and target table spatial tables.
 # Load data to source table.
 # Insert into target table select from source table.

*TestCase:*

spark.sql(s"""
 | CREATE TABLE source(
 | timevalue BIGINT,
 | longitude LONG,
 | latitude LONG) COMMENT "This is a GeoTable"
 | STORED AS carbondata
 | TBLPROPERTIES 

[jira] [Updated] (CARBONDATA-3815) Insert into table select from another table throws exception for spatial tables

2020-05-11 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3815:
--
Summary: Insert into table select from another table throws exception for 
spatial tables  (was: Insert into table select from another table throws 
exception)

> Insert into table select from another table throws exception for spatial 
> tables
> ---
>
> Key: CARBONDATA-3815
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3815
> Project: CarbonData
>  Issue Type: Bug
>  Components: core, spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Major
>
> *Issue:*
> Insert into a table selecting from another table throws an exception for spatial 
> tables. A NoSuchElementException is thrown for the 'mygeohash' column.
> {color:#FF}Exception in thread "main" java.util.NoSuchElementException: 
> key not found: mygeohashException in thread "main" 
> java.util.NoSuchElementException: key not found: mygeohash at 
> scala.collection.MapLike$class.default(MapLike.scala:228) at 
> scala.collection.AbstractMap.default(Map.scala:59) at 
> scala.collection.mutable.HashMap.apply(HashMap.scala:65) at 
> org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:504)
>  at 
> org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:497)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
> org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.getReArrangedIndexAndSelectedSchema(CarbonInsertIntoCommand.scala:496)
>  at 
> org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.processData(CarbonInsertIntoCommand.scala:164){color}
> *Step to reproduce:*
>  # Create source and target spatial tables.
>  # Load data to source table.
>  # Insert into target table select from source table.
> *TestCase:*
> spark.sql(s"""
>  | CREATE TABLE source(
>  | timevalue BIGINT,
>  | longitude LONG,
>  | latitude LONG) COMMENT "This is a GeoTable"
>  | STORED AS carbondata
>  | TBLPROPERTIES ('INDEX_HANDLER'='mygeohash',
>  | 'INDEX_HANDLER.mygeohash.type'='geohash',
>  | 'INDEX_HANDLER.mygeohash.sourcecolumns'='longitude, latitude',
>  | 'INDEX_HANDLER.mygeohash.originLatitude'='39.832277',
>  | 'INDEX_HANDLER.mygeohash.gridSize'='50',
>  | 'INDEX_HANDLER.mygeohash.minLongitude'='115.811865',
>  | 'INDEX_HANDLER.mygeohash.maxLongitude'='116.782233',
>  | 'INDEX_HANDLER.mygeohash.minLatitude'='39.832277',
>  | 'INDEX_HANDLER.mygeohash.maxLatitude'='40.225281',
>  | 'INDEX_HANDLER.mygeohash.conversionRatio'='100')
>  """.stripMargin)
> val path = s"$rootPath/examples/spark/src/main/resources/geodata.csv"
> // scalastyle:off
> spark.sql(
>  s"""
>  | LOAD DATA LOCAL INPATH '$path'
>  | INTO TABLE source
>  | OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='#')
>  """.stripMargin)
> spark.sql(s"""
>  | CREATE TABLE target(
>  | timevalue BIGINT,
>  | longitude LONG,
>  | latitude LONG) COMMENT "This is a GeoTable"
>  | STORED AS carbondata
>  | TBLPROPERTIES ('INDEX_HANDLER'='mygeohash',
>  | 'INDEX_HANDLER.mygeohash.type'='geohash',
>  | 'INDEX_HANDLER.mygeohash.sourcecolumns'='longitude, latitude',
>  | 'INDEX_HANDLER.mygeohash.originLatitude'='39.832277',
>  | 'INDEX_HANDLER.mygeohash.gridSize'='50',
>  | 'INDEX_HANDLER.mygeohash.minLongitude'='115.811865',
>  | 'INDEX_HANDLER.mygeohash.maxLongitude'='116.782233',
>  | 'INDEX_HANDLER.mygeohash.minLatitude'='39.832277',
>  | 'INDEX_HANDLER.mygeohash.maxLatitude'='40.225281',
>  | 'INDEX_HANDLER.mygeohash.conversionRatio'='100')
>  """.stripMargin)
> spark.sql("insert into target select * from source")



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3815) Insert into table select from another table throws exception

2020-05-11 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-3815:
-

 Summary: Insert into table select from another table throws 
exception
 Key: CARBONDATA-3815
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3815
 Project: CarbonData
  Issue Type: Bug
  Components: core, spark-integration
Affects Versions: 2.0.0
Reporter: Venugopal Reddy K


*Issue:*

Insert into a table selecting from another table throws an exception for spatial 
tables. A NoSuchElementException is thrown for the 'mygeohash' column.

{color:#FF}Exception in thread "main" java.util.NoSuchElementException: key 
not found: mygeohashException in thread "main" 
java.util.NoSuchElementException: key not found: mygeohash at 
scala.collection.MapLike$class.default(MapLike.scala:228) at 
scala.collection.AbstractMap.default(Map.scala:59) at 
scala.collection.mutable.HashMap.apply(HashMap.scala:65) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:504)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand$$anonfun$getReArrangedIndexAndSelectedSchema$5.apply(CarbonInsertIntoCommand.scala:497)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.getReArrangedIndexAndSelectedSchema(CarbonInsertIntoCommand.scala:496)
 at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoCommand.processData(CarbonInsertIntoCommand.scala:164){color}

*Step to reproduce:*
 # Create source and target spatial tables.
 # Load data to source table.
 # Insert into target table select from source table.

*TestCase:*

spark.sql(s"""
 | CREATE TABLE source(
 | timevalue BIGINT,
 | longitude LONG,
 | latitude LONG) COMMENT "This is a GeoTable"
 | STORED AS carbondata
 | TBLPROPERTIES ('INDEX_HANDLER'='mygeohash',
 | 'INDEX_HANDLER.mygeohash.type'='geohash',
 | 'INDEX_HANDLER.mygeohash.sourcecolumns'='longitude, latitude',
 | 'INDEX_HANDLER.mygeohash.originLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.gridSize'='50',
 | 'INDEX_HANDLER.mygeohash.minLongitude'='115.811865',
 | 'INDEX_HANDLER.mygeohash.maxLongitude'='116.782233',
 | 'INDEX_HANDLER.mygeohash.minLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.maxLatitude'='40.225281',
 | 'INDEX_HANDLER.mygeohash.conversionRatio'='100')
 """.stripMargin)
val path = s"$rootPath/examples/spark/src/main/resources/geodata.csv"

// scalastyle:off
spark.sql(
 s"""
 | LOAD DATA LOCAL INPATH '$path'
 | INTO TABLE source
 | OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='#')
 """.stripMargin)

spark.sql(s"""
 | CREATE TABLE target(
 | timevalue BIGINT,
 | longitude LONG,
 | latitude LONG) COMMENT "This is a GeoTable"
 | STORED AS carbondata
 | TBLPROPERTIES ('INDEX_HANDLER'='mygeohash',
 | 'INDEX_HANDLER.mygeohash.type'='geohash',
 | 'INDEX_HANDLER.mygeohash.sourcecolumns'='longitude, latitude',
 | 'INDEX_HANDLER.mygeohash.originLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.gridSize'='50',
 | 'INDEX_HANDLER.mygeohash.minLongitude'='115.811865',
 | 'INDEX_HANDLER.mygeohash.maxLongitude'='116.782233',
 | 'INDEX_HANDLER.mygeohash.minLatitude'='39.832277',
 | 'INDEX_HANDLER.mygeohash.maxLatitude'='40.225281',
 | 'INDEX_HANDLER.mygeohash.conversionRatio'='100')
 """.stripMargin)
spark.sql("insert into target select * from source")



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3793) Data load with partition columns fail with InvalidLoadOptionException when load option 'header' is set to 'true'

2020-05-05 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3793:
--
Summary: Data load with partition columns fail with 
InvalidLoadOptionException when load option 'header' is set to 'true'  (was: 
Data load with partition columns fail with with InvalidLoadOptionException when 
load option 'header' is set to 'true')

> Data load with partition columns fail with InvalidLoadOptionException when 
> load option 'header' is set to 'true'
> 
>
> Key: CARBONDATA-3793
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3793
> Project: CarbonData
>  Issue Type: Bug
>  Components: spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Minor
> Attachments: Selection_001.png
>
>
> *Issue:*
> Data load with partition columns fails with `InvalidLoadOptionException` when the 
> load option `header` is set to `true`.
>  
> *CallStack:*
> 2020-05-05 21:49:35 AUDIT audit:97 - {"time":"5 May, 2020 9:49:35 PM 
> IST","username":"root1","opName":"LOAD 
> DATA","opId":"199081091980878","opStatus":"FAILED","opTime":"1734 
> ms","table":"default.source","extraInfo":{color:#ff}{"Exception":"org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException","Message":"When
>  'header' option is true, 'fileheader' option is not required."}}{color}
> Exception in thread "main" 
> org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException: When 
> 'header' option is true, 'fileheader' option is not required.
> at 
> org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:203)
> at 
> org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:126)
> at 
> org.apache.spark.sql.execution.datasources.SparkCarbonTableFormat.prepareWrite(SparkCarbonTableFormat.scala:132)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
> at 
> org.apache.spark.sql.execution.command.management.CarbonInsertIntoHadoopFsRelationCommand.run(CarbonInsertIntoHadoopFsRelationCommand.scala:160)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3793) Data load with partition columns fail with with InvalidLoadOptionException when load option `header` is set to `true`

2020-05-05 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3793:
--
Description: 
*Issue:*

Data load with partition columns fails with `InvalidLoadOptionException` when the 
load option `header` is set to `true`.

 

*CallStack:*

2020-05-05 21:49:35 AUDIT audit:97 - {"time":"5 May, 2020 9:49:35 PM 
IST","username":"root1","opName":"LOAD 
DATA","opId":"199081091980878","opStatus":"FAILED","opTime":"1734 
ms","table":"default.source","extraInfo":{color:#ff}{"Exception":"org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException","Message":"When
 'header' option is true, 'fileheader' option is not required."}}{color}

Exception in thread "main" 
org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException: When 
'header' option is true, 'fileheader' option is not required.

at 
org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:203)

at 
org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:126)

at 
org.apache.spark.sql.execution.datasources.SparkCarbonTableFormat.prepareWrite(SparkCarbonTableFormat.scala:132)

at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)

at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoHadoopFsRelationCommand.run(CarbonInsertIntoHadoopFsRelationCommand.scala:160)

at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)

at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)

  was:
*Issue:*

Data load with partition fails when load option 'header' is set to 'true'

 

*CallStack:*

2020-05-05 21:49:35 AUDIT audit:97 - {"time":"5 May, 2020 9:49:35 PM 
IST","username":"root1","opName":"LOAD 
DATA","opId":"199081091980878","opStatus":"FAILED","opTime":"1734 
ms","table":"default.source","extraInfo":{color:#ff}{"Exception":"org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException","Message":"When
 'header' option is true, 'fileheader' option is not required."}}{color}

Exception in thread "main" 
org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException: When 
'header' option is true, 'fileheader' option is not required.

at 
org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:203)

at 
org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:126)

at 
org.apache.spark.sql.execution.datasources.SparkCarbonTableFormat.prepareWrite(SparkCarbonTableFormat.scala:132)

at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)

at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoHadoopFsRelationCommand.run(CarbonInsertIntoHadoopFsRelationCommand.scala:160)

at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)

at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)

Summary: Data load with partition columns fail with with 
InvalidLoadOptionException when load option `header` is set to `true`  (was: 
Data load with partition columns fails when load option 'header' is set to 
'true')

> Data load with partition columns fail with with InvalidLoadOptionException 
> when load option `header` is set to `true`
> -
>
> Key: CARBONDATA-3793
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3793
> Project: CarbonData
>  Issue Type: Bug
>  Components: spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Minor
> Attachments: Selection_001.png
>
>
> *Issue:*
> Data load with partition columns fails with `InvalidLoadOptionException` when the 
> load option `header` is set to `true`.
>  
> *CallStack:*
> 2020-05-05 21:49:35 AUDIT audit:97 - {"time":"5 May, 2020 9:49:35 PM 
> IST","username":"root1","opName":"LOAD 
> DATA","opId":"199081091980878","opStatus":"FAILED","opTime":"1734 
> ms","table":"default.source","extraInfo":{color:#ff}{"Exception":"org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException","Message":"When
>  'header' option is true, 'fileheader' option is not required."}}{color}
> Exception in thread "main" 
> org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException: When 
> 'header' option is true, 'fileheader' option is not required.
> at 
> org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:203)
> at 
> 

[jira] [Updated] (CARBONDATA-3793) Data load with partition columns fail with with InvalidLoadOptionException when load option 'header' is set to 'true'

2020-05-05 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3793:
--
Summary: Data load with partition columns fail with with 
InvalidLoadOptionException when load option 'header' is set to 'true'  (was: 
Data load with partition columns fail with with InvalidLoadOptionException when 
load option `header` is set to `true`)

> Data load with partition columns fail with with InvalidLoadOptionException 
> when load option 'header' is set to 'true'
> -
>
> Key: CARBONDATA-3793
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3793
> Project: CarbonData
>  Issue Type: Bug
>  Components: spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Minor
> Attachments: Selection_001.png
>
>
> *Issue:*
> Data load with partition columns fails with `InvalidLoadOptionException` when the 
> load option `header` is set to `true`.
>  
> *CallStack:*
> 2020-05-05 21:49:35 AUDIT audit:97 - {"time":"5 May, 2020 9:49:35 PM 
> IST","username":"root1","opName":"LOAD 
> DATA","opId":"199081091980878","opStatus":"FAILED","opTime":"1734 
> ms","table":"default.source","extraInfo":{color:#ff}{"Exception":"org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException","Message":"When
>  'header' option is true, 'fileheader' option is not required."}}{color}
> Exception in thread "main" 
> org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException: When 
> 'header' option is true, 'fileheader' option is not required.
> at 
> org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:203)
> at 
> org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:126)
> at 
> org.apache.spark.sql.execution.datasources.SparkCarbonTableFormat.prepareWrite(SparkCarbonTableFormat.scala:132)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
> at 
> org.apache.spark.sql.execution.command.management.CarbonInsertIntoHadoopFsRelationCommand.run(CarbonInsertIntoHadoopFsRelationCommand.scala:160)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3793) Data load with partition columns fails when load option 'header' is set to 'true'

2020-05-05 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-3793:
-

 Summary: Data load with partition columns fails when load option 
'header' is set to 'true'
 Key: CARBONDATA-3793
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3793
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 2.0.0
Reporter: Venugopal Reddy K
 Attachments: Selection_001.png

Data load with partition columns fails when the load option 'header' is set to 'true'.

2020-05-05 21:49:35 AUDIT audit:97 - {"time":"5 May, 2020 9:49:35 PM 
IST","username":"root1","opName":"LOAD 
DATA","opId":"199081091980878","opStatus":"FAILED","opTime":"1734 
ms","table":"default.source","extraInfo":{color:#FF}{"Exception":"org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException","Message":"When
 'header' option is true, 'fileheader' option is not required."}}{color}

Exception in thread "main" 
org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException: When 
'header' option is true, 'fileheader' option is not required.

at 
org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:203)

at 
org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:126)

at 
org.apache.spark.sql.execution.datasources.SparkCarbonTableFormat.prepareWrite(SparkCarbonTableFormat.scala:132)

at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)

at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoHadoopFsRelationCommand.run(CarbonInsertIntoHadoopFsRelationCommand.scala:160)

at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)

at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
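
For reference, a minimal sketch of a load that hits this error, assuming a 
partitioned carbondata table; the table name, columns, and CSV path are hypothetical:

// Hypothetical repro: only the 'header'='true' load option and the
// partitioned-table setup come from this report.
spark.sql("""
  CREATE TABLE source(id INT, name STRING)
  PARTITIONED BY (country STRING)
  STORED AS carbondata
  """)

// Fails with InvalidLoadOptionException: "When 'header' option is true,
// 'fileheader' option is not required." even though 'fileheader' was not set.
spark.sql("""
  LOAD DATA LOCAL INPATH '/tmp/data.csv'
  INTO TABLE source
  OPTIONS('header'='true')
  """)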



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3793) Data load with partition columns fails when load option 'header' is set to 'true'

2020-05-05 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3793:
--
Description: 
*Issue:*

Data load with partition columns fails when the load option 'header' is set to 'true'.

 

*CallStack:*

2020-05-05 21:49:35 AUDIT audit:97 - {"time":"5 May, 2020 9:49:35 PM 
IST","username":"root1","opName":"LOAD 
DATA","opId":"199081091980878","opStatus":"FAILED","opTime":"1734 
ms","table":"default.source","extraInfo":{color:#ff}{"Exception":"org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException","Message":"When
 'header' option is true, 'fileheader' option is not required."}}{color}

Exception in thread "main" 
org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException: When 
'header' option is true, 'fileheader' option is not required.

at 
org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:203)

at 
org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:126)

at 
org.apache.spark.sql.execution.datasources.SparkCarbonTableFormat.prepareWrite(SparkCarbonTableFormat.scala:132)

at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)

at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoHadoopFsRelationCommand.run(CarbonInsertIntoHadoopFsRelationCommand.scala:160)

at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)

at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)

  was:
Data load with partition fails when load option 'header' is set to 'true'

2020-05-05 21:49:35 AUDIT audit:97 - {"time":"5 May, 2020 9:49:35 PM 
IST","username":"root1","opName":"LOAD 
DATA","opId":"199081091980878","opStatus":"FAILED","opTime":"1734 
ms","table":"default.source","extraInfo":{color:#FF}{"Exception":"org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException","Message":"When
 'header' option is true, 'fileheader' option is not required."}}{color}

Exception in thread "main" 
org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException: When 
'header' option is true, 'fileheader' option is not required.

at 
org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:203)

at 
org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:126)

at 
org.apache.spark.sql.execution.datasources.SparkCarbonTableFormat.prepareWrite(SparkCarbonTableFormat.scala:132)

at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)

at 
org.apache.spark.sql.execution.command.management.CarbonInsertIntoHadoopFsRelationCommand.run(CarbonInsertIntoHadoopFsRelationCommand.scala:160)

at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)

at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)


> Data load with partition columns fails when load option 'header' is set to 
> 'true'
> -
>
> Key: CARBONDATA-3793
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3793
> Project: CarbonData
>  Issue Type: Bug
>  Components: spark-integration
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Minor
> Attachments: Selection_001.png
>
>
> *Issue:*
> Data load with partition columns fails when the load option 'header' is set to 'true'.
>  
> *CallStack:*
> 2020-05-05 21:49:35 AUDIT audit:97 - {"time":"5 May, 2020 9:49:35 PM 
> IST","username":"root1","opName":"LOAD 
> DATA","opId":"199081091980878","opStatus":"FAILED","opTime":"1734 
> ms","table":"default.source","extraInfo":{color:#ff}{"Exception":"org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException","Message":"When
>  'header' option is true, 'fileheader' option is not required."}}{color}
> Exception in thread "main" 
> org.apache.carbondata.common.exceptions.sql.InvalidLoadOptionException: When 
> 'header' option is true, 'fileheader' option is not required.
> at 
> org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:203)
> at 
> org.apache.carbondata.processing.loading.model.CarbonLoadModelBuilder.build(CarbonLoadModelBuilder.java:126)
> at 
> org.apache.spark.sql.execution.datasources.SparkCarbonTableFormat.prepareWrite(SparkCarbonTableFormat.scala:132)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
> at 
> 

[jira] [Closed] (CARBONDATA-3780) Old store read compatibility issue for Index handler column

2020-05-05 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K closed CARBONDATA-3780.
-
Resolution: Invalid

> Old store read compatibility issue for Index handler column
> ---
>
> Key: CARBONDATA-3780
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3780
> Project: CarbonData
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>Reporter: Venugopal Reddy K
>Priority: Minor
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Tables created in older versions do not have the indexColumn property in 
> columnSchema. Newer versions have the indexColumn property in columnSchema with 
> a default value of false. Due to this, when an old store is read, the index 
> handler column, which is hidden from the user, is also treated as a normal 
> schema column, and operations like alter table can be performed on the index column.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3783) Alter table drop column on main table is not dropping the eligible secondary index tables

2020-04-23 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-3783:
-

 Summary: Alter table drop column on main table is not dropping the 
eligible secondary index tables
 Key: CARBONDATA-3783
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3783
 Project: CarbonData
  Issue Type: Bug
Reporter: Venugopal Reddy K


Alter table drop column on the main table does not actually drop the corresponding 
secondary index table, even when the dropped columns match the secondary index 
table's columns.
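
For illustration, a sketch of the scenario with hypothetical table, column, and 
index names; the CREATE INDEX ... AS 'carbondata' form is the usual CarbonData 
secondary index DDL, but verify it against the version in use:

// Hypothetical names throughout.
spark.sql("CREATE TABLE main_tbl(id INT, name STRING, city STRING) STORED AS carbondata")
spark.sql("CREATE INDEX city_index ON TABLE main_tbl(city) AS 'carbondata'")

// city_index is built only on the dropped column, so it should be dropped
// along with the column; per this report, the index table is left behind.
spark.sql("ALTER TABLE main_tbl DROP COLUMNS(city)")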



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3780) Old store read compatibility issue for Index handler column

2020-04-22 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-3780:
-

 Summary: Old store read compatibility issue for Index handler 
column
 Key: CARBONDATA-3780
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3780
 Project: CarbonData
  Issue Type: Bug
  Components: core
Affects Versions: 2.0.0
Reporter: Venugopal Reddy K
 Fix For: 2.0.0


Tables created in older versions do not have the indexColumn property in 
columnSchema. Newer versions have the indexColumn property in columnSchema with 
a default value of false. Due to this, when an old store is read, the index handler 
column, which is hidden from the user, is also treated as a normal schema column, 
and operations like alter table can be performed on the index column.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3779) BlockletIndexInputFormat object instantiation failed due to mismatch in constructor params

2020-04-22 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-3779:
-

 Summary: BlockletIndexInputFormat object instantiation failed due 
to mismatch in constructor params
 Key: CARBONDATA-3779
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3779
 Project: CarbonData
  Issue Type: Bug
  Components: core
Reporter: Venugopal Reddy K


BlockletIndexInputFormat object instantiation fails due to a mismatch in 
constructor params.

We use Java reflection to create the BlockletIndexInputFormat. The actual class 
constructor arguments are different from the ones passed when instantiating it 
via reflection. So an IllegalArgumentException is thrown with "wrong number of 
arguments".
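
For illustration, a generic sketch of how this class of failure shows up when 
constructing an object through reflection; the class below is a stand-in, not the 
actual BlockletIndexInputFormat:

// Stand-in class with a two-argument constructor.
class TwoArgFormat(tablePath: String, segmentId: String)

// Reflection call that passes only one argument, mirroring the reported
// mismatch between the real constructor and the reflective instantiation.
val ctor = classOf[TwoArgFormat].getDeclaredConstructors.head
try {
  ctor.newInstance("/store/table1") // wrong number of arguments
} catch {
  case e: IllegalArgumentException =>
    println(s"Instantiation failed: ${e.getMessage}")
}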



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (CARBONDATA-3548) Support for Geospatial indexing

2020-01-09 Thread Venugopal Reddy K (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011544#comment-17011544
 ] 

Venugopal Reddy K edited comment on CARBONDATA-3548 at 1/10/20 4:51 AM:


Updated for
 # algorithm description.
 # Modified polygon UDF syntax as -
IN_POLYGON('116.321011 40.123503, 116.137676 39.947911, 116.560993 39.935276, 
116.321011 40.123503')
 # Used IN filter expression with a LIST expression containing all the 
geohashIds to be filtered instead of RANGE filter expression as this improves 
the query performance significantly.


was (Author: venureddy):
Updated for
 # algorithm description.
 # Used IN filter expression with a LIST expression containing all the 
geohashIds to be filtered instead of RANGE filter expression as this improves 
the query performance significantly.

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf, 
> Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 63h
>  Remaining Estimate: 0h
>
> In general, a database may contain geographical location data. For instance, 
> telecom operators need to perform analytics based on a particular region, 
> cell tower IDs (within a region), and/or geographical locations for a 
> particular period of time. At present, Carbon does not have native support to 
> store geographical locations/coordinates and to run filter queries based on 
> them. Longitude and latitude can still be treated as independent columns, 
> sorted hierarchically, and stored.
>          But when longitude and latitude are treated independently, 2D space 
> is linearized, i.e., points in the two-dimensional domain are ordered by 
> sorting first on longitude and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require a lot of IO operations and 
> query performance is degraded.
>         To alleviate this, we can use a z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present in the same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. It can also support polygon queries on geodata. The attached 
> design document describes this in detail.
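
To make the z-order idea concrete, here is a small illustrative sketch of 
interleaving the bits of two grid indices (derived from longitude and latitude) 
into a single z-order key; this is a generic bit-interleaving example, not the 
actual CarbonData geohash implementation:

// Interleave the bits of two 16-bit grid indices (x from longitude, y from
// latitude) so that nearby (x, y) cells get nearby z-order keys.
def zOrder(x: Int, y: Int): Long = {
  var key = 0L
  for (i <- 0 until 16) {
    key |= ((x >> i) & 1L) << (2 * i)       // even bit positions from x
    key |= ((y >> i) & 1L) << (2 * i + 1)   // odd bit positions from y
  }
  key
}

// Sorting rows by this key keeps geographically close points in the same
// block/blocklet, which is what enables effective min/max pruning.
println(s"z-order key = ${zOrder(40123, 39947)}")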



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2020-01-09 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf, 
> Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 63h
>  Remaining Estimate: 0h
>
> In general, a database may contain geographical location data. For instance, 
> telecom operators need to perform analytics based on a particular region, 
> cell tower IDs (within a region), and/or geographical locations for a 
> particular period of time. At present, Carbon does not have native support to 
> store geographical locations/coordinates and to run filter queries based on 
> them. Longitude and latitude can still be treated as independent columns, 
> sorted hierarchically, and stored.
>          But when longitude and latitude are treated independently, 2D space 
> is linearized, i.e., points in the two-dimensional domain are ordered by 
> sorting first on longitude and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require a lot of IO operations and 
> query performance is degraded.
>         To alleviate this, we can use a z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present in the same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. It can also support polygon queries on geodata. The attached 
> design document describes this in detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2020-01-09 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: (was: Geospatial Index Design Doc-OpenSource-Version 
2.0.pdf)

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf, 
> Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 63h
>  Remaining Estimate: 0h
>
> In general, a database may contain geographical location data. For instance, 
> telecom operators need to perform analytics based on a particular region, 
> cell tower IDs (within a region), and/or geographical locations for a 
> particular period of time. At present, Carbon does not have native support to 
> store geographical locations/coordinates and to run filter queries based on 
> them. Longitude and latitude can still be treated as independent columns, 
> sorted hierarchically, and stored.
>          But when longitude and latitude are treated independently, 2D space 
> is linearized, i.e., points in the two-dimensional domain are ordered by 
> sorting first on longitude and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require a lot of IO operations and 
> query performance is degraded.
>         To alleviate this, we can use a z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present in the same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. It can also support polygon queries on geodata. The attached 
> design document describes this in detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2020-01-09 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: Geospatial Index Design Doc-OpenSource.pdf

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 63h
>  Remaining Estimate: 0h
>
> In general, a database may contain geographical location data. For instance, 
> telecom operators need to perform analytics based on a particular region, 
> cell tower IDs (within a region), and/or geographical locations for a 
> particular period of time. At present, Carbon does not have native support to 
> store geographical locations/coordinates and to run filter queries based on 
> them. Longitude and latitude can still be treated as independent columns, 
> sorted hierarchically, and stored.
>          But when longitude and latitude are treated independently, 2D space 
> is linearized, i.e., points in the two-dimensional domain are ordered by 
> sorting first on longitude and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require a lot of IO operations and 
> query performance is degraded.
>         To alleviate this, we can use a z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present in the same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. It can also support polygon queries on geodata. The attached 
> design document describes this in detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2020-01-09 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: (was: Geospatial Index Design Doc-OpenSource.pdf)

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf
>
>  Time Spent: 63h
>  Remaining Estimate: 0h
>
> In general, a database may contain geographical location data. For instance, 
> telecom operators need to perform analytics based on a particular region, 
> cell tower IDs (within a region), and/or geographical locations for a 
> particular period of time. At present, Carbon does not have native support to 
> store geographical locations/coordinates and to run filter queries based on 
> them. Longitude and latitude can still be treated as independent columns, 
> sorted hierarchically, and stored.
>          But when longitude and latitude are treated independently, 2D space 
> is linearized, i.e., points in the two-dimensional domain are ordered by 
> sorting first on longitude and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require a lot of IO operations and 
> query performance is degraded.
>         To alleviate this, we can use a z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present in the same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. It can also support polygon queries on geodata. The attached 
> design document describes this in detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2020-01-09 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf
>
>  Time Spent: 63h
>  Remaining Estimate: 0h
>
> In general, a database may contain geographical location data. For instance, 
> telecom operators need to perform analytics based on a particular region, 
> cell tower IDs (within a region) and/or geographical locations for a 
> particular period of time. At present, Carbon does not have native support to 
> store geographical locations/coordinates and to run filter queries based on 
> them. Longitude and latitude can still be treated as independent columns, 
> sorted hierarchically and stored.
>          But when longitude and latitude are treated independently, the 2D space 
> is linearized, i.e., points in the two-dimensional domain are ordered by 
> sorting first on longitude and then on latitude. Thus, data is not ordered by 
> geospatial proximity, so range queries require a lot of IO operations and 
> query performance degrades.
>         To alleviate this, we can use a z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present in the same 
> block/blocklet, which reduces the IO operations for range queries and improves 
> query performance. Polygon queries on geodata can also be supported. The 
> attached design document describes this in detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2020-01-09 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: (was: Geospatial Index Design Doc-OpenSource-Version 
2.0.pdf)

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 63h
>  Remaining Estimate: 0h
>
> In general, a database may contain geographical location data. For instance, 
> telecom operators need to perform analytics based on a particular region, 
> cell tower IDs (within a region) and/or geographical locations for a 
> particular period of time. At present, Carbon does not have native support to 
> store geographical locations/coordinates and to run filter queries based on 
> them. Longitude and latitude can still be treated as independent columns, 
> sorted hierarchically and stored.
>          But when longitude and latitude are treated independently, the 2D space 
> is linearized, i.e., points in the two-dimensional domain are ordered by 
> sorting first on longitude and then on latitude. Thus, data is not ordered by 
> geospatial proximity, so range queries require a lot of IO operations and 
> query performance degrades.
>         To alleviate this, we can use a z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present in the same 
> block/blocklet, which reduces the IO operations for range queries and improves 
> query performance. Polygon queries on geodata can also be supported. The 
> attached design document describes this in detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CARBONDATA-3548) Support for Geospatial indexing

2020-01-09 Thread Venugopal Reddy K (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011544#comment-17011544
 ] 

Venugopal Reddy K commented on CARBONDATA-3548:
---

Updated for
 # the algorithm description.
 # use of an IN filter expression with a LIST expression containing all the 
geohashIds to be filtered, instead of a RANGE filter expression, as this improves 
query performance significantly (see the sketch below).
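Purely as an illustration of that choice (not the actual filter-expression code), here is a minimal sketch comparing one IN predicate over the matched geohashIds with the equivalent OR of range predicates; the column name `geohashid` and the SQL strings are assumptions for the example.

{code:java}
import java.util.List;
import java.util.stream.Collectors;

// Illustrates why a single IN filter over the matched geohashIds can be cheaper
// than many RANGE filters: pruning evaluates one explicit value list instead of
// a chain of range conditions. Column name and SQL text are illustrative only.
public final class InVsRangeSketch {

  static String inFilter(List<Long> geohashIds) {
    return "geohashid IN (" + geohashIds.stream()
        .map(String::valueOf).collect(Collectors.joining(", ")) + ")";
  }

  static String rangeFilter(List<long[]> ranges) {
    return ranges.stream()
        .map(r -> "(geohashid >= " + r[0] + " AND geohashid <= " + r[1] + ")")
        .collect(Collectors.joining(" OR "));
  }

  public static void main(String[] args) {
    List<Long> ids = List.of(1021L, 1022L, 1023L, 1045L);
    List<long[]> ranges = List.of(new long[]{1021L, 1023L}, new long[]{1045L, 1045L});
    System.out.println(inFilter(ids));       // single IN predicate over all matched ids
    System.out.println(rangeFilter(ranges)); // equivalent OR of range predicates
  }
}
{code}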

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf, 
> Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 63h
>  Remaining Estimate: 0h
>
> In general, a database may contain geographical location data. For instance, 
> telecom operators need to perform analytics based on a particular region, 
> cell tower IDs (within a region) and/or geographical locations for a 
> particular period of time. At present, Carbon does not have native support to 
> store geographical locations/coordinates and to run filter queries based on 
> them. Longitude and latitude can still be treated as independent columns, 
> sorted hierarchically and stored.
>          But when longitude and latitude are treated independently, the 2D space 
> is linearized, i.e., points in the two-dimensional domain are ordered by 
> sorting first on longitude and then on latitude. Thus, data is not ordered by 
> geospatial proximity, so range queries require a lot of IO operations and 
> query performance degrades.
>         To alleviate this, we can use a z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present in the same 
> block/blocklet, which reduces the IO operations for range queries and improves 
> query performance. Polygon queries on geodata can also be supported. The 
> attached design document describes this in detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2020-01-09 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: (was: Geospatial Index Design Doc-OpenSource-Version 
2.0.pdf)

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf, 
> Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 63h
>  Remaining Estimate: 0h
>
> In general, a database may contain geographical location data. For instance, 
> telecom operators need to perform analytics based on a particular region, 
> cell tower IDs (within a region) and/or geographical locations for a 
> particular period of time. At present, Carbon does not have native support to 
> store geographical locations/coordinates and to run filter queries based on 
> them. Longitude and latitude can still be treated as independent columns, 
> sorted hierarchically and stored.
>          But when longitude and latitude are treated independently, the 2D space 
> is linearized, i.e., points in the two-dimensional domain are ordered by 
> sorting first on longitude and then on latitude. Thus, data is not ordered by 
> geospatial proximity, so range queries require a lot of IO operations and 
> query performance degrades.
>         To alleviate this, we can use a z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present in the same 
> block/blocklet, which reduces the IO operations for range queries and improves 
> query performance. Polygon queries on geodata can also be supported. The 
> attached design document describes this in detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2020-01-09 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf, 
> Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 63h
>  Remaining Estimate: 0h
>
> In general, a database may contain geographical location data. For instance, 
> telecom operators need to perform analytics based on a particular region, 
> cell tower IDs (within a region) and/or geographical locations for a 
> particular period of time. At present, Carbon does not have native support to 
> store geographical locations/coordinates and to run filter queries based on 
> them. Longitude and latitude can still be treated as independent columns, 
> sorted hierarchically and stored.
>          But when longitude and latitude are treated independently, the 2D space 
> is linearized, i.e., points in the two-dimensional domain are ordered by 
> sorting first on longitude and then on latitude. Thus, data is not ordered by 
> geospatial proximity, so range queries require a lot of IO operations and 
> query performance degrades.
>         To alleviate this, we can use a z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present in the same 
> block/blocklet, which reduces the IO operations for range queries and improves 
> query performance. Polygon queries on geodata can also be supported. The 
> attached design document describes this in detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3519) Optimizations in write step to avoid unnecessary memory blk allocation/free

2019-12-23 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Summary: Optimizations in write step to avoid unnecessary memory blk 
allocation/free  (was: A new column page MemoryBlock is allocated at each row 
addition to table page if having string column with local dictionary enabled. )

> Optimizations in write step to avoid unnecessary memory blk allocation/free
> ---
>
> Key: CARBONDATA-3519
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3519
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Reporter: Venugopal Reddy K
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
>  +*Issue-1:*+
> {color:#0747a6}*Context:*{color}
> For a string column with local dictionary enabled, a column page of
> `{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
> `{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
> `{color:#de350b}{{encodedPage}}{color}` along with regular 
> `{color:#de350b}{{actualPage}}{color}` of 
> `{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 
> We have a `{color:#de350b}*{{capacity}}*{color}` field in 
> `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`; it indicates the capacity 
> of the allocated `{color:#de350b}{{memoryBlock}}{color}` for the page. The 
> `{{{color:#de350b}ensureMemory{color}()}}` method is called while adding rows 
> and checks whether `{color:#de350b}{{totalLength + requestSize > 
> capacity}}{color}`; if there is no room to add the next row, it allocates a new 
> memoryBlock, copies the old content (previous rows) and frees the old 
> memoryBlock.
> {color:#0747a6} *Problem:*{color}
> When the `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with 
> datatype `{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
> `{color:#de350b}{{encodedPage}}{color}`, the 
> *`{color:#de350b}{{capacity}}{color}`* field is not assigned the allocated 
> memory block size. Hence, for each row added to the tablePage, the 
> *ensureMemory() check always fails*: it allocates a new column page memoryBlock, 
> copies the old content (previous rows) and frees the old memoryBlock. This 
> *allocation of a new memoryBlock and freeing of the old memoryBlock happens for 
> each row addition* for the string columns with local dictionary.
>  
> +*Issue-2:*+
> {color:#0747a6}*Context:*{color}
> In `{color:#de350b}VarLengthColumnPageBase{color}`, we have a 
> `{color:#de350b}rowOffset{color}` column page of 
> `{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
> `{color:#de350b}INT{color}`
> to maintain the data offset of {color:#172b4d}each{color} row of variable 
> length columns. This `{color:#de350b}rowOffset{color}` page is allocated to be 
> the size of the page. 
> {color:#0747a6} *Problem:*{color}
> {color:#172b4d}If we have 10 rows in the page, we need 11 entries in its 
> rowOffset page, because offset 0 is always kept for the 1st row, so one 
> additional entry is required in the rowOffset page [see the code pasted below 
> for reference]. Otherwise, the *ensureMemory() check always fails for the last 
> row* (the 10th row in this case) of data and *allocates a new rowOffset page 
> memoryBlock, copies the old content (previous rows) and frees the old 
> memoryBlock*. This *can happen for the string columns with local dictionary, 
> direct dictionary columns and global dictionary columns*.{color}
>  
> {code:java}
> public abstract class VarLengthColumnPageBase extends ColumnPage {
> ...
> @Override
> public void putBytes(int rowId, byte[] bytes) {
>  ...
>  if (rowId == 0) {
>  rowOffset.putInt(0, 0); // offset to the 1st row is always 0
>  }
>  rowOffset.putInt(rowId + 1, rowOffset.getInt(rowId) + bytes.length);
>  putBytesAtRow(rowId, bytes);
>  totalLength += bytes.length;
> }
> ...
> }
>  
> {code}
>  
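Purely for illustration, here is a minimal on-heap sketch of the ensureMemory() pattern and the Issue-1 remedy described above (record the allocated size in the capacity field); the plain byte[] stands in for CarbonData's unsafe MemoryBlock, and the class and field names are assumptions for the example, not the actual implementation.

{code:java}
// Sketch of Issue-1: if `capacity` is never set to the allocated block size,
// every putBytes() call sees totalLength + requestSize > capacity and pays a
// reallocate/copy/free cycle. Assigning capacity at allocation time avoids it.
final class FixLengthPageSketch {
  private byte[] memoryBlock;   // stand-in for the unsafe MemoryBlock
  private int capacity;         // must mirror memoryBlock's size (the missing assignment)
  private int totalLength;      // bytes written so far

  FixLengthPageSketch(int initialSize) {
    memoryBlock = new byte[initialSize];
    capacity = initialSize;     // <-- the fix: without this, capacity stays 0 and
                                //     every row addition triggers a reallocation
  }

  private void ensureMemory(int requestSize) {
    if (totalLength + requestSize > capacity) {
      int newSize = Math.max(2 * capacity, totalLength + requestSize);
      byte[] newBlock = new byte[newSize];
      System.arraycopy(memoryBlock, 0, newBlock, 0, totalLength); // copy previous rows
      memoryBlock = newBlock;                                     // old block is dropped
      capacity = newSize;
    }
  }

  void putBytes(byte[] bytes) {
    ensureMemory(bytes.length);
    System.arraycopy(bytes, 0, memoryBlock, totalLength, bytes.length);
    totalLength += bytes.length;
  }

  public static void main(String[] args) {
    FixLengthPageSketch page = new FixLengthPageSketch(64);
    for (int i = 0; i < 10; i++) {
      page.putBytes(("value-" + i).getBytes()); // no realloc until 64 bytes are used
    }
    System.out.println("wrote " + page.totalLength + " bytes, capacity " + page.capacity);
  }
}
{code}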



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-12-20 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
 +*Issue-1:*+

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

`{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}

In`{color:#de350b}VarLengthColumnPageBase{color}`, we have a 
`{color:#de350b}rowOffset{color}` column page of  
`{color:#de350b}UnsafeFixLengthColumnPage{color}` of datatype 
`{color:#de350b}INT{color}`

to maintain the data offset to {color:#172b4d}each{color} row of variable 
length columns. This `{color:#de350b}rowOffset{color}` page is allocated to be 
the size of the page. 

{color:#0747a6} *Problem:*{color}

{color:#172b4d}If we have 10 rows in the page, we need 11 rows for its 
rowOffset page. Because we always keep 0 as offset to 1st row. So an additional 
row is required for rowOffset page[pasted code below to show the reference]. 
Otherwise, *ensureMemory() check always fails for the last row*(10th row in 
this case) of data and *allocates a new rowOffset page memoryBlock, copy the 
old context(prev rows) and free the old memoryBlock*. This *can happen for the 
string columns with local dictionary, direct dictionary columns, global 
dictionary columns*.{color}

 
{code:java}
public abstract class VarLengthColumnPageBase extends ColumnPage {
...
@Override
public void putBytes(int rowId, byte[] bytes) {
 ...
 if (rowId == 0) {
 rowOffset.putInt(0, 0); // offset to the 1st row is always 0
 }
 rowOffset.putInt(rowId + 1, rowOffset.getInt(rowId) + bytes.length);
 putBytesAtRow(rowId, bytes);
 totalLength += bytes.length;
}
...
}
 
{code}
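Purely for illustration, here is a minimal sketch of the Issue-2 sizing rule (the rowOffset page needs pageSize + 1 entries because offset[i + 1] stores the end of row i); the int[] stand-in, class name and sizes are assumptions for the example, not the actual allocation code.

{code:java}
// Sketch of Issue-2: with only pageSize offset entries, writing offset rowId + 1
// for the last row overflows the rowOffset page and forces a realloc/copy/free.
// Allocating pageSize + 1 entries up front avoids that.
final class RowOffsetSizingSketch {
  public static void main(String[] args) {
    int pageSize = 10;                       // rows per page in this example
    int[] rowOffset = new int[pageSize + 1]; // <-- pageSize + 1, not pageSize
    byte[][] rows = new byte[pageSize][];
    for (int rowId = 0; rowId < pageSize; rowId++) {
      rows[rowId] = ("row-" + rowId).getBytes();
      if (rowId == 0) {
        rowOffset[0] = 0;                    // offset to the 1st row is always 0
      }
      // offset[rowId + 1] is the end of row rowId; this is the write that needs
      // the extra entry when rowId == pageSize - 1.
      rowOffset[rowId + 1] = rowOffset[rowId] + rows[rowId].length;
    }
    System.out.println("end offset of last row = " + rowOffset[pageSize]);
  }
}
{code}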
 

  was:
 
{code:java}
 {code}
*Issue:1*

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

`{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}

In`{color:#de350b}VarLengthColumnPageBase{color}`, we have a 

[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-12-20 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
 
{code:java}
 {code}
*Issue:1*

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

`{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}

In`{color:#de350b}VarLengthColumnPageBase{color}`, we have a 
`{color:#de350b}rowOffset{color}` column page of  
`{color:#de350b}UnsafeFixLengthColumnPage{color}` of datatype 
`{color:#de350b}INT{color}`

to maintain the data offset to {color:#172b4d}each{color} row of variable 
length columns. This `{color:#de350b}rowOffset{color}` page is allocated to be 
the size of the page. 

{color:#0747a6} *Problem:*{color}

{color:#172b4d}If we have 10 rows in the page, we need 11 rows for its 
rowOffset page. Because we always keep 0 as offset to 1st row. So an additional 
row is required for rowOffset page[pasted code below to show the reference]. 
Otherwise, *ensureMemory() check always fails for the last row*(10th row in 
this case) of data and *allocates a new rowOffset page memoryBlock, copy the 
old context(prev rows) and free the old memoryBlock*. This *can happen for the 
string columns with local dictionary, direct dictionary columns, global 
dictionary columns*.{color}

 
{code:java}
public abstract class VarLengthColumnPageBase extends ColumnPage {
...
@Override
public void putBytes(int rowId, byte[] bytes) {
 ...
 if (rowId == 0) {
 rowOffset.putInt(0, 0); // offset to the 1st row is always 0
 }
 rowOffset.putInt(rowId + 1, rowOffset.getInt(rowId) + bytes.length);
 putBytesAtRow(rowId, bytes);
 totalLength += bytes.length;
}
...
}
 
{code}
 

  was:
 
{code:java}
 {code}
*// code placeholder**Issue-1:*

 

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

`{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

 

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}


[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-12-20 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
 
{code:java}
 {code}
*// code placeholder**Issue-1:*

 

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

`{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

 

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}

In`{color:#de350b}VarLengthColumnPageBase{color}`, we have a 
`{color:#de350b}rowOffset{color}` column page of  
`{color:#de350b}UnsafeFixLengthColumnPage{color}` of datatype 
`{color:#de350b}INT{color}`

to maintain the data offset to {color:#172b4d}each{color} row of variable 
length columns. This `{color:#de350b}rowOffset{color}` page is allocated to be 
the size of the page. 

{color:#0747a6} *Problem:*{color}

{color:#172b4d}If we have 10 rows in the page, we need 11 rows for its 
rowOffset page. Because we always keep 0 as offset to 1st row. So an additional 
row is required for rowOffset page[pasted code below to show the reference]. 
Otherwise, *ensureMemory() check always fails for the last row*(10th row in 
this case) of data and *allocates a new rowOffset page memoryBlock, copy the 
old context(prev rows) and free the old memoryBlock*. This *can happen for the 
string columns with local dictionary, direct dictionary columns, global 
dictionary columns*.{color}

 
{code:java}
public abstract class VarLengthColumnPageBase extends ColumnPage {
...
@Override
public void putBytes(int rowId, byte[] bytes) {
 ...
 if (rowId == 0) {
 rowOffset.putInt(0, 0); // offset to the 1st row is always 0
 }
 rowOffset.putInt(rowId + 1, rowOffset.getInt(rowId) + bytes.length);
 putBytesAtRow(rowId, bytes);
 totalLength += bytes.length;
}
...
}
 
{code}
 

  was:
 
{code:java}

{code}
*// code placeholder**Issue-1:*

 

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

`{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

 

 

+*Issue-2:*+


[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-12-20 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
 
{code:java}

{code}
*// code placeholder**Issue-1:*

 

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

`{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

 

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}

In`{color:#de350b}VarLengthColumnPageBase{color}`, we have a 
`{color:#de350b}rowOffset{color}` column page of  
`{color:#de350b}UnsafeFixLengthColumnPage{color}` of datatype 
`{color:#de350b}INT{color}`

to maintain the data offset to {color:#172b4d}each{color} row of variable 
length columns. This `{color:#de350b}rowOffset{color}` page is allocated to be 
the size of the page. 

{color:#0747a6} *Problem:*{color}

{color:#172b4d}If we have 10 rows in the page, we need 11 rows for its 
rowOffset page. Because we always keep 0 as offset to 1st row. So an additional 
row is required for rowOffset page[pasted code below to show the reference]. 
Otherwise, *ensureMemory() check always fails for the last row*(10th row in 
this case) of data and *allocates a new rowOffset page memoryBlock, copy the 
old context(prev rows) and free the old memoryBlock*. This *can happen for the 
string columns with local dictionary, direct dictionary columns, global 
dictionary columns*.{color}

 
{code:java}
public abstract class VarLengthColumnPageBase extends ColumnPage {
...
@Override
public void putBytes(int rowId, byte[] bytes) {
 ...
 if (rowId == 0) {
 rowOffset.putInt(0, 0);
 }
 rowOffset.putInt(rowId + 1, rowOffset.getInt(rowId) + bytes.length);
 putBytesAtRow(rowId, bytes);
 totalLength += bytes.length;
}
...
}
 
{code}
 

  was:
+*Issue-1:*+

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

`{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}

In`{color:#de350b}VarLengthColumnPageBase{color}`, we have a 

[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-12-20 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
+*Issue-1:*+

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

`{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}

In`{color:#de350b}VarLengthColumnPageBase{color}`, we have a 
`{color:#de350b}rowOffset{color}` column page of  
`{color:#de350b}UnsafeFixLengthColumnPage{color}` of datatype 
`{color:#de350b}INT{color}`

to maintain the data offset to {color:#172b4d}each{color} row of variable 
length columns. This `{color:#de350b}rowOffset{color}` page is allocated to be 
the size of the page. 

{color:#0747a6} *Problem:*{color}

{color:#172b4d}If we have 10 rows in the page, we need 11 rows for its 
rowOffset page. Because we always keep 0 as offset to 1st row. So an additional 
row is required for rowOffset page[pasted code below to show the reference]. 
Otherwise, *ensureMemory() check always fails for the last row*(10th row in 
this case) of data and *allocates a new rowOffset page memoryBlock, copy the 
old context(prev rows) and free the old memoryBlock*. This *can happen for the 
string columns with local dictionary, direct dictionary columns, global 
dictionary columns*.{color}

 
{code:java}
public abstract class VarLengthColumnPageBase extends ColumnPage {
...
@Override
public void putBytes(int rowId, byte[] bytes) {
 ...
 if (rowId == 0) {
 rowOffset.putInt(0, 0);
 }
 rowOffset.putInt(rowId + 1, rowOffset.getInt(rowId) + bytes.length);
 putBytesAtRow(rowId, bytes);
 totalLength += bytes.length;
}
...
}
 
{code}
 

  was:
+*Issue-1:*+

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

`{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}

In`{color:#de350b}VarLengthColumnPageBase{color}`, we have a 
`{color:#de350b}rowOffset{color}` column page of  

[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-12-20 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
+*Issue-1:*+

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

`{color:#de350b}UnsafeFixLengthColumnPage{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}

In`{color:#de350b}VarLengthColumnPageBase{color}`, we have a 
`{color:#de350b}rowOffset{color}` column page of  
`{color:#de350b}UnsafeFixLengthColumnPage{color}` of datatype 
`{color:#de350b}INT{color}`

to maintain the data offset to {color:#172b4d}each{color} row of variable 
length columns. This `{color:#de350b}rowOffset{color}` page is allocated to be 
the size of the page. 

{color:#0747a6} *Problem:*{color}

{color:#172b4d}If we have 10 rows in the page, we need 11 rows for its 
rowOffset page. Because we always keep 0 as offset to 1st row. So an additional 
row is required for rowOffset page[pasted code below to show the reference]. 
Otherwise, the ensureMemory() check always fails for the last row (10th row in 
this case) of data and allocates a new rowOffset page memoryBlock, copies the 
old content (prev rows) and frees the old memoryBlock. This can happen for the 
string columns with local dictionary, direct dictionary columns and global 
dictionary columns.{color}

 
{code:java}
public abstract class VarLengthColumnPageBase extends ColumnPage {
...
@Override
public void putBytes(int rowId, byte[] bytes) {
 ...
 if (rowId == 0) {
 rowOffset.putInt(0, 0);
 }
 rowOffset.putInt(rowId + 1, rowOffset.getInt(rowId) + bytes.length);
 putBytesAtRow(rowId, bytes);
 totalLength += bytes.length;
}
...
}
 
{code}
 

  was:
+*Issue-1:*+

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

{{`{color:#de350b}UnsafeFixLengthColumnPage{color}` }}with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}

In`{color:#de350b}VarLengthColumnPageBase{color}`, we have a 
`{color:#de350b}rowOffset{color}` column page 

[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-12-20 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
+*Issue-1:*+

{color:#0747a6}*Context:*{color}

For a string column with local dictionary enabled, a column page of

{{`{color:#de350b}UnsafeFixLengthColumnPage{color}` }}with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}` along with regular 
`{color:#de350b}{{actualPage}}{color}` of 
`{color:#de350b}{{UnsafeVarLengthColumnPage}}{color}`. 

We have `{color:#de350b}*{{capacity}}*{color}` field in the 
`{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}`. And this field indicates 
the capacity of  allocated

`{color:#de350b}{{memoryBlock}}{color}` for the page. 
`{{{color:#de350b}ensureMemory{color}()}}` method gets called while adding rows 
to check if  `{color:#de350b}{{totalLength + requestSize > capacity}}{color}` 
to allocate a new memoryBlock. If there is no room to add the next row, 
allocates a new block, copy the old context(prev rows) and free the old 
memoryBlock.

{color:#0747a6} *Problem:*{color}

When `{color:#de350b}{{UnsafeFixLengthColumnPage}}{color}` with datatype 
`{color:#de350b}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{color:#de350b}{{encodedPage}}{color}`, we have not assigned the 
*`{color:#de350b}{{capacity}}{color}`* field with allocated memory block size. 
Hence, for each add row to tablePage, *ensureMemory() check always fails*, 
allocates a new column page memoryBlock, copy the old context(prev rows) and 
free the old memoryBlock. This *allocation of new memoryBlock and free of old 
memoryBlock happens for each row addition* for the string columns with 
local dictionary.

 

+*Issue-2:*+

{color:#0747a6}*Context:*{color}

In`{color:#de350b}VarLengthColumnPageBase{color}`, we have a 
`{color:#de350b}rowOffset{color}` column page of  
`{color:#de350b}UnsafeFixLengthColumnPage{color}` of datatype 
`{color:#de350b}INT{color}`

to maintain the data offset to {color:#172b4d}each{color} row of variable 
length columns. This `{color:#de350b}rowOffset{color}` page is allocated to be 
the size of the page. 

{color:#0747a6} *Problem:*{color}

{color:#0747a6}{color:#172b4d}If we have 10 rows in the page, we need 11 rows 
for its rowOffset page. Because we always keep 0 as offset to 1st row. So an 
additional row is required for rowOffset page[pasted code below to show the 
reference]. Otherwise, the ensureMemory() check always fails for the last 
row (10th row in this case) of data and allocates a new rowOffset page 
memoryBlock, copies the old content (prev rows) and frees the old memoryBlock. This 
can happen for the string columns with local dictionary, direct dictionary 
columns and global dictionary columns.{color}{color}

 
{code:java}
public abstract class VarLengthColumnPageBase extends ColumnPage {
...
@Override
public void putBytes(int rowId, byte[] bytes) {
 ...
 if (rowId == 0) {
 rowOffset.putInt(0, 0);
 }
 rowOffset.putInt(rowId + 1, rowOffset.getInt(rowId) + bytes.length);
 putBytesAtRow(rowId, bytes);
 totalLength += bytes.length;
}
...
}
 
{code}
 

  was:
*Context:*

For a string column with local dictionary enabled, a column page of

{{{color:#de350b}@UnsafeFixLengthColumnPage{color} }}with datatype 
`{color:#ff8b00}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{{encodedPage}}` along with regular `{{actualPage}}` of 
`{{UnsafeVarLengthColumnPage}}`. 

We have `*{{capacity}}*` field in 

 the `{{UnsafeFixLengthColumnPage}}`. And this field indicates the capacity of  
allocated

`{{memoryBlock}}` for the page. `{{ensureMemory()}}` method is being called 
while adding rows to check if 

`{{totalLength + requestSize > capacity}}` to allocate a new memoryBlock if 
there is no room to add the next row, copy the old context(prev rows) and free 
the old memoryBlock.

 

*Issues:*
 # When `{{UnsafeFixLengthColumnPage}}` with datatype 
`{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}`, we have not 
assigned the *`{{capacity}}`* field with allocated memory block size. Hence, 
for each add row to tablePage, ensureMemory() check always fails, allocates a 
new column page memoryBlock, copy the old context(prev rows) and free the old 
memoryBlock. This allocation of new memoryBlock and free of old memoryBlock 
happens at row addition for the string columns with local dictionary enabled.
 # And in `VarLengthColumnPageBase`, we have a `rowOffset` column page of type 
`UnsafeFixLengthColumnPage` to maintain the offset to each row of variable 
length columns. This `rowOffset` page is 


> A new column page MemoryBlock is allocated at each row addition to table page 
> if having string column with local dictionary enabled. 
> -
>
>  

[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-12-20 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
*Context:*

For a string column with local dictionary enabled, a column page of

{{{color:#de350b}@UnsafeFixLengthColumnPage{color} }}with datatype 
`{color:#ff8b00}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
`{{encodedPage}}` along with regular `{{actualPage}}` of 
`{{UnsafeVarLengthColumnPage}}`. 

We have `*{{capacity}}*` field in 

 the `{{UnsafeFixLengthColumnPage}}`. And this field indicates the capacity of  
allocated

`{{memoryBlock}}` for the page. `{{ensureMemory()}}` method is being called 
while adding rows to check if 

`{{totalLength + requestSize > capacity}}` to allocate a new memoryBlock if 
there is no room to add the next row, copy the old context(prev rows) and free 
the old memoryBlock.

 

*Issues:*
 # When `{{UnsafeFixLengthColumnPage}}` with datatype 
`{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}`, we have not 
assigned the *`{{capacity}}`* field with allocated memory block size. Hence, 
for each add row to tablePage, ensureMemory() check always fails, allocates a 
new column page memoryBlock, copy the old context(prev rows) and free the old 
memoryBlock. This allocation of new memoryBlock and free of old memoryBlock 
happens at row addition for the string columns with local dictionary enabled.
 # And in `VarLengthColumnPageBase`, we have a `rowOffset` column page of type 
`UnsafeFixLengthColumnPage` to maintain the offset to each row of variable 
length columns. This `rowOffset` page is 

  was:
*Context:*

For a string column with local dictionary enabled, a column page of

{{{color:#de350b}UnsafeFixLengthColumnPage{color} }}with datatype 
`{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}` along with regular 
`{{actualPage}}` of `{{UnsafeVarLengthColumnPage}}`. 

We have `*{{capacity}}*` field in 

 the `{{UnsafeFixLengthColumnPage}}`. And this field indicates the capacity of  
allocated

`{{memoryBlock}}` for the page. `{{ensureMemory()}}` method is being called 
while adding rows to check if 

`{{totalLength + requestSize > capacity}}` to allocate a new memoryBlock if 
there is no room to add the next row, copy the old context(prev rows) and free 
the old memoryBlock.

 

*Issues:*
 # When `{{UnsafeFixLengthColumnPage}}` with datatype 
`{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}`, we have not 
assigned the *`{{capacity}}`* field with allocated memory block size. Hence, 
for each add row to tablePage, ensureMemory() check always fails, allocates a 
new column page memoryBlock, copy the old context(prev rows) and free the old 
memoryBlock. This allocation of new memoryBlock and free of old memoryBlock 
happens at row addition for the string columns with local dictionary enabled.
 # And in `VarLengthColumnPageBase`, we have a `rowOffset` column page of type 
`UnsafeFixLengthColumnPage` to maintain the offset to each row of variable 
length columns. This `rowOffset` page is 


> A new column page MemoryBlock is allocated at each row addition to table page 
> if having string column with local dictionary enabled. 
> -
>
> Key: CARBONDATA-3519
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3519
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Reporter: Venugopal Reddy K
>Priority: Minor
>
> *Context:*
> For a string column with local dictionary enabled, a column page of
> {{{color:#de350b}@UnsafeFixLengthColumnPage{color} }}with datatype 
> `{color:#ff8b00}{{DataTypes.BYTE_ARRAY}}{color}` is created for 
> `{{encodedPage}}` along with regular `{{actualPage}}` of 
> `{{UnsafeVarLengthColumnPage}}`. 
> We have `*{{capacity}}*` field in 
>  the `{{UnsafeFixLengthColumnPage}}`. And this field indicates the capacity 
> of  allocated
> `{{memoryBlock}}` for the page. `{{ensureMemory()}}` method is being called 
> while adding rows to check if 
> `{{totalLength + requestSize > capacity}}` to allocate a new memoryBlock if 
> there is no room to add the next row, copy the old context(prev rows) and 
> free the old memoryBlock.
>  
> *Issues:*
>  # When `{{UnsafeFixLengthColumnPage}}` with datatype 
> `{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}`, we have not 
> assigned the *`{{capacity}}`* field with allocated memory block size. Hence, 
> for each add row to tablePage, ensureMemory() check always fails, allocates a 
> new column page memoryBlock, copy the old context(prev rows) and free the old 
> memoryBlock. This allocation of new memoryBlock and free of old memoryBlock 
> happens at row addition for the string columns with local 

[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-12-20 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
*Context:*

For a string column with local dictionary enabled, a column page of

{{{color:#de350b}UnsafeFixLengthColumnPage{color} }}with datatype 
`{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}` along with regular 
`{{actualPage}}` of `{{UnsafeVarLengthColumnPage}}`. 

We have `*{{capacity}}*` field in 

 the `{{UnsafeFixLengthColumnPage}}`. And this field indicates the capacity of  
allocated

`{{memoryBlock}}` for the page. `{{ensureMemory()}}` method is being called 
while adding rows to check if 

`{{totalLength + requestSize > capacity}}` to allocate a new memoryBlock if 
there is no room to add the next row, copy the old context(prev rows) and free 
the old memoryBlock.

 

*Issues:*
 # When `{{UnsafeFixLengthColumnPage}}` with datatype 
`{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}`, we have not 
assigned the *`{{capacity}}`* field with allocated memory block size. Hence, 
for each add row to tablePage, ensureMemory() check always fails, allocates a 
new column page memoryBlock, copy the old context(prev rows) and free the old 
memoryBlock. This allocation of new memoryBlock and free of old memoryBlock 
happens at row addition for the string columns with local dictionary enabled.
 # And in `VarLengthColumnPageBase`, we have a `rowOffset` column page of type 
`UnsafeFixLengthColumnPage` to maintain the offset to each row of variable 
length columns. This `rowOffset` page is 

  was:
*Context:*

For a string column with local dictionary enabled, a column page of

`{{UnsafeFixLengthColumnPage}}` with datatype `{{DataTypes.BYTE_ARRAY}}` is 
created for `{{encodedPage}}` along with regular `{{actualPage}}` of 
`{{UnsafeVarLengthColumnPage}}`. 

We have `*{{capacity}}*` field in 

 the `{{UnsafeFixLengthColumnPage}}`. And this field indicates the capacity of  
allocated

`{{memoryBlock}}` for the page. `{{ensureMemory()}}` method is being called 
while adding rows to check if 

`{{totalLength + requestSize > capacity}}` to allocate a new memoryBlock if 
there is no room to add the next row, copy the old context(prev rows) and free 
the old memoryBlock.

 

*Issues:*
 # While, `{{UnsafeFixLengthColumnPage}}` with with datatype 
`{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}`, we have not 
assigned the *`{{capacity}}`* field with allocated memory block size. Hence, 
for each add row to tablePage, ensureMemory() check always fails, allocates a 
new column page memoryBlock, copy the old context(prev rows) and free the old 
memoryBlock. This allocation of new memoryBlock and free of old memoryBlock 
happens at row addition for the string columns with local dictionary enabled.
 # And in `VarLengthColumnPageBase`, we have a `rowOffset` column page of type 
`UnsafeFixLengthColumnPage` to maintain the offset to each row of variable 
length columns. This `rowOffset` page is 


> A new column page MemoryBlock is allocated at each row addition to table page 
> if having string column with local dictionary enabled. 
> -
>
> Key: CARBONDATA-3519
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3519
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Reporter: Venugopal Reddy K
>Priority: Minor
>
> *Context:*
> For a string column with local dictionary enabled, a column page of
> `{{UnsafeFixLengthColumnPage}}` with datatype `{{DataTypes.BYTE_ARRAY}}` is 
> created for `{{encodedPage}}` along with regular `{{actualPage}}` of 
> `{{UnsafeVarLengthColumnPage}}`. 
> We have `*{{capacity}}*` field in 
>  the `{{UnsafeFixLengthColumnPage}}`. And this field indicates the capacity 
> of  allocated
> `{{memoryBlock}}` for the page. `{{ensureMemory()}}` method is being called 
> while adding rows to check if 
> `{{totalLength + requestSize > capacity}}` to allocate a new memoryBlock if 
> there is no room to add the next row, copy the old context(prev rows) and 
> free the old memoryBlock.
>  
> *Issues:*
>  # While, `{{UnsafeFixLengthColumnPage}}` with with datatype 
> `{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}`, we have not 
> assigned the *`{{capacity}}`* field with allocated memory block size. Hence, 
> for each add row to tablePage, ensureMemory() check always fails, allocates a 
> new column page memoryBlock, copy the old context(prev rows) and free the old 
> memoryBlock. This allocation of new memoryBlock and free of old memoryBlock 
> happens at row addition for the string columns with local dictionary enabled.
>  # And in `VarLengthColumnPageBase`, we have a 

[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-12-20 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
*Context:*

For a string column with local dictionary enabled, a column page of 
`{{UnsafeFixLengthColumnPage}}` with datatype `{{DataTypes.BYTE_ARRAY}}` is 
created for `{{encodedPage}}`, along with the regular `{{actualPage}}` of 
`{{UnsafeVarLengthColumnPage}}`.

The `{{UnsafeFixLengthColumnPage}}` has a `*{{capacity}}*` field, which indicates 
the capacity of the `{{memoryBlock}}` allocated for the page. The 
`{{ensureMemory()}}` method is called while adding rows to check whether 
`{{totalLength + requestSize > capacity}}`; if there is no room to add the next 
row, it allocates a new memoryBlock, copies the old contents (previous rows) and 
frees the old memoryBlock.

 

*Issues:*
 # When the `{{UnsafeFixLengthColumnPage}}` with datatype 
`{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}`, the *`{{capacity}}`* 
field is not assigned the allocated memory block size. Hence, for each row added 
to the tablePage, the ensureMemory() check always fails, which allocates a new 
column page memoryBlock, copies the old contents (previous rows) and frees the 
old memoryBlock. This allocation of a new memoryBlock and freeing of the old one 
happens at every row addition for string columns with local dictionary enabled.
 # In `VarLengthColumnPageBase`, we have a `rowOffset` column page of type 
`UnsafeFixLengthColumnPage` to maintain the offset of each row of variable-length 
columns. This `rowOffset` page is 

  was:
*Context:*

For a string column with local dictionary enabled, a column page of

`{{UnsafeFixLengthColumnPage}}` with datatype `{{DataTypes.BYTE_ARRAY}}` is 
created for `{{encodedPage}}` along with regular `{{actualPage}}` of 
`{{UnsafeVarLengthColumnPage}}`. 

We have `{{capacity}}` field in 

 the `{{UnsafeFixLengthColumnPage}}`. And this field indicates the capacity of  
allocated

`{{memoryBlock}}` for the page. `{{ensureMemory()}}` method is being called 
while adding rows to check if 

`{{totalLength + requestSize > capacity}}` to allocate a new memoryBlock if 
there is no room to add the next row, copy the old context(prev rows) and free 
the old memoryBlock.

 

*Issue:*

While, `{{UnsafeFixLengthColumnPage}}` with with datatype 
`{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}`, we have not 
assigned the `{{capacity}}` field with allocated memory block size. Hence, when 
we add a row to tablePage, ensureMemory() check always fails, allocates a new 
column page memoryBlock, copy the old context(prev rows) and free the old 
memoryBlock. This allocation of new memoryBlock and free of old memoryBlock 
happens at row addition for the string columns with local dictionary enabled.


> A new column page MemoryBlock is allocated at each row addition to table page 
> if having string column with local dictionary enabled. 
> -
>
> Key: CARBONDATA-3519
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3519
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Reporter: Venugopal Reddy K
>Priority: Minor
>
> *Context:*
> For a string column with local dictionary enabled, a column page of
> `{{UnsafeFixLengthColumnPage}}` with datatype `{{DataTypes.BYTE_ARRAY}}` is 
> created for `{{encodedPage}}` along with regular `{{actualPage}}` of 
> `{{UnsafeVarLengthColumnPage}}`. 
> We have `*{{capacity}}*` field in 
>  the `{{UnsafeFixLengthColumnPage}}`. And this field indicates the capacity 
> of  allocated
> `{{memoryBlock}}` for the page. `{{ensureMemory()}}` method is being called 
> while adding rows to check if 
> `{{totalLength + requestSize > capacity}}` to allocate a new memoryBlock if 
> there is no room to add the next row, copy the old context(prev rows) and 
> free the old memoryBlock.
>  
> *Issues:*
>  # While, `{{UnsafeFixLengthColumnPage}}` with with datatype 
> `{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}`, we have not 
> assigned the *`{{capacity}}`* field with allocated memory block size. Hence, 
> for each add row to tablePage, ensureMemory() check always fails, allocates a 
> new column page memoryBlock, copy the old context(prev rows) and free the old 
> memoryBlock. This allocation of new memoryBlock and free of old memoryBlock 
> happens at row addition for the string columns with local dictionary enabled.
>  # And in `VarLengthColumnPageBase`, we have a `rowOffset` column page of 
> type `UnsafeFixLengthColumnPage` to maintain the offset to each row of 
> variable length columns. This `rowOffset` page is 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2019-11-29 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf, 
> Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 27h
>  Remaining Estimate: 0h
>
> In general, database may contain geographical location data. For instance, 
> Telecom operators require to perform analytics based on a particular region, 
> cell tower IDs(within a region) and/or may include geographical locations for 
> a particular period of time. At present, Carbon do not have native support to 
> store geographical locations/coordinates and to do filter queries based on 
> them. Yet, longitude and latitude of coordinates can be treated as 
> independent columns, sort hierarchically and store them.
>          But, when longitude and latitude are treated independently, 2D space 
> is linearized i.e., points in the two dimensional domain are ordered by 
> sorting first on longitide and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require lot of IO operations and 
> query performance is degraded.
>         To alleviate it, we can use z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present at same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. Also can support polygon queries of geodata. Attached 
> design document describes in detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2019-11-29 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: (was: Geospatial Index Design Doc-OpenSource-Version 
2.0.pdf)

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf, 
> Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 27h
>  Remaining Estimate: 0h
>
> In general, database may contain geographical location data. For instance, 
> Telecom operators require to perform analytics based on a particular region, 
> cell tower IDs(within a region) and/or may include geographical locations for 
> a particular period of time. At present, Carbon do not have native support to 
> store geographical locations/coordinates and to do filter queries based on 
> them. Yet, longitude and latitude of coordinates can be treated as 
> independent columns, sort hierarchically and store them.
>          But, when longitude and latitude are treated independently, 2D space 
> is linearized i.e., points in the two dimensional domain are ordered by 
> sorting first on longitide and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require lot of IO operations and 
> query performance is degraded.
>         To alleviate it, we can use z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present at same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. Also can support polygon queries of geodata. Attached 
> design document describes in detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2019-11-29 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf, 
> Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 27h
>  Remaining Estimate: 0h
>
> In general, database may contain geographical location data. For instance, 
> Telecom operators require to perform analytics based on a particular region, 
> cell tower IDs(within a region) and/or may include geographical locations for 
> a particular period of time. At present, Carbon do not have native support to 
> store geographical locations/coordinates and to do filter queries based on 
> them. Yet, longitude and latitude of coordinates can be treated as 
> independent columns, sort hierarchically and store them.
>          But, when longitude and latitude are treated independently, 2D space 
> is linearized i.e., points in the two dimensional domain are ordered by 
> sorting first on longitide and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require lot of IO operations and 
> query performance is degraded.
>         To alleviate it, we can use z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present at same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. Also can support polygon queries of geodata. Attached 
> design document describes in detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2019-11-29 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: (was: Geospatial Index Design Doc-OpenSource-Version 
2.0.pdf)

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 27h
>  Remaining Estimate: 0h
>
> In general, database may contain geographical location data. For instance, 
> Telecom operators require to perform analytics based on a particular region, 
> cell tower IDs(within a region) and/or may include geographical locations for 
> a particular period of time. At present, Carbon do not have native support to 
> store geographical locations/coordinates and to do filter queries based on 
> them. Yet, longitude and latitude of coordinates can be treated as 
> independent columns, sort hierarchically and store them.
>          But, when longitude and latitude are treated independently, 2D space 
> is linearized i.e., points in the two dimensional domain are ordered by 
> sorting first on longitide and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require lot of IO operations and 
> query performance is degraded.
>         To alleviate it, we can use z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present at same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. Also can support polygon queries of geodata. Attached 
> design document describes in detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2019-11-26 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Attachment: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource-Version 2.0.pdf, 
> Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 22h 40m
>  Remaining Estimate: 0h
>
> In general, database may contain geographical location data. For instance, 
> Telecom operators require to perform analytics based on a particular region, 
> cell tower IDs(within a region) and/or may include geographical locations for 
> a particular period of time. At present, Carbon do not have native support to 
> store geographical locations/coordinates and to do filter queries based on 
> them. Yet, longitude and latitude of coordinates can be treated as 
> independent columns, sort hierarchically and store them.
>          But, when longitude and latitude are treated independently, 2D space 
> is linearized i.e., points in the two dimensional domain are ordered by 
> sorting first on longitide and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require lot of IO operations and 
> query performance is degraded.
>         To alleviate it, we can use z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present at same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. Also can support polygon queries of geodata. Attached 
> design document describes in detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CARBONDATA-3548) Support for Geospatial indexing

2019-11-26 Thread Venugopal Reddy K (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982525#comment-16982525
 ] 

Venugopal Reddy K commented on CARBONDATA-3548:
---

Yes. It is at polygon query.

> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource.pdf
>
>  Time Spent: 22h 20m
>  Remaining Estimate: 0h
>
> In general, database may contain geographical location data. For instance, 
> Telecom operators require to perform analytics based on a particular region, 
> cell tower IDs(within a region) and/or may include geographical locations for 
> a particular period of time. At present, Carbon do not have native support to 
> store geographical locations/coordinates and to do filter queries based on 
> them. Yet, longitude and latitude of coordinates can be treated as 
> independent columns, sort hierarchically and store them.
>          But, when longitude and latitude are treated independently, 2D space 
> is linearized i.e., points in the two dimensional domain are ordered by 
> sorting first on longitide and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require lot of IO operations and 
> query performance is degraded.
>         To alleviate it, we can use z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present at same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. Also can support polygon queries of geodata. Attached 
> design document describes in detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3548) Support for Geospatial indexing

2019-10-15 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3548:
--
Description: 
In general, a database may contain geographical location data. For instance, 
telecom operators need to perform analytics based on a particular region or on 
cell tower IDs (within a region), possibly restricted to geographical locations 
for a particular period of time. At present, Carbon does not have native support 
to store geographical locations/coordinates and to run filter queries based on 
them. The longitude and latitude of a coordinate can still be treated as 
independent columns, sorted hierarchically and stored.

But when longitude and latitude are treated independently, the 2D space is 
linearized, i.e., points in the two-dimensional domain are ordered by sorting 
first on longitude and then on latitude. Thus, data is not ordered by geospatial 
proximity. Hence range queries require a lot of IO operations and query 
performance degrades.

To alleviate this, we can use a z-order curve to store geospatial data points. 
This ensures that geographically nearer points are present in the same 
block/blocklet, which reduces the IO operations for range queries and improves 
query performance. It also enables polygon queries on geodata. The attached 
design document describes this in detail.
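
As a rough illustration of the z-order idea only (the encoding actually used by 
CarbonData is described in the attached design document; the quantization and 
bit width below are arbitrary assumptions for this example), interleaving the 
bits of quantized longitude and latitude yields a single sortable value that 
keeps spatially nearby points close together in storage:

{code:java}
// Rough sketch of z-order (Morton) encoding of a coordinate. Sorting rows by
// this value places geographically nearby points in the same or nearby
// blocks/blocklets.
public final class ZOrderSketch {

  // Spread the lower 20 bits of v so that one empty bit sits between each bit.
  private static long spreadBits(long v) {
    v &= 0xFFFFFL;                                  // keep 20 bits
    v = (v | (v << 16)) & 0x0000FFFF0000FFFFL;
    v = (v | (v << 8))  & 0x00FF00FF00FF00FFL;
    v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0FL;
    v = (v | (v << 2))  & 0x3333333333333333L;
    v = (v | (v << 1))  & 0x5555555555555555L;
    return v;
  }

  // Quantize longitude/latitude to a 20-bit grid and interleave the bits.
  public static long mortonCode(double longitude, double latitude) {
    long x = (long) ((longitude + 180.0) / 360.0 * ((1 << 20) - 1));
    long y = (long) ((latitude + 90.0) / 180.0 * ((1 << 20) - 1));
    return (spreadBits(x) << 1) | spreadBits(y);
  }

  public static void main(String[] args) {
    // Two nearby points produce close codes; a distant point does not.
    System.out.println(Long.toHexString(mortonCode(77.5946, 12.9716)));
    System.out.println(Long.toHexString(mortonCode(77.6101, 12.9344)));
    System.out.println(Long.toHexString(mortonCode(2.3522, 48.8566)));
  }
}
{code}

Because the code is monotonic in each coordinate, a rectangular (or polygon 
bounding-box) query can be answered coarsely by scanning codes between those of 
the box's lower-left and upper-right corners and post-filtering the exact 
condition; whether the design document uses this exact decomposition is not 
stated here.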

  was:
In general, database may contain geographical location data. For instance, 
Telecom operators require to perform analytics based on a particular region, 
cell tower IDs(within a region) and/or may include geographical locations for a 
particular period of time. At present, Carbon do not have native support to 
store geographical locations/coordinates and to do filter queries based on 
them. Yet, longitude and latitude of coordinates can be treated as independent 
columns, sort hierarchically and store them.

         But, when longitude and latitude are treated independently, 2D space 
is linearized i.e., points in the two dimensional domain are ordered by sorting 
first on longitide and then on latitude. Thus, data is not ordered by 
geospatial proximity. Hence range queries require lot of IO operations and 
query performance is degraded.

        To alleviate it, we can use z-order curve to store geospatial data 
points. This ensures that geographically nearer points are present at same 
block/blocklet. This reduces the IO operations for range queries and improves 
query performance.


> Support for Geospatial indexing
> ---
>
> Key: CARBONDATA-3548
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Venugopal Reddy K
>Priority: Major
> Attachments: Geospatial Index Design Doc-OpenSource.pdf
>
>
> In general, database may contain geographical location data. For instance, 
> Telecom operators require to perform analytics based on a particular region, 
> cell tower IDs(within a region) and/or may include geographical locations for 
> a particular period of time. At present, Carbon do not have native support to 
> store geographical locations/coordinates and to do filter queries based on 
> them. Yet, longitude and latitude of coordinates can be treated as 
> independent columns, sort hierarchically and store them.
>          But, when longitude and latitude are treated independently, 2D space 
> is linearized i.e., points in the two dimensional domain are ordered by 
> sorting first on longitide and then on latitude. Thus, data is not ordered by 
> geospatial proximity. Hence range queries require lot of IO operations and 
> query performance is degraded.
>         To alleviate it, we can use z-order curve to store geospatial data 
> points. This ensures that geographically nearer points are present at same 
> block/blocklet. This reduces the IO operations for range queries and improves 
> query performance. Also can support polygon queries of geodata. Attached 
> design document describes in detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3548) Support for Geospatial indexing

2019-10-15 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-3548:
-

 Summary: Support for Geospatial indexing
 Key: CARBONDATA-3548
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3548
 Project: CarbonData
  Issue Type: New Feature
Reporter: Venugopal Reddy K
 Attachments: Geospatial Index Design Doc-OpenSource.pdf

In general, a database may contain geographical location data. For instance, 
telecom operators need to perform analytics based on a particular region or on 
cell tower IDs (within a region), possibly restricted to geographical locations 
for a particular period of time. At present, Carbon does not have native support 
to store geographical locations/coordinates and to run filter queries based on 
them. The longitude and latitude of a coordinate can still be treated as 
independent columns, sorted hierarchically and stored.

But when longitude and latitude are treated independently, the 2D space is 
linearized, i.e., points in the two-dimensional domain are ordered by sorting 
first on longitude and then on latitude. Thus, data is not ordered by geospatial 
proximity. Hence range queries require a lot of IO operations and query 
performance degrades.

To alleviate this, we can use a z-order curve to store geospatial data points. 
This ensures that geographically nearer points are present in the same 
block/blocklet, which reduces the IO operations for range queries and improves 
query performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-09-15 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
*Context:*

For a string column with local dictionary enabled, a column page of 
`{{UnsafeFixLengthColumnPage}}` with datatype `{{DataTypes.BYTE_ARRAY}}` is 
created for `{{encodedPage}}`, along with the regular `{{actualPage}}` of 
`{{UnsafeVarLengthColumnPage}}`.

The `{{UnsafeFixLengthColumnPage}}` has a `{{capacity}}` field, which indicates 
the capacity of the `{{memoryBlock}}` allocated for the page. The 
`{{ensureMemory()}}` method is called while adding rows to check whether 
`{{totalLength + requestSize > capacity}}`; if there is no room to add the next 
row, it allocates a new memoryBlock, copies the old contents (previous rows) and 
frees the old memoryBlock.

 

*Issue:*

When the `{{UnsafeFixLengthColumnPage}}` with datatype `{{DataTypes.BYTE_ARRAY}}` 
is created for `{{encodedPage}}`, the `{{capacity}}` field is not assigned the 
allocated memory block size. Hence, when we add a row to the tablePage, the 
ensureMemory() check always fails, allocates a new column page memoryBlock, 
copies the old contents (previous rows) and frees the old memoryBlock. This 
allocation of a new memoryBlock and freeing of the old memoryBlock happens at 
every row addition for string columns with local dictionary enabled.
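
A minimal sketch of the kind of fix this implies, assuming a page whose memory 
block is allocated up front (the class and method names below are illustrative, 
not CarbonData's actual code):

{code:java}
// Illustrative only: the point is that 'capacity' must be set from the size of
// the allocated block. If it stays 0, ensureMemory() sees the page as full on
// every row and reallocates each time.
public class FixLengthBytePageSketch {
  private byte[] memoryBlock;
  private int capacity;      // must reflect the allocated block size
  private int totalLength;

  public FixLengthBytePageSketch(int pageSizeInBytes) {
    this.memoryBlock = new byte[pageSizeInBytes];
    this.capacity = pageSizeInBytes;   // the missing assignment described above
  }

  private void ensureMemory(int requestSize) {
    if (totalLength + requestSize > capacity) {
      int newCapacity = Math.max(capacity * 2, totalLength + requestSize);
      memoryBlock = java.util.Arrays.copyOf(memoryBlock, newCapacity);
      capacity = newCapacity;
    }
  }

  public void addRow(byte[] row) {
    ensureMemory(row.length);
    System.arraycopy(row, 0, memoryBlock, totalLength, row.length);
    totalLength += row.length;
  }
}
{code}

With `{{capacity}}` assigned in the constructor, ensureMemory() grows the block 
only when a row genuinely does not fit, instead of on every row addition.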

  was:
*Context:*

For a string column with local dictionary enabled, a column page of

`{{UnsafeFixLengthColumnPage}}` with datatype `DataTypes.BYTE_ARRAY` is created 
for `encodedPage` along with regular `actualPage` of 
`UnsafeVarLengthColumnPage`. 

We have `capacity` field in 

 the `UnsafeFixLengthColumnPage`. And this field indicates the capacity of  
allocated

`memoryBlock` for the page. `ensureMemory()` method is being called while 
adding rows to check if 

`totalLength + requestSize > capacity` to allocate a new memoryBlock if there 
is no room to add the next row, copy the old context(prev rows) and free the 
old memoryBlock.

 

*Issue:*

While, UnsafeFixLengthColumnPage with with datatype `DataTypes.BYTE_ARRAY` is 
created for `encodedPage`, we have not assigned the `capacity` field with 
allocated memory block size. Hence, when we add a row to tablePage, 
ensureMemory() check fails, allocates a new column page memoryBlock, copy the 
old context(prev rows) and free the old memoryBlock. This allocation of new 
memoryBlock and free of old memoryBlock happens at row addition for the string 
columns with local dictionary enabled.


> A new column page MemoryBlock is allocated at each row addition to table page 
> if having string column with local dictionary enabled. 
> -
>
> Key: CARBONDATA-3519
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3519
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Reporter: Venugopal Reddy K
>Priority: Minor
>
> *Context:*
> For a string column with local dictionary enabled, a column page of
> `{{UnsafeFixLengthColumnPage}}` with datatype `{{DataTypes.BYTE_ARRAY}}` is 
> created for `{{encodedPage}}` along with regular `{{actualPage}}` of 
> `{{UnsafeVarLengthColumnPage}}`. 
> We have `{{capacity}}` field in 
>  the `{{UnsafeFixLengthColumnPage}}`. And this field indicates the capacity 
> of  allocated
> `{{memoryBlock}}` for the page. `{{ensureMemory()}}` method is being called 
> while adding rows to check if 
> `{{totalLength + requestSize > capacity}}` to allocate a new memoryBlock if 
> there is no room to add the next row, copy the old context(prev rows) and 
> free the old memoryBlock.
>  
> *Issue:*
> While, `{{UnsafeFixLengthColumnPage}}` with with datatype 
> `{{DataTypes.BYTE_ARRAY}}` is created for `{{encodedPage}}`, we have not 
> assigned the `{{capacity}}` field with allocated memory block size. Hence, 
> when we add a row to tablePage, ensureMemory() check always fails, allocates 
> a new column page memoryBlock, copy the old context(prev rows) and free the 
> old memoryBlock. This allocation of new memoryBlock and free of old 
> memoryBlock happens at row addition for the string columns with local 
> dictionary enabled.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-09-15 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
*Context:*

For a string column with local dictionary enabled, a column page of

`{{UnsafeFixLengthColumnPage}}` with datatype `DataTypes.BYTE_ARRAY` is created 
for `encodedPage` along with regular `actualPage` of 
`UnsafeVarLengthColumnPage`. 

We have `capacity` field in 

 the `UnsafeFixLengthColumnPage`. And this field indicates the capacity of  
allocated

`memoryBlock` for the page. `ensureMemory()` method is being called while 
adding rows to check if 

`totalLength + requestSize > capacity` to allocate a new memoryBlock if there 
is no room to add the next row, copy the old context(prev rows) and free the 
old memoryBlock.

 

*Issue:*

While, UnsafeFixLengthColumnPage with with datatype `DataTypes.BYTE_ARRAY` is 
created for `encodedPage`, we have not assigned the `capacity` field with 
allocated memory block size. Hence, when we add a row to tablePage, 
ensureMemory() check fails, allocates a new column page memoryBlock, copy the 
old context(prev rows) and free the old memoryBlock. This allocation of new 
memoryBlock and free of old memoryBlock happens at row addition for the string 
columns with local dictionary enabled.

  was:
*Context:*

For a string column with local dictionary enabled, a column page of

??`UnsafeFixLengthColumnPage`?? with datatype `DataTypes.BYTE_ARRAY` is created 
for `encodedPage` along with regular `actualPage` of 
`UnsafeVarLengthColumnPage`. 

We have `capacity` field in 

 the `UnsafeFixLengthColumnPage`. And this field indicates the capacity of  
allocated

`memoryBlock` for the page. `ensureMemory()` method is being called while 
adding rows to check if 

`totalLength + requestSize > capacity` to allocate a new memoryBlock if there 
is no room to add the next row, copy the old context(prev rows) and free the 
old memoryBlock.

 

*Issue:*

While, UnsafeFixLengthColumnPage with with datatype `DataTypes.BYTE_ARRAY` is 
created for `encodedPage`, we have not assigned the `capacity` field with 
allocated memory block size. Hence, when we add a row to tablePage, 
ensureMemory() check fails, allocates a new column page memoryBlock, copy the 
old context(prev rows) and free the old memoryBlock. This allocation of new 
memoryBlock and free of old memoryBlock happens at row addition for the string 
columns with local dictionary enabled.


> A new column page MemoryBlock is allocated at each row addition to table page 
> if having string column with local dictionary enabled. 
> -
>
> Key: CARBONDATA-3519
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3519
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Reporter: Venugopal Reddy K
>Priority: Minor
>
> *Context:*
> For a string column with local dictionary enabled, a column page of
> `{{UnsafeFixLengthColumnPage}}` with datatype `DataTypes.BYTE_ARRAY` is 
> created for `encodedPage` along with regular `actualPage` of 
> `UnsafeVarLengthColumnPage`. 
> We have `capacity` field in 
>  the `UnsafeFixLengthColumnPage`. And this field indicates the capacity of  
> allocated
> `memoryBlock` for the page. `ensureMemory()` method is being called while 
> adding rows to check if 
> `totalLength + requestSize > capacity` to allocate a new memoryBlock if there 
> is no room to add the next row, copy the old context(prev rows) and free the 
> old memoryBlock.
>  
> *Issue:*
> While, UnsafeFixLengthColumnPage with with datatype `DataTypes.BYTE_ARRAY` is 
> created for `encodedPage`, we have not assigned the `capacity` field with 
> allocated memory block size. Hence, when we add a row to tablePage, 
> ensureMemory() check fails, allocates a new column page memoryBlock, copy the 
> old context(prev rows) and free the old memoryBlock. This allocation of new 
> memoryBlock and free of old memoryBlock happens at row addition for the 
> string columns with local dictionary enabled.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-09-15 Thread Venugopal Reddy K (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venugopal Reddy K updated CARBONDATA-3519:
--
Description: 
*Context:*

For a string column with local dictionary enabled, a column page of

??`UnsafeFixLengthColumnPage`?? with datatype `DataTypes.BYTE_ARRAY` is created 
for `encodedPage` along with regular `actualPage` of 
`UnsafeVarLengthColumnPage`. 

We have `capacity` field in 

 the `UnsafeFixLengthColumnPage`. And this field indicates the capacity of  
allocated

`memoryBlock` for the page. `ensureMemory()` method is being called while 
adding rows to check if 

`totalLength + requestSize > capacity` to allocate a new memoryBlock if there 
is no room to add the next row, copy the old context(prev rows) and free the 
old memoryBlock.

 

*Issue:*

While, UnsafeFixLengthColumnPage with with datatype `DataTypes.BYTE_ARRAY` is 
created for `encodedPage`, we have not assigned the `capacity` field with 
allocated memory block size. Hence, when we add a row to tablePage, 
ensureMemory() check fails, allocates a new column page memoryBlock, copy the 
old context(prev rows) and free the old memoryBlock. This allocation of new 
memoryBlock and free of old memoryBlock happens at row addition for the string 
columns with local dictionary enabled.

  was:
*Context:*

For a string column with local dictionary enabled, a column page of

`UnsafeFixLengthColumnPage` with datatype `DataTypes.BYTE_ARRAY` is created for 
`encodedPage` along with regular `actualPage` of `UnsafeVarLengthColumnPage`. 

We have `capacity` field in 

 the `UnsafeFixLengthColumnPage`. And this field indicates the capacity of  
allocated

`memoryBlock` for the page. `ensureMemory()` method is being called while 
adding rows to check if 

`totalLength + requestSize > capacity` to allocate a new memoryBlock if there 
is no room to add the next row, copy the old context(prev rows) and free the 
old memoryBlock.

 

*Issue:*

While, UnsafeFixLengthColumnPage with with datatype `DataTypes.BYTE_ARRAY` is 
created for `encodedPage`, we have not assigned the `capacity` field with 
allocated memory block size. Hence, when we add a row to tablePage, 
ensureMemory() check fails, allocates a new column page memoryBlock, copy the 
old context(prev rows) and free the old memoryBlock. This allocation of new 
memoryBlock and free of old memoryBlock happens at row addition for the string 
columns with local dictionary enabled.


> A new column page MemoryBlock is allocated at each row addition to table page 
> if having string column with local dictionary enabled. 
> -
>
> Key: CARBONDATA-3519
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3519
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Reporter: Venugopal Reddy K
>Priority: Minor
>
> *Context:*
> For a string column with local dictionary enabled, a column page of
> ??`UnsafeFixLengthColumnPage`?? with datatype `DataTypes.BYTE_ARRAY` is 
> created for `encodedPage` along with regular `actualPage` of 
> `UnsafeVarLengthColumnPage`. 
> We have `capacity` field in 
>  the `UnsafeFixLengthColumnPage`. And this field indicates the capacity of  
> allocated
> `memoryBlock` for the page. `ensureMemory()` method is being called while 
> adding rows to check if 
> `totalLength + requestSize > capacity` to allocate a new memoryBlock if there 
> is no room to add the next row, copy the old context(prev rows) and free the 
> old memoryBlock.
>  
> *Issue:*
> While, UnsafeFixLengthColumnPage with with datatype `DataTypes.BYTE_ARRAY` is 
> created for `encodedPage`, we have not assigned the `capacity` field with 
> allocated memory block size. Hence, when we add a row to tablePage, 
> ensureMemory() check fails, allocates a new column page memoryBlock, copy the 
> old context(prev rows) and free the old memoryBlock. This allocation of new 
> memoryBlock and free of old memoryBlock happens at row addition for the 
> string columns with local dictionary enabled.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (CARBONDATA-3519) A new column page MemoryBlock is allocated at each row addition to table page if having string column with local dictionary enabled.

2019-09-15 Thread Venugopal Reddy K (Jira)
Venugopal Reddy K created CARBONDATA-3519:
-

 Summary: A new column page MemoryBlock is allocated at each row 
addition to table page if having string column with local dictionary enabled. 
 Key: CARBONDATA-3519
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3519
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Reporter: Venugopal Reddy K


*Context:*

For a string column with local dictionary enabled, a column page of 
`UnsafeFixLengthColumnPage` with datatype `DataTypes.BYTE_ARRAY` is created for 
`encodedPage`, along with the regular `actualPage` of `UnsafeVarLengthColumnPage`.

The `UnsafeFixLengthColumnPage` has a `capacity` field, which indicates the 
capacity of the `memoryBlock` allocated for the page. The `ensureMemory()` method 
is called while adding rows to check whether `totalLength + requestSize > capacity`; 
if there is no room to add the next row, it allocates a new memoryBlock, copies 
the old contents (previous rows) and frees the old memoryBlock.

 

*Issue:*

When the `UnsafeFixLengthColumnPage` with datatype `DataTypes.BYTE_ARRAY` is 
created for `encodedPage`, the `capacity` field is not assigned the allocated 
memory block size. Hence, when we add a row to the tablePage, the ensureMemory() 
check fails, allocates a new column page memoryBlock, copies the old contents 
(previous rows) and frees the old memoryBlock. This allocation of a new 
memoryBlock and freeing of the old memoryBlock happens at every row addition for 
string columns with local dictionary enabled.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)