Hi Community,

While working with the above problem, I found two discussions regarding the sort columns:
1. https://github.com/apache/carbondata/pull/635, which states: if the table needs to be sorted by a measure, we should use DICTIONARY_INCLUDE to add it to the dimension list.
2. https://github.com/apache/carbondata/pull/757, which states: if a column in SORT_COLUMNS was a measure before, this column will now be created as a dimension, and this dimension is a no-dictionary column (better to use another direct-dictionary column).

So if the columns in my SORT_COLUMNS are measures, I have to add the same columns to DICTIONARY_INCLUDE; otherwise, when a SORT_COLUMNS column contains a null value, the load fails at the first null it encounters. For example, with a table like this:

CREATE TABLE test_sort_col (
  id INT,
  name STRING,
  age INT
) STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES('SORT_COLUMNS'='id,age', 'DICTIONARY_INCLUDE'='id,age')

and a CSV with the following data:

1,Pallavi,25
2,Geetika,24
3,Prabhat,twenty six
7,Neha,25
2,Geetika,22
3,Sangeeta,26

the load succeeds, as shown below (note that "twenty six" cannot be parsed as an INT, so the age for Prabhat is loaded as null):

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| Pallavi|  25|
|  2| Geetika|  22|
|  2| Geetika|  24|
|  3| Prabhat|null|
|  3|Sangeeta|  26|
|  7|    Neha|  25|
+---+--------+----+

Now, if I remove the SORT_COLUMNS measures from DICTIONARY_INCLUDE in the CREATE TABLE statement, I get an error and only partial data gets loaded. A snapshot of the log is provided below:

Data load request has been received for table default.test_sort_col
17/05/17 16:46:51 ERROR UnsafeBatchParallelReadMergeSorterImpl: pool-20-thread-1
java.lang.ClassCastException: java.lang.String cannot be cast to [B
    at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:89)
    at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:74)
    at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows.addRowBatch(UnsafeSortDataRows.java:170)
    at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:150)
    at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:117)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/05/17 16:46:51 AUDIT UnsafeInmemoryHolder: [pallavi][pallavi][Thread-83]Processing unsafe inmemory rows page with size : 2
17/05/17 16:46:52 ERROR DataLoadExecutor: [Executor task launch worker-0][partitionID:default_test_sort_col_296196d0-a469-4273-a382-c41531c32591] Data Load is partially success for table test_sort_col
17/05/17 16:46:52 AUDIT CarbonDataRDDFactory$: [pallavi][pallavi][Thread-1]Data load is partially successful for default.test_sort_col

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|Pallavi| 25|
|  2|Geetika| 24|
+---+-------+---+

What is the correct behavior: should we add the measures listed in SORT_COLUMNS to DICTIONARY_INCLUDE, or should we modify the load flow to handle null values?

(For reference, the load statement and the table definition for the failing case are sketched below.)
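A minimal sketch of the load statement, for completeness; the input path and the OPTIONS values here are only indicative placeholders, not the exact command from my shell:

LOAD DATA INPATH 'hdfs://localhost:54310/data/test_sort_col.csv'   -- placeholder path
INTO TABLE test_sort_col
OPTIONS('DELIMITER'=',', 'FILEHEADER'='id,name,age')               -- CSV has no header row, so FILEHEADER maps the columns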
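And the table definition for the failing run would look like this: the same DDL with id and age no longer in DICTIONARY_INCLUDE (since they were its only entries, the property is dropped entirely), assuming the table is dropped and recreated under the same name:

CREATE TABLE test_sort_col (
  id INT,
  name STRING,
  age INT
) STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES('SORT_COLUMNS'='id,age')   -- id and age are measures, no DICTIONARY_INCLUDE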
On Tue, May 16, 2017 at 11:38 AM, Pallavi Singh <[email protected]> wrote:

> Hi Community,
>
> While working with the sort columns, I saw that if the column listed in
> the sort_column happens to have a null value in the data while sorting,
> then the row corresponding to that null value is eliminated from the
> result set.
> Is this correct behavior? Ideally null values should be sorted and
> listed on the top in the result set.

--
Regards | Pallavi Singh
Software Consultant
