Hi Community,

While working with the above problem, I found two discussions regarding the sort columns:
1. https://github.com/apache/carbondata/pull/635, which states: if the table needs to be sorted by a measure, we should use DICTIONARY_INCLUDE to add it to the dimension list.
2. https://github.com/apache/carbondata/pull/757, which states: if a column in SORT_COLUMNS was a measure before, this column will now be created as a dimension, and this dimension is a no-dictionary column (better to use another direct-dictionary column).

So if the columns in my SORT_COLUMNS are measures, I have to add the same columns to DICTIONARY_INCLUDE; otherwise, when a SORT_COLUMNS column contains a null value, the load fails at the first null it encounters. For example, with a table like this:

CREATE TABLE test_sort_col (
  id INT,
  name STRING,
  age INT
) STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES('SORT_COLUMNS'='id,age', 'DICTIONARY_INCLUDE'='id,age')

and a CSV with the following data:

1,Pallavi,25
2,Geetika,24
3,Prabhat,twenty six
7,Neha,25
2,Geetika,22
3,Sangeeta,26

the load succeeds, as shown below (note that "twenty six" cannot be parsed as an INT, so the age for Prabhat is loaded as null):

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1| Pallavi|  25|
|  2| Geetika|  22|
|  2| Geetika|  24|
|  3| Prabhat|null|
|  3|Sangeeta|  26|
|  7|    Neha|  25|
+---+--------+----+

Now, if I remove the SORT_COLUMNS measures from DICTIONARY_INCLUDE in the CREATE TABLE statement, I get an error and only partial data gets loaded. A snapshot of the log is provided below:

Data load request has been received for table default.test_sort_col
17/05/17 16:46:51 ERROR UnsafeBatchParallelReadMergeSorterImpl: pool-20-thread-1
java.lang.ClassCastException: java.lang.String cannot be cast to [B
    at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:89)
    at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:74)
    at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows.addRowBatch(UnsafeSortDataRows.java:170)
    at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:150)
    at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:117)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/05/17 16:46:51 AUDIT UnsafeInmemoryHolder: [pallavi][pallavi][Thread-83]Processing unsafe inmemory rows page with size : 2
17/05/17 16:46:52 ERROR DataLoadExecutor: [Executor task launch worker-0][partitionID:default_test_sort_col_296196d0-a469-4273-a382-c41531c32591] Data Load is partially success for table test_sort_col
17/05/17 16:46:52 AUDIT CarbonDataRDDFactory$: [pallavi][pallavi][Thread-1]Data load is partially successful for default.test_sort_col

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|Pallavi| 25|
|  2|Geetika| 24|
+---+-------+---+

What is the correct behavior: should we add the measures listed in SORT_COLUMNS to DICTIONARY_INCLUDE, or should we modify the load flow to handle null values?

(For reference, the load statement and the table definition for the failing case are sketched below.)
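A minimal sketch of the load statement, for completeness; the input path and the OPTIONS values here are only indicative placeholders, not the exact command from my shell:

LOAD DATA INPATH 'hdfs://localhost:54310/data/test_sort_col.csv'   -- placeholder path
INTO TABLE test_sort_col
OPTIONS('DELIMITER'=',', 'FILEHEADER'='id,name,age')               -- CSV has no header row, so FILEHEADER maps the columns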
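And the table definition for the failing run would look like this: the same DDL with id and age no longer in DICTIONARY_INCLUDE (since they were its only entries, the property is dropped entirely), assuming the table is dropped and recreated under the same name:

CREATE TABLE test_sort_col (
  id INT,
  name STRING,
  age INT
) STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES('SORT_COLUMNS'='id,age')   -- id and age are measures, no DICTIONARY_INCLUDE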
On Tue, May 16, 2017 at 11:38 AM, Pallavi Singh <[email protected]> wrote:

> Hi Community,
>
> While working with the sort columns, I saw that if the column listed in
> the sort_column happens to have a null value in the data while sorting,
> then the row corresponding to that null value is eliminated from the
> result set.
> Is this correct behavior? Ideally null values should be sorted and
> listed on the top in the result set.

--
Regards | Pallavi Singh
Software Consultant
