Re: Query regarding behaviour of sort column

manish gupta Wed, 17 May 2017 07:30:34 -0700

Hi Pallavi,

Behavior of column should be same irrespective of whether it is dictionary,
no dictionary, measure or sort column.


When you specify any numeric column as sort column, it is processed as no
dictionary column.

In your case as the BAD_RECORDS_FORCE  is force so data load should not
fail and consistent behavior should be displayed. However there is a bug in
the code because of which some inconsistency is being encountered.

I will raise a jira for the same and track the issue.

Regards
Manish Gupta

On Wed, May 17, 2017 at 4:49 PM, Pallavi Singh <[email protected]>
wrote:

> Hi Community,
>
> While working with the above problem I found two discussions regarding the
> sort column
>
> 1. https://github.com/apache/carbondata/pull/635 which states : If the
> table need be sorted by a measure, we should use dictionary_include to add
> it to dimension list.
>
> 2. https://github.com/apache/carbondata/pull/757 : if a column of
> SORT_COLUMNS is a measure before, now this column will be created as a
> dimension. And this dimension is a no-dicitonary column(Better to use other
> direct-dictionary).
>
> Now if the columns in my sort_column be measures then I have to add the
> same columns in the dictionary_include other wise in case of null value in
> case of sort_column column the loading fails after the first null encounter
> itself.
>
> for example like this :
>
> CREATE TABLE test_sort_col
>    | (id INT,
>    | name STRING,
>    | age INT
>    | )STORED BY 'org.apache.carbondata.format'
>    | TBLPROPERTIES('SORT_COLUMNS'='id,age','DICTIONARY_INCLUDE'='id,age')
>
> and the csv has following data :
>
> 1,Pallavi,25
> 2,Geetika,24
> 3,Prabhat,twenty six
> 7,Neha,25
> 2,Geetika,22
> 3,Sangeeta,26
>
> and the load gets successful like shown below :
>
> +---+--------+----+
> | id|    name| age|
> +---+--------+----+
> |  1| Pallavi|  25|
> |  2| Geetika|  22|
> |  2| Geetika|  24|
> |  3| Prabhat|null|
> |  3|Sangeeta|  26|
> |  7|    Neha|  25|
> +---+--------+----+
>
> now if i remove the measures of the sort_column from the
> dictionary_include in the  query I get an error and partial data gets
> loaded, snapshot is provided below
>
> Data load request has been received for table default.test_sort_col
> 17/05/17 16:46:51 ERROR UnsafeBatchParallelReadMergeSorterImpl:
> pool-20-thread-1
> java.lang.ClassCastException: java.lang.String cannot be cast to [B
> at org.apache.carbondata.processing.newflow.sort.
> unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:89)
> at org.apache.carbondata.processing.newflow.sort.
> unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:74)
> at org.apache.carbondata.processing.newflow.sort.
> unsafe.UnsafeSortDataRows.addRowBatch(UnsafeSortDataRows.java:170)
> at org.apache.carbondata.processing.newflow.sort.impl.
> UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(
> UnsafeBatchParallelReadMergeSorterImpl.java:150)
> at org.apache.carbondata.processing.newflow.sort.impl.
> UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(
> UnsafeBatchParallelReadMergeSorterImpl.java:117)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/05/17 16:46:51 AUDIT UnsafeInmemoryHolder: 
> [pallavi][pallavi][Thread-83]Processing
> unsafe inmemory rows page with size : 2
> 17/05/17 16:46:52 ERROR DataLoadExecutor: [Executor task launch
> worker-0][partitionID:default_test_sort_col_296196d0-a469-4273-a382-c41531c32591]
> Data Load is partially success for table test_sort_col
> 17/05/17 16:46:52 AUDIT CarbonDataRDDFactory$: 
> [pallavi][pallavi][Thread-1]Data
> load is partially successful for default.test_sort_col
> +---+-------+---+
> | id|   name|age|
> +---+-------+---+
> |  1|Pallavi| 25|
> |  2|Geetika| 24|
> +---+-------+---+
>
> What is the correct behavior, should we add measures in sort_column to
> dictionary_include or should we modify the load flow to handle null values?
>
>
> On Tue, May 16, 2017 at 11:38 AM, Pallavi Singh <[email protected]>
> wrote:
>
>> Hi Community,
>>
>> While working with the sort columns, I saw that if the column listed in
>> the sort_column happens to have a null value in the data while sorting ,
>> then the row corresponding to that null value is eliminated from the result
>> set. Is this correct behavior? Ideally null values should be sorted and
>> listed on the top in the result set.
>>
>
> --
> Regards | Pallavi Singh
> Software Consultant
>

Re: Query regarding behaviour of sort column

Reply via email to