[GitHub] carbondata pull request #2706: [CARBONDATA-2927] multiple issue fixes for va...
Github user asfgit closed the pull request at: https://github.com/apache/carbondata/pull/2706 ---
Github user ajantha-bhat commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r218474623 --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java --- @@ -239,6 +239,7 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException { * @return false if any varchar column page cannot add one more value(2MB) */ private boolean isVarcharColumnFull(CarbonRow row) { +//TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB --- End diff -- @xuchuanyin @kevinjmh @ravipesala @kumarvishal09: As per the discussion, let us handle this with a configurable page size [from 1 MB to 2GB (snappy max)], split the complex child pages here, and add per-column validation based on the row. This will be analyzed further; I will open a discussion in the community, and a separate PR will be raised for this. ---
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r218415262 --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java --- @@ -239,6 +239,7 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException { * @return false if any varchar column page cannot add one more value(2MB) */ private boolean isVarcharColumnFull(CarbonRow row) { +//TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB --- End diff -- I am not sure how we came to the conclusion of 2MB. There is no guarantee that we always sort the data through UnsafeSortDataRows. What about the no-sort case? And how do we support a user who wants to add a 100MB varchar? Also, this is not limited to varchar; we should consider complex and string columns here as well. @ajantha-bhat Please remove that TODO, but we need to refactor the code to keep the page size within the snappy max compressed length for complex and string datatypes as well. ---
Github user kevinjmh commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r218403711 --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java --- @@ -239,6 +239,7 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException { * @return false if any varchar column page cannot add one more value(2MB) */ private boolean isVarcharColumnFull(CarbonRow row) { +//TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB --- End diff -- The original implementation used `2MB` to ensure the next varchar column value can be filled safely, because the size of a single column value won't exceed the size of a row. If UnsafeSortDataRows can exceed 2MB (growing dynamically), then we cannot check whether we have enough space for the next value, because we are not sure how much space it will take. So the column page size check should be run before adding the row to `dataRows`. ---
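The ordering kevinjmh describes can be sketched as a guard consulted with the incoming row's size before the row is appended, since the size of any later row is unknowable at that point. `PageSizeGuard` and its members are illustrative names for this sketch, not actual CarbonData code.

```java
// Sketch: run the size check with the incoming row's size in hand, BEFORE the
// row is appended to the page, because the size of the *next* row is unknown.
// PageSizeGuard and its members are illustrative names, not CarbonData code.
class PageSizeGuard {
  static final long PAGE_LIMIT = 2L * 1024 * 1024; // the 2MB headroom discussed above

  private long currentPageBytes = 0;

  // True if the page must be flushed before this row can be added safely.
  boolean shouldFlushBefore(long rowBytes) {
    return currentPageBytes + rowBytes > PAGE_LIMIT;
  }

  void addRow(long rowBytes) {
    currentPageBytes += rowBytes;
  }
}
```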
Github user ajantha-bhat commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r218403232 --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java --- @@ -239,6 +239,7 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException { * @return false if any varchar column page cannot add one more value(2MB) */ private boolean isVarcharColumnFull(CarbonRow row) { +//TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB --- End diff -- A complex column can also grow very big (so checking only varchar is not good). Also, columns can now grow beyond 2MB, so we need to modify this check. This can be handled in a separate PR. For now this method has no impact, since "if there is no space even for 2MB, there will never be space for more than 2MB", so the functionality remains the same. ---
Github user ajantha-bhat commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r218047331 --- Diff: core/src/main/java/org/apache/carbondata/core/util/ReUsableByteArrayDataOutputStream.java --- @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.core.util; + +import java.io.ByteArrayOutputStream; +import java.io.DataOutputStream; + +/** + * wrapper around DataOutputStream. Which clears the buffer for reuse + */ +public final class ReUsableByteArrayDataOutputStream extends DataOutputStream { + + private ByteArrayOutputStream outputStream; + + public ReUsableByteArrayDataOutputStream(ByteArrayOutputStream outputStream) { +super(outputStream); +this.outputStream = outputStream; + } + + public void resetByteArrayOutputStream() { +outputStream.reset(); + } + + public int getByteArrayOutputStreamSize() { --- End diff -- done ---
Github user ajantha-bhat commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r218047291 --- Diff: core/src/main/java/org/apache/carbondata/core/util/ReUsableByteArrayDataOutputStream.java --- @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.core.util; + +import java.io.ByteArrayOutputStream; +import java.io.DataOutputStream; + +/** + * wrapper around DataOutputStream. Which clears the buffer for reuse + */ +public final class ReUsableByteArrayDataOutputStream extends DataOutputStream { + + private ByteArrayOutputStream outputStream; + + public ReUsableByteArrayDataOutputStream(ByteArrayOutputStream outputStream) { +super(outputStream); +this.outputStream = outputStream; + } + + public void resetByteArrayOutputStream() { --- End diff -- done ---
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r218038298 --- Diff: core/src/main/java/org/apache/carbondata/core/util/ReUsableByteArrayDataOutputStream.java --- @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.core.util; + +import java.io.ByteArrayOutputStream; +import java.io.DataOutputStream; + +/** + * wrapper around DataOutputStream. Which clears the buffer for reuse + */ +public final class ReUsableByteArrayDataOutputStream extends DataOutputStream { + + private ByteArrayOutputStream outputStream; + + public ReUsableByteArrayDataOutputStream(ByteArrayOutputStream outputStream) { +super(outputStream); +this.outputStream = outputStream; + } + + public void resetByteArrayOutputStream() { --- End diff -- Just change name to reset ---
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r218038401 --- Diff: core/src/main/java/org/apache/carbondata/core/util/ReUsableByteArrayDataOutputStream.java --- @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.carbondata.core.util; + +import java.io.ByteArrayOutputStream; +import java.io.DataOutputStream; + +/** + * wrapper around DataOutputStream. Which clears the buffer for reuse + */ +public final class ReUsableByteArrayDataOutputStream extends DataOutputStream { + + private ByteArrayOutputStream outputStream; + + public ReUsableByteArrayDataOutputStream(ByteArrayOutputStream outputStream) { +super(outputStream); +this.outputStream = outputStream; + } + + public void resetByteArrayOutputStream() { +outputStream.reset(); + } + + public int getByteArrayOutputStreamSize() { --- End diff -- Just change name to getSize ---
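Putting both rename suggestions together, the wrapper would look roughly like the sketch below, based on the class shown in the diff with `reset`/`getSize` applied (shown package-private here for brevity):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

// The wrapper from the diff with both suggested renames applied
// (resetByteArrayOutputStream -> reset, getByteArrayOutputStreamSize -> getSize).
final class ReUsableByteArrayDataOutputStream extends DataOutputStream {

  private final ByteArrayOutputStream outputStream;

  ReUsableByteArrayDataOutputStream(ByteArrayOutputStream outputStream) {
    super(outputStream);
    this.outputStream = outputStream;
  }

  // Clears the underlying buffer so the stream can be reused for the next row.
  void reset() {
    outputStream.reset();
  }

  // Bytes written since the last reset (delegates to the backing buffer,
  // since DataOutputStream's own counter is not cleared by reset()).
  int getSize() {
    return outputStream.size();
  }
}
```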
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r217987255 --- Diff: processing/src/main/java/org/apache/carbondata/processing/store/CarbonFactDataHandlerColumnar.java --- @@ -239,6 +239,7 @@ public void addDataToStore(CarbonRow row) throws CarbonDataWriterException { * @return false if any varchar column page cannot add one more value(2MB) */ private boolean isVarcharColumnFull(CarbonRow row) { +//TODO: test and remove this as now UnsafeSortDataRows can exceed 2MB --- End diff -- This is because a column page cannot exceed 2GB (actually it is 1.67GB, since snappy cannot compress a bigger size in one run), so there is no need to add a comment here ---
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r217049111 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/unsafe/UnsafeSortDataRows.java --- @@ -94,21 +94,30 @@ private final long taskId; - public UnsafeSortDataRows(SortParameters parameters, - UnsafeIntermediateMerger unsafeInMemoryIntermediateFileMerger, int inMemoryChunkSize) { -this.parameters = parameters; -this.tableFieldStat = new TableFieldStat(parameters); -this.rowBuffer = new ThreadLocal() { + public static ThreadLocal getRowBuffer() { +return rowBuffer; + } + + static { +rowBuffer = new ThreadLocal() { --- End diff -- Please try using a DataOutputStream backed by a ByteArrayOutputStream; it can expand dynamically. ---
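A minimal sketch of that suggestion: because `ByteArrayOutputStream` grows on demand, a row larger than the initial capacity can be packed without a fixed-size buffer overflowing. The `packRow` helper and its length-prefixed layout are assumptions for this sketch, not the actual sort-temp row format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustration only: ByteArrayOutputStream grows on demand, so a row larger
// than the initial capacity can be packed without a fixed-size check.
// packRow and its length-prefixed layout are assumptions for this sketch.
class GrowingRowBuffer {
  static byte[] packRow(byte[][] columns) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream(2 * 1024 * 1024); // 2MB initial hint
    DataOutputStream out = new DataOutputStream(bos);
    for (byte[] col : columns) {
      out.writeInt(col.length); // length prefix for each column value
      out.write(col);           // the backing array expands automatically
    }
    return bos.toByteArray();
  }
}
```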
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r217048602 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/unsafe/UnsafeCarbonRowPage.java --- @@ -59,12 +60,11 @@ public UnsafeCarbonRowPage(TableFieldStat tableFieldStat, MemoryBlock memoryBloc this.taskId = taskId; buffer = new IntPointerBuffer(this.taskId); this.dataBlock = memoryBlock; -// TODO Only using 98% of space for safe side.May be we can have different logic. -sizeToBeUsed = dataBlock.size() - (dataBlock.size() * 5) / 100; +sizeToBeUsed = dataBlock.size(); --- End diff -- Please keep this check; it reserves memory for a row in case it exceeds the limit. ---
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r216885804 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/unsafe/UnsafeSortDataRows.java --- @@ -240,11 +249,11 @@ public void addRow(Object[] row) throws CarbonSortKeyAndGroupByException { throw new CarbonSortKeyAndGroupByException(ex); } rowPage.addRow(row, rowBuffer.get()); - } catch (Exception e) { -LOGGER.error( -"exception occurred while trying to acquire a semaphore lock: " + e.getMessage()); -throw new CarbonSortKeyAndGroupByException(e); } +} catch (Exception e) { + LOGGER --- End diff -- Bad indent. We can move the message to the next line and keep the method call on this line. ---
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r216884982 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/SortStepRowHandler.java --- @@ -570,23 +589,31 @@ public int writeRawRowAsIntermediateSortTempRowToUnsafeMemory(Object[] row, private void packNoSortFieldsToBytes(Object[] row, ByteBuffer rowBuffer) { // convert dict & no-sort for (int idx = 0; idx < this.dictNoSortDimCnt; idx++) { + // cannot exceed default 2MB, hence no need to call ensureArraySize rowBuffer.putInt((int) row[this.dictNoSortDimIdx[idx]]); } // convert no-dict & no-sort for (int idx = 0; idx < this.noDictNoSortDimCnt; idx++) { byte[] bytes = (byte[]) row[this.noDictNoSortDimIdx[idx]]; + // cannot exceed default 2MB, hence no need to call ensureArraySize --- End diff -- For one column it may not exceed 2MB, but what if we have lots of no-sort-no-dict columns? ---
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r216884722 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/SortStepRowHandler.java --- @@ -559,7 +572,13 @@ public int writeRawRowAsIntermediateSortTempRowToUnsafeMemory(Object[] row, return size; } - + private void validateUnsafeMemoryBlockSizeLimit(long unsafeRemainingLength, int size) --- End diff -- Please improve the parameter name 'size' for better readability; it seems to represent the requested size. ---
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r216885444 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/unsafe/UnsafeCarbonRowPage.java --- @@ -59,12 +60,11 @@ public UnsafeCarbonRowPage(TableFieldStat tableFieldStat, MemoryBlock memoryBloc this.taskId = taskId; buffer = new IntPointerBuffer(this.taskId); this.dataBlock = memoryBlock; -// TODO Only using 98% of space for safe side.May be we can have different logic. -sizeToBeUsed = dataBlock.size() - (dataBlock.size() * 5) / 100; +sizeToBeUsed = dataBlock.size(); --- End diff -- Is the old comment outdated? Have you ensured the 'safe side' it mentioned? ---
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r216885323 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/SortStepRowHandler.java --- @@ -598,26 +625,53 @@ private void packNoSortFieldsToBytes(Object[] row, ByteBuffer rowBuffer) { tmpValue = row[this.measureIdx[idx]]; tmpDataType = this.dataTypes[idx]; if (null == tmpValue) { +// can exceed default 2MB, hence need to call ensureArraySize +rowBuffer = UnsafeSortDataRows +.ensureArraySize(1); rowBuffer.put((byte) 0); continue; } + // can exceed default 2MB, hence need to call ensureArraySize + rowBuffer = UnsafeSortDataRows + .ensureArraySize(1); --- End diff -- Bad indent; this can be moved to the previous line. The same applies to line#642 and line#647. ---
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r216884374 --- Diff: core/src/main/java/org/apache/carbondata/core/memory/UnsafeMemoryManager.java --- @@ -200,7 +200,7 @@ public static MemoryBlock allocateMemoryWithRetry(long taskId, long size) } if (baseBlock == null) { INSTANCE.printCurrentMemoryUsage(); - throw new MemoryException("Not enough memory"); + throw new MemoryException("Not enough memory, increase carbon.unsafe.working.memory.in.mb"); --- End diff -- I think you can optimize the error message to `Not enough unsafe working memory (total: , available: , request: )` ---
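The suggested diagnostic could be produced by a small helper like this (hypothetical; the real `UnsafeMemoryManager` counters and exact wording may differ):

```java
// Hypothetical helper producing the suggested error message; the real
// UnsafeMemoryManager counters and exact wording may differ.
class MemoryErrorMessage {
  static String notEnoughMemory(long totalBytes, long availableBytes, long requestedBytes) {
    return String.format(
        "Not enough unsafe working memory (total: %d bytes, available: %d bytes, request: %d bytes). "
            + "Consider increasing carbon.unsafe.working.memory.in.mb",
        totalBytes, availableBytes, requestedBytes);
  }
}
```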
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r216885250 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/SortStepRowHandler.java --- @@ -598,26 +625,53 @@ private void packNoSortFieldsToBytes(Object[] row, ByteBuffer rowBuffer) { tmpValue = row[this.measureIdx[idx]]; tmpDataType = this.dataTypes[idx]; if (null == tmpValue) { +// can exceed default 2MB, hence need to call ensureArraySize +rowBuffer = UnsafeSortDataRows +.ensureArraySize(1); --- End diff -- bad indent, can be moved to previous line ---
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r216886119 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/unsafe/UnsafeSortDataRows.java --- @@ -326,6 +335,19 @@ private void startFileBasedMerge() throws InterruptedException { dataSorterAndWriterExecutorService.awaitTermination(2, TimeUnit.DAYS); } + public static ByteBuffer ensureArraySize(int requestSize) { --- End diff -- If we increase the rowBuffer at runtime, is there a way to decrease it? Or if there is no need to do so, how long will this rowBuffer last? ---
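For context, a grow-on-demand `ensureArraySize` could look roughly like the sketch below; note it only ever grows the buffer, never shrinks it, which is the lifetime concern being raised. Names and growth policy are illustrative assumptions, not the actual implementation.

```java
import java.nio.ByteBuffer;

// Sketch of a grow-on-demand buffer check. It only ever grows: nothing here
// shrinks the buffer back. Names and growth policy are illustrative.
class RowBufferGrower {
  static ByteBuffer ensureArraySize(ByteBuffer buffer, int requestSize) {
    if (buffer.remaining() < requestSize) {
      // Grow to double the capacity, or to exactly what the request needs.
      int newCapacity = Math.max(buffer.capacity() * 2, buffer.position() + requestSize);
      ByteBuffer grown = ByteBuffer.allocate(newCapacity);
      buffer.flip();
      grown.put(buffer); // carry over the bytes already packed for this row
      return grown;
    }
    return buffer;
  }
}
```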
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r216885202 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/SortStepRowHandler.java --- @@ -570,23 +589,31 @@ public int writeRawRowAsIntermediateSortTempRowToUnsafeMemory(Object[] row, private void packNoSortFieldsToBytes(Object[] row, ByteBuffer rowBuffer) { // convert dict & no-sort for (int idx = 0; idx < this.dictNoSortDimCnt; idx++) { + // cannot exceed default 2MB, hence no need to call ensureArraySize rowBuffer.putInt((int) row[this.dictNoSortDimIdx[idx]]); } // convert no-dict & no-sort for (int idx = 0; idx < this.noDictNoSortDimCnt; idx++) { byte[] bytes = (byte[]) row[this.noDictNoSortDimIdx[idx]]; + // cannot exceed default 2MB, hence no need to call ensureArraySize rowBuffer.putShort((short) bytes.length); rowBuffer.put(bytes); } // convert varchar dims for (int idx = 0; idx < this.varcharDimCnt; idx++) { byte[] bytes = (byte[]) row[this.varcharDimIdx[idx]]; + // can exceed default 2MB, hence need to call ensureArraySize + rowBuffer = UnsafeSortDataRows --- End diff -- Should we call this method per row per column? Since in most scenarios 2MB per row is enough, will the method call here cause a performance decrease? ---
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r216885637 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/unsafe/UnsafeSortDataRows.java --- @@ -72,7 +72,7 @@ private SortParameters parameters; private TableFieldStat tableFieldStat; - private ThreadLocal rowBuffer; + private static ThreadLocal rowBuffer; --- End diff -- I think the 'static' here may cause problems for concurrent loading. Each load should have its own rowBuffer. ---
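The concurrency concern can be illustrated by contrast: with an instance-level `ThreadLocal`, each load (each `UnsafeSortDataRows` instance) gets its own per-thread buffer, whereas a static field shares one per-thread buffer across all concurrent loads on the same thread pool. This is a simplified sketch, not the actual class.

```java
import java.nio.ByteBuffer;

// Simplified contrast to the static field above: an instance-level ThreadLocal
// gives every load (every instance) its own per-thread buffer, so concurrent
// loads cannot interfere with each other's buffer sizing.
class SortDataRowsSketch {
  private final ThreadLocal<ByteBuffer> rowBuffer;

  SortDataRowsSketch(final int initialSize) {
    this.rowBuffer = ThreadLocal.withInitial(() -> ByteBuffer.allocate(initialSize));
  }

  ByteBuffer buffer() {
    return rowBuffer.get();
  }
}
```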
Github user xuchuanyin commented on a diff in the pull request: https://github.com/apache/carbondata/pull/2706#discussion_r216885885 --- Diff: processing/src/main/java/org/apache/carbondata/processing/loading/sort/unsafe/UnsafeSortDataRows.java --- @@ -326,6 +335,19 @@ private void startFileBasedMerge() throws InterruptedException { dataSorterAndWriterExecutorService.awaitTermination(2, TimeUnit.DAYS); } + public static ByteBuffer ensureArraySize(int requestSize) { --- End diff -- Please add a comment stating that this method is used to grow the rowBuffer during loading. ---