Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector
Hi Micah, thanks for your suggestion.

You are right, the main difference between FixedSizeListVector and ListVector is the offsetBuffer, but I think this could be handled by overriding allocateNewSafe(), which calls allocateOffsetBuffer() in BaseRepeatedValueVector; that way, the offsetBuffer in FixedSizeListVector would remain allocator.getEmpty(). Meanwhile, we could add getStartIndex(int index)/getEndIndex(int index) APIs so that each class handles its own read logic, which could be used in getObject(int index) or in the encoding parts. What's more, no new interface would need to be introduced.

What do you think?

Thanks,
Ji Liu

--
From: Micah Kornfield
Send Time: 2019-08-11 (Sunday) 08:47
To: dev; Ji Liu
Subject: Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

Hi Ji Liu,
I think having a common interface/base class for the two makes sense (but I don't have historical context) from a reading-data perspective. I think the change would need to be somewhere above BaseRepeatedValueVector, since FixedSizeListVector doesn't contain an offset buffer, and that field is contained on BaseRepeatedValueVector.

Thanks,
Micah

On Sat, Aug 10, 2019 at 5:25 PM Ji Liu wrote:
> Hi, all
>
> While working on the issue to implement dictionary-encoded subfields [1][2],
> I found that FixedSizeListVector does not extend ListVector (thanks to Micah
> for pointing this out; I am curious why FixedSizeListVector was implemented
> this way). Since FixedSizeListVector is a specific case of ListVector, should
> we make the former extend the latter to reduce the substantial duplicated
> logic in these two classes and their writer/reader classes?
>
> Thanks,
> Ji Liu
>
> [1] https://issues.apache.org/jira/browse/ARROW-1175
> [2] https://github.com/apache/arrow/pull/4972
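The shape of the proposal (shared read logic parameterized by a per-class start/end computation) can be sketched roughly as below. These classes are simplified stand-ins, not the real Arrow vectors — plain int arrays replace Arrow buffers, and names mirror the APIs suggested in the thread:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for ListVector: element i spans [offsets[i], offsets[i + 1]).
class SketchListVector {
    int[] offsets;
    int[] data;

    int getStartIndex(int index) { return offsets[index]; }
    int getEndIndex(int index)   { return offsets[index + 1]; }

    // Shared read logic: works for any subclass that overrides only the
    // start/end computation, which is the point of the proposal.
    List<Integer> getObject(int index) {
        List<Integer> out = new ArrayList<>();
        for (int i = getStartIndex(index); i < getEndIndex(index); i++) {
            out.add(data[i]);
        }
        return out;
    }
}

// Stand-in for FixedSizeListVector: no offset buffer at all, since element
// ranges are derived from the fixed list size.
class SketchFixedSizeListVector extends SketchListVector {
    final int listSize;
    SketchFixedSizeListVector(int listSize) { this.listSize = listSize; }

    @Override int getStartIndex(int index) { return index * listSize; }
    @Override int getEndIndex(int index)   { return (index + 1) * listSize; }
}
```

With this structure, getObject and the dictionary-encoding paths could live once in the base class, and the fixed-size variant never touches an offset buffer.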
Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector
Hi Ji Liu,
I think having a common interface/base class for the two makes sense (but I don't have historical context) from a reading-data perspective. I think the change would need to be somewhere above BaseRepeatedValueVector, since FixedSizeListVector doesn't contain an offset buffer, and that field is contained on BaseRepeatedValueVector.

Thanks,
Micah

On Sat, Aug 10, 2019 at 5:25 PM Ji Liu wrote:
> Hi, all
>
> While working on the issue to implement dictionary-encoded subfields [1][2],
> I found that FixedSizeListVector does not extend ListVector (thanks to Micah
> for pointing this out; I am curious why FixedSizeListVector was implemented
> this way). Since FixedSizeListVector is a specific case of ListVector, should
> we make the former extend the latter to reduce the substantial duplicated
> logic in these two classes and their writer/reader classes?
>
> Thanks,
> Ji Liu
>
> [1] https://issues.apache.org/jira/browse/ARROW-1175
> [2] https://github.com/apache/arrow/pull/4972
Re: [Discuss][Java] 64-bit lengths for ValueVectors
Hi Jacques,
I definitely understand these concerns, and this change is risky because it is so large. Perhaps creating a new hierarchy might be the cleanest way of dealing with this. This could have other benefits, like cleaning up some cruft around dictionary encoding and "orphaned" methods.

Per past e-mail threads, I agree it is beneficial to have two separate reference implementations that can communicate fully, and my intent here was to close that gap.

> Trying to determine the ramifications of these changes would be challenging
> and time consuming against all the different ways we interact with the Arrow
> Java library.

Understood. I took a quick look at Dremio-OSS; it seems like it has a simple Java build system? If it is helpful, I can try to get a fork running that at least compiles against this PR. My plan would be to cast any place that was changed to return a long back to an int, so in essence the Dremio algorithms would remain 32-bit implementations. I don't have the infrastructure to test this change properly from a distributed-systems perspective, so it would still take some time from Dremio to validate for regressions.

> I'm not saying I'm against this but want to make sure we've explored all less
> disruptive options before considering changing something this fundamental
> (especially when I generally hold the view that large cell counts against
> massive contiguous memory is an anti pattern to scalable analytical
> processing--purely subjective of course).

I'm open to other ideas here, as well. I don't think it is out of the question to leave the Java implementation as 32-bit, but if we do, then I think we should consider a different strategy for reference implementations.

Thanks,
Micah

On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau wrote:
> Hey Micah, I didn't have a particular path in mind. Was thinking more along
> the lines of extra methods as opposed to separate classes.
> Arrow hasn't historically been a place where we're writing algorithms in
> Java so the fact that they aren't there doesn't mean they don't exist. We
> have a large amount of code that depends on the current behavior that is
> deployed in hundreds of customer clusters (you can peruse our dremio repo
> to see how extensively we leverage Arrow if interested). Trying to
> determine the ramifications of these changes would be challenging and time
> consuming against all the different ways we interact with the Arrow Java
> library. I'm not saying I'm against this but want to make sure we've
> explored all less disruptive options before considering changing something
> this fundamental (especially when I generally hold the view that large cell
> counts against massive contiguous memory is an anti pattern to scalable
> analytical processing--purely subjective of course).
>
> On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield wrote:
>
> > Hi Jacques,
> > What avenue were you thinking for supporting both paths? I didn't want
> > to pursue a different class hierarchy, because I felt like that would
> > effectively fork the code base, but that is potentially an option that
> > would allow us to have a complete reference implementation in Java that
> > can fully interact with C++, without major changes to this code.
> >
> > For supporting both APIs on the same classes/interfaces, I think they
> > roughly fall into three categories: changes to input parameters, changes
> > to output parameters, and algorithm changes.
> >
> > For inputs, changing from int to long is essentially a no-op from the
> > compiler perspective. From the limited micro-benchmarking this also
> > doesn't seem to have a performance impact. So we could keep two versions
> > of the methods that only differ on inputs, but it is not clear what the
> > value of that would be.
> >
> > For outputs, we can't support methods "long getLength()" and "int
> > getLength()" in the same class, so we would be forced into something like
> > "long getLength(boolean dummy)", which I think is less desirable.
> >
> > For algorithm changes, there did not appear to be too many places where
> > we actually loop over all elements (it is quite possible I missed
> > something here); the ones that I did find I was able to mitigate
> > performance penalties as noted above. Some of the current implementation
> > will get a lot slower for "large arrays", but we can likely fix those
> > later or in this PR with a nested while loop instead of 2 for loops.
> >
> > Thanks,
> > Micah
> >
> > On Saturday, August 10, 2019, Jacques Nadeau wrote:
> >
> >> This is a pretty massive change to the apis. I wonder how nasty it would
> >> be to just support both paths. Have you evaluated how complex that would be?
> >>
> >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield wrote:
> >>
> >>> After more investigation, it looks like Float8Benchmarks at least on my
> >>> machine are within the range of noise.
> >>>
> >>> For BitVectorHelper I pushed a new commit [1], seems to
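The "outputs" constraint quoted above is a plain Java rule: a method signature excludes the return type, so `long getLength()` and `int getLength()` cannot coexist on one class. A minimal illustration, using a hypothetical interface rather than the actual Arrow API — the separately named 64-bit accessor here is just one possible workaround alongside the thread's dummy-parameter idea:

```java
// Hypothetical vector interface, not the real Arrow one.
interface SketchVector {
    int getLength();          // existing 32-bit API

    // long getLength();      // would NOT compile: duplicate signature,
    //                        // return type is not part of the signature

    long getLengthAsLong();   // a separately named 64-bit variant

    // The thread's alternative: an otherwise-unused parameter to force a
    // distinct signature, e.g. long getLength(boolean dummy);
}

class FixedLenVector implements SketchVector {
    private final long length;
    FixedLenVector(long length) { this.length = length; }

    // 32-bit view throws if the true length overflows an int.
    @Override public int getLength() { return Math.toIntExact(length); }
    @Override public long getLengthAsLong() { return length; }
}
```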
Re: [Java] Arrow PR queue build up?
Hi Jacques, thanks for your valuable feedback. Sorry for the lack of discussion; some of these PRs are small changes/bugfixes which did not seem to warrant one. You are right that some PRs turned out to be more complex than we thought during the review process, and a discussion on the ML/JIRA beforehand would actually help. I will avoid this situation as far as possible in future work.

Thanks,
Ji Liu

--
From: Jacques Nadeau
Send Time: 2019-08-11 (Sunday) 03:36
To: dev; Micah Kornfield
Subject: Re: [Java] Arrow PR queue build up?

I think one of the issues here is that there is no upfront discussion about most of the changes that are being proposed. In most cases, a pull request just appears without one. This makes the reviews much more intensive and time consuming, as frequently there are questions about the validity, nature or rationale of the change. Having short design discussions before starting on these changes would ensure that the reviews are less involved, thus decreasing the effort associated with reviewing them.

On Thu, Aug 8, 2019 at 10:18 PM Micah Kornfield wrote:
> I did a pass through most of the open PRs (I might have missed one or
> two). Most had at least a few minor comments so the backlog hasn't gone
> down that much, but I expect most will be mergeable very soon.
>
> On Thu, Aug 8, 2019 at 9:44 AM Micah Kornfield wrote:
>
> > Not a full solution, but I've fallen behind a bit. I'm going to plan to
> > spend some time tonight at least reviewing PRs I've already done the
> > first pass on and I'll try to pick up some more.
> >
> > Having more engaged reviewers would be helpful though.
> >
> > Cheers,
> > Micah
> >
> > On Thursday, August 8, 2019, Wes McKinney wrote:
> >
> >> hi folks,
> >>
> >> Liya Fan and Ji Liu have about 24 open Java PRs between them if I
> >> counted right -- it seems like the project is having a hard time
> >> keeping up with code reviews and merging on these.
> >> It looks to me like
> >> they are making a lot of material improvements to the Java library
> >> where previously there had not been a lot of development, so I would
> >> like to see PRs get merged faster -- any ideas how we might be able to
> >> achieve that? I know that Micah has been spending a lot of time
> >> reviewing and giving feedback on these PRs so that is much appreciated
> >>
> >> Thanks,
> >> Wes
[DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector
Hi, all

While working on the issue to implement dictionary-encoded subfields [1][2], I found that FixedSizeListVector does not extend ListVector (thanks to Micah for pointing this out; I am curious why FixedSizeListVector was implemented this way). Since FixedSizeListVector is a specific case of ListVector, should we make the former extend the latter to reduce the substantial duplicated logic in these two classes and their writer/reader classes?

Thanks,
Ji Liu

[1] https://issues.apache.org/jira/browse/ARROW-1175
[2] https://github.com/apache/arrow/pull/4972
Re: [Discuss][Java] 64-bit lengths for ValueVectors
Hey Micah, I didn't have a particular path in mind. Was thinking more along the lines of extra methods as opposed to separate classes.

Arrow hasn't historically been a place where we're writing algorithms in Java, so the fact that they aren't there doesn't mean they don't exist. We have a large amount of code that depends on the current behavior that is deployed in hundreds of customer clusters (you can peruse our dremio repo to see how extensively we leverage Arrow if interested). Trying to determine the ramifications of these changes would be challenging and time consuming against all the different ways we interact with the Arrow Java library. I'm not saying I'm against this but want to make sure we've explored all less disruptive options before considering changing something this fundamental (especially when I generally hold the view that large cell counts against massive contiguous memory is an anti pattern to scalable analytical processing--purely subjective of course).

On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield wrote:
> Hi Jacques,
> What avenue were you thinking for supporting both paths? I didn't want
> to pursue a different class hierarchy, because I felt like that would
> effectively fork the code base, but that is potentially an option that
> would allow us to have a complete reference implementation in Java that
> can fully interact with C++, without major changes to this code.
>
> For supporting both APIs on the same classes/interfaces, I think they
> roughly fall into three categories: changes to input parameters, changes
> to output parameters, and algorithm changes.
>
> For inputs, changing from int to long is essentially a no-op from the
> compiler perspective. From the limited micro-benchmarking this also
> doesn't seem to have a performance impact. So we could keep two versions
> of the methods that only differ on inputs, but it is not clear what the
> value of that would be.
> For outputs, we can't support methods "long getLength()" and "int
> getLength()" in the same class, so we would be forced into something like
> "long getLength(boolean dummy)", which I think is less desirable.
>
> For algorithm changes, there did not appear to be too many places where we
> actually loop over all elements (it is quite possible I missed something
> here); the ones that I did find I was able to mitigate performance
> penalties as noted above. Some of the current implementation will get a
> lot slower for "large arrays", but we can likely fix those later or in
> this PR with a nested while loop instead of 2 for loops.
>
> Thanks,
> Micah
>
> On Saturday, August 10, 2019, Jacques Nadeau wrote:
>
>> This is a pretty massive change to the apis. I wonder how nasty it would
>> be to just support both paths. Have you evaluated how complex that would be?
>>
>> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield wrote:
>>
>>> After more investigation, it looks like Float8Benchmarks at least on my
>>> machine are within the range of noise.
>>>
>>> For BitVectorHelper I pushed a new commit [1]; it seems to bring the
>>> BitVectorHelper benchmarks back inline (and even with some improvement
>>> for getNullCountBenchmark).
>>>
>>> Benchmark                                         Mode  Cnt   Score    Error  Units
>>> BitVectorHelperBenchmarks.allBitsNullBenchmark    avgt    5   3.821 ±  0.031  ns/op
>>> BitVectorHelperBenchmarks.getNullCountBenchmark   avgt    5  14.884 ±  0.141  ns/op
>>>
>>> I applied the same pattern to other loops that I could find, and for any
>>> "for (long" loop on the critical path, I broke it up into two loops: the
>>> first loop does iteration by integer, the second finishes off for any
>>> long values. As a side note, it seems like optimization of loops using
>>> long counters at least has a semi-recent open bug for the JVM [2].
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>>
>>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield wrote:
>>>
>>> > Indeed, the BoundChecking and CheckNullForGet variables can make a big
>>> > difference. I didn't initially run the benchmarks with these turned on
>>> > (you can see the result from above with Float8Benchmarks). Here are
>>> > new numbers including with the flags enabled. It looks like using
>>> > longs might be a little bit slower; I'll see what I can do to mitigate
>>> > this.
>>> >
>>> > Ravindra also volunteered to try to benchmark the changes with Dremio's
>>> > code on today's sync call.
>>> >
>>> > New
>>> >
>>> > Benchmark                                         Mode  Cnt   Score    Error  Units
>>> > BitVectorHelperBenchmarks.allBitsNullBenchmark    avgt    5   4.176 ±  1.292  ns/op
>>> > BitVectorHelperBenchmarks.getNullCountBenchmark   avgt    5  26.102 ±  0.700  ns/op
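The two-loop pattern described in the quoted message (an int-counter loop for the bulk of the elements, then a long-counter loop for any remainder, since the JIT optimizes int-indexed loops better) can be sketched as follows. The accessor interface is a hypothetical stand-in for a buffer read, not an Arrow API:

```java
// Sketch of splitting a "for (long" loop into an int loop plus a long
// remainder loop. Hypothetical types; real Arrow code reads from ArrowBuf.
class LoopSplit {
    interface LongIndexed { long get(long index); }

    static long sum(LongIndexed values, long count) {
        long total = 0;
        // First loop: int counter covers everything up to Integer.MAX_VALUE.
        // The JIT applies unrolling/vectorization here that it may skip for
        // long counters (see the JDK bug referenced in the thread).
        int intCount = (int) Math.min(count, Integer.MAX_VALUE);
        for (int i = 0; i < intCount; i++) {
            total += values.get(i);   // int index widened to long
        }
        // Second loop: long counter finishes off any remaining elements.
        for (long i = intCount; i < count; i++) {
            total += values.get(i);
        }
        return total;
    }
}
```

For arrays with fewer than 2^31 elements (the common case today), only the first loop runs, so the pre-existing performance profile is preserved.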
[jira] [Created] (ARROW-6197) [GLib] Add garrow_decimal128_rescale()
Sutou Kouhei created ARROW-6197:
-----------------------------------
             Summary: [GLib] Add garrow_decimal128_rescale()
                 Key: ARROW-6197
                 URL: https://issues.apache.org/jira/browse/ARROW-6197
             Project: Apache Arrow
          Issue Type: New Feature
          Components: GLib
            Reporter: Sutou Kouhei
            Assignee: Sutou Kouhei

--
This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [Format] Semantics for dictionary batches in streams
Reading data from two different parquet files sequentially with different dictionaries for the same column. This could be handled by re-encoding data, but that seems potentially sub-optimal.

On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau wrote:
> What situation are you anticipating where you're going to be restating ids
> mid stream?
>
> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield wrote:
>
>> The IPC specification [1] defines behavior when isDelta on a
>> DictionaryBatch [2] is "true". I might have missed it in the
>> specification, but I couldn't find the interpretation for what the
>> expected behavior is when isDelta=false and two dictionary batches with
>> the same ID are sent.
>>
>> It seems like there are two options:
>> 1. Interpret the new dictionary batch as replacing the old one.
>> 2. Regard this as an error condition.
>>
>> Based on the fact that in the "file format" dictionaries are allowed to
>> be placed in any order relative to the record batches, I assume it is the
>> second, but just wanted to make sure.
>>
>> Thanks,
>> Micah
>>
>> [1] https://arrow.apache.org/docs/ipc.html
>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
Re: [Discuss][Java] 64-bit lengths for ValueVectors
Hi Jacques,
What avenue were you thinking for supporting both paths? I didn't want to pursue a different class hierarchy, because I felt like that would effectively fork the code base, but that is potentially an option that would allow us to have a complete reference implementation in Java that can fully interact with C++, without major changes to this code.

For supporting both APIs on the same classes/interfaces, I think they roughly fall into three categories: changes to input parameters, changes to output parameters, and algorithm changes.

For inputs, changing from int to long is essentially a no-op from the compiler perspective. From the limited micro-benchmarking this also doesn't seem to have a performance impact. So we could keep two versions of the methods that only differ on inputs, but it is not clear what the value of that would be.

For outputs, we can't support methods "long getLength()" and "int getLength()" in the same class, so we would be forced into something like "long getLength(boolean dummy)", which I think is less desirable.

For algorithm changes, there did not appear to be too many places where we actually loop over all elements (it is quite possible I missed something here); the ones that I did find I was able to mitigate performance penalties as noted above. Some of the current implementation will get a lot slower for "large arrays", but we can likely fix those later or in this PR with a nested while loop instead of 2 for loops.

Thanks,
Micah

On Saturday, August 10, 2019, Jacques Nadeau wrote:

> This is a pretty massive change to the apis. I wonder how nasty it would
> be to just support both paths. Have you evaluated how complex that would be?
>
> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield wrote:
>
>> After more investigation, it looks like Float8Benchmarks at least on my
>> machine are within the range of noise.
>> For BitVectorHelper I pushed a new commit [1]; it seems to bring the
>> BitVectorHelper benchmarks back inline (and even with some improvement
>> for getNullCountBenchmark).
>>
>> Benchmark                                         Mode  Cnt   Score    Error  Units
>> BitVectorHelperBenchmarks.allBitsNullBenchmark    avgt    5   3.821 ±  0.031  ns/op
>> BitVectorHelperBenchmarks.getNullCountBenchmark   avgt    5  14.884 ±  0.141  ns/op
>>
>> I applied the same pattern to other loops that I could find, and for any
>> "for (long" loop on the critical path, I broke it up into two loops: the
>> first loop does iteration by integer, the second finishes off for any
>> long values. As a side note, it seems like optimization of loops using
>> long counters at least has a semi-recent open bug for the JVM [2].
>>
>> Thanks,
>> Micah
>>
>> [1] https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>
>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield wrote:
>>
>> > Indeed, the BoundChecking and CheckNullForGet variables can make a big
>> > difference. I didn't initially run the benchmarks with these turned on
>> > (you can see the result from above with Float8Benchmarks). Here are
>> > new numbers including with the flags enabled. It looks like using longs
>> > might be a little bit slower; I'll see what I can do to mitigate this.
>> >
>> > Ravindra also volunteered to try to benchmark the changes with Dremio's
>> > code on today's sync call.
>> >
>> > New
>> >
>> > Benchmark                                         Mode  Cnt   Score    Error  Units
>> > BitVectorHelperBenchmarks.allBitsNullBenchmark    avgt    5   4.176 ±  1.292  ns/op
>> > BitVectorHelperBenchmarks.getNullCountBenchmark   avgt    5  26.102 ±  0.700  ns/op
>> > Float8Benchmarks.copyFromBenchmark                avgt    5   7.398 ±  0.084  us/op
>> > Float8Benchmarks.readWriteBenchmark               avgt    5   2.711 ±  0.057  us/op
>> >
>> > old
>> >
>> > Benchmark                                         Mode  Cnt   Score    Error  Units
>> > BitVectorHelperBenchmarks.allBitsNullBenchmark    avgt    5   3.828 ±  0.030  ns/op
>> > BitVectorHelperBenchmarks.getNullCountBenchmark   avgt    5  20.611 ±  0.188  ns/op
>> > Float8Benchmarks.copyFromBenchmark                avgt    5   6.597 ±  0.462  us/op
>> > Float8Benchmarks.readWriteBenchmark               avgt    5   2.615 ±  0.027  us/op
>> >
>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya wrote:
>> >
>> >> Hi Gonzalo,
>> >>
>> >> Thanks for sharing the performance results.
>> >> I am wondering if you have turned off the flag
>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>> >> If not, the lower throughput should be expected.
>> >>
>> >> Best,
>> >> Liya Fan
>> >>
>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield wrote:
>> >>
>> >>> Hi Gonzalo,
>> >>> Thank you for the feedback. I wasn't aware of the JIT implications.
>> >>> At least on the benchmark run they don't seem to have an impact.
>> >>>
>> >>> If there are other benchmarks that people have that can validate if
>> >>> this change will be
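The "inputs" point in the email above (widening a parameter from int to long is essentially a no-op from the compiler perspective) is easy to demonstrate: existing call sites that pass int values keep compiling because Java widens int arguments to long implicitly. A tiny sketch with a hypothetical API, not the real Arrow allocator:

```java
// Hypothetical reserve() API whose parameter was widened from int to long.
// Pre-existing callers passing int values still compile unchanged.
class SketchAllocator {
    long lastRequested;

    void reserve(long bytes) {   // previously: void reserve(int bytes)
        lastRequested = bytes;
    }
}
```

A caller written against the old 32-bit signature, such as `alloc.reserve(1024)`, is source-compatible with the widened version; only callers that stored the old method's `int`-typed results (not inputs) need changes, which is why the thread treats outputs as the hard case.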
[jira] [Created] (ARROW-6196) [Ruby] Add support for building Arrow::TimeNNArray by .new
Sutou Kouhei created ARROW-6196:
-----------------------------------
             Summary: [Ruby] Add support for building Arrow::TimeNNArray by .new
                 Key: ARROW-6196
                 URL: https://issues.apache.org/jira/browse/ARROW-6196
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Ruby
            Reporter: Sutou Kouhei
            Assignee: Sutou Kouhei
Re: [Discuss][Java] 64-bit lengths for ValueVectors
This is a pretty massive change to the apis. I wonder how nasty it would be to just support both paths. Have you evaluated how complex that would be?

On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield wrote:

> After more investigation, it looks like Float8Benchmarks at least on my
> machine are within the range of noise.
>
> For BitVectorHelper I pushed a new commit [1]; it seems to bring the
> BitVectorHelper benchmarks back inline (and even with some improvement
> for getNullCountBenchmark).
>
> Benchmark                                         Mode  Cnt   Score    Error  Units
> BitVectorHelperBenchmarks.allBitsNullBenchmark    avgt    5   3.821 ±  0.031  ns/op
> BitVectorHelperBenchmarks.getNullCountBenchmark   avgt    5  14.884 ±  0.141  ns/op
>
> I applied the same pattern to other loops that I could find, and for any
> "for (long" loop on the critical path, I broke it up into two loops: the
> first loop does iteration by integer, the second finishes off for any
> long values. As a side note, it seems like optimization of loops using
> long counters at least has a semi-recent open bug for the JVM [2].
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>
> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield wrote:
>
> > Indeed, the BoundChecking and CheckNullForGet variables can make a big
> > difference. I didn't initially run the benchmarks with these turned on
> > (you can see the result from above with Float8Benchmarks). Here are
> > new numbers including with the flags enabled. It looks like using longs
> > might be a little bit slower; I'll see what I can do to mitigate this.
> >
> > Ravindra also volunteered to try to benchmark the changes with Dremio's
> > code on today's sync call.
> >
> > New
> >
> > Benchmark                                         Mode  Cnt   Score    Error  Units
> > BitVectorHelperBenchmarks.allBitsNullBenchmark    avgt    5   4.176 ±  1.292  ns/op
> > BitVectorHelperBenchmarks.getNullCountBenchmark   avgt    5  26.102 ±  0.700  ns/op
> > Float8Benchmarks.copyFromBenchmark                avgt    5   7.398 ±  0.084  us/op
> > Float8Benchmarks.readWriteBenchmark               avgt    5   2.711 ±  0.057  us/op
> >
> > old
> >
> > Benchmark                                         Mode  Cnt   Score    Error  Units
> > BitVectorHelperBenchmarks.allBitsNullBenchmark    avgt    5   3.828 ±  0.030  ns/op
> > BitVectorHelperBenchmarks.getNullCountBenchmark   avgt    5  20.611 ±  0.188  ns/op
> > Float8Benchmarks.copyFromBenchmark                avgt    5   6.597 ±  0.462  us/op
> > Float8Benchmarks.readWriteBenchmark               avgt    5   2.615 ±  0.027  us/op
> >
> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya wrote:
> >
> >> Hi Gonzalo,
> >>
> >> Thanks for sharing the performance results.
> >> I am wondering if you have turned off the flag
> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> >> If not, the lower throughput should be expected.
> >>
> >> Best,
> >> Liya Fan
> >>
> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield wrote:
> >>
> >>> Hi Gonzalo,
> >>> Thank you for the feedback. I wasn't aware of the JIT implications.
> >>> At least on the benchmark run they don't seem to have an impact.
> >>>
> >>> If there are other benchmarks that people have that can validate if
> >>> this change will be problematic I would appreciate trying to run them
> >>> with the PR. I will try to run the ones for zeroing/popcnt tonight to
> >>> see if there is a change in those.
> >>>
> >>> -Micah
> >>>
> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
> >>> golthir...@gmail.com> wrote:
> >>>
> >>> > I would recommend to take care with this kind of changes.
> >>> >
> >>> > I didn't try Arrow in more than one year, but by then the
> >>> > performance was quite bad in comparison with plain byte buffer
> >>> > access (see http://git.net/apache-arrow-development/msg02353.html *)
> >>> > and there are several optimizations that the JVM (specifically, C2)
> >>> > does not apply when dealing with longs instead of ints. Among the
> >>> > most common are loop unrolling and vectorization.
> >>> >
> >>> > * It doesn't seem the best way to reference an old email on the
> >>> > list, but it is the only result shown by Google
> >>> >
> >>> > On Wed., Aug 7, 2019 at 11:42, Fan Liya () wrote:
> >>> >
> >>> >> Hi Micah,
> >>> >>
> >>> >> Thanks for your effort. The performance result looks good.
> >>> >>
> >>> >> As you indicated, ArrowBuf will take an additional 12 bytes (4
> >>> >> bytes for each of length, write index, and read index).
> >>> >> Similar overheads also exist for vectors like BaseFixedWidthVector,
> >>> >> BaseVariableWidthVector, etc.
> >>> >>
> >>> >> IMO, such overheads are small enough to justify the change.
> >>> >> Let's check if there are other overheads.
> >>> >>
> >>> >> Best,
> >>> >> Liya Fan
> >>> >>
> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
Re: [Format] Semantics for dictionary batches in streams
What situation are you anticipating where you're going to be restating ids mid stream?

On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield wrote:
> The IPC specification [1] defines behavior when isDelta on a
> DictionaryBatch [2] is "true". I might have missed it in the
> specification, but I couldn't find the interpretation for what the
> expected behavior is when isDelta=false and two dictionary batches with
> the same ID are sent.
>
> It seems like there are two options:
> 1. Interpret the new dictionary batch as replacing the old one.
> 2. Regard this as an error condition.
>
> Based on the fact that in the "file format" dictionaries are allowed to
> be placed in any order relative to the record batches, I assume it is the
> second, but just wanted to make sure.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/ipc.html
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
[jira] [Created] (ARROW-6195) [C++] CMake fails with file not found error while bundling thrift if python is not installed
Omer Ozarslan created ARROW-6195:
------------------------------------
             Summary: [C++] CMake fails with file not found error while bundling thrift if python is not installed
                 Key: ARROW-6195
                 URL: https://issues.apache.org/jira/browse/ARROW-6195
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Omer Ozarslan

I had this error message while I was trying to reproduce another issue in docker. To reproduce:

```
FROM debian:buster
RUN apt-get update
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y git build-essential cmake
WORKDIR /app
RUN git clone https://github.com/apache/arrow.git
RUN git checkout 167cea0 # HEAD as of 10-Aug-19
WORKDIR /app/arrow/cpp/build
RUN cmake -DARROW_PARQUET=ON -DARROW_DEPENDENCY_SOURCE=BUNDLED ..
RUN cmake --build . --target thrift_ep -j 8
```

Relevant part of output:

```
Scanning dependencies of target thrift_ep
[ 66%] Creating directories for 'thrift_ep'
[ 66%] Performing download step (verify and extract) for 'thrift_ep'
CMake Error at thrift_ep-stamp/verify-thrift_ep.cmake:11 (message):
  File not found: /thrift/0.12.0/thrift-0.12.0.tar.gz
make[3]: *** [CMakeFiles/thrift_ep.dir/build.make:90: thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-download] Error 1
make[2]: *** [CMakeFiles/Makefile2:916: CMakeFiles/thrift_ep.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:928: CMakeFiles/thrift_ep.dir/rule] Error 2
make: *** [Makefile:487: thrift_ep] Error 2
```

Installing python fixes the problem, but this isn't directly clear from the error message. The source of the issue is that execute_process in the get_apache_mirrors macro silently fails and returns an empty APACHE_MIRROR value since PYTHON_EXECUTABLE was empty.
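Since the report states that installing python fixes the problem (the get_apache_mirrors macro shells out to python to resolve an Apache mirror URL), one workaround for the reproduction image is a single extra line before the configure step. This is a sketch, not an official fix: the package name `python` is assumed for debian:buster, and the underlying issue of CMake silently tolerating the failed execute_process remains.

```
# Workaround sketch: make a python interpreter available so
# get_apache_mirrors can compute APACHE_MIRROR during configure.
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y python
```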
[jira] [Created] (ARROW-6194) [Java] Make DictionaryEncoder non-static making it easy to extend and reuse
Ji Liu created ARROW-6194:
-----------------------------
             Summary: [Java] Make DictionaryEncoder non-static making it easy to extend and reuse
                 Key: ARROW-6194
                 URL: https://issues.apache.org/jira/browse/ARROW-6194
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Java
            Reporter: Ji Liu
            Assignee: Ji Liu

As discussed in https://github.com/apache/arrow/pull/4994, the current static DictionaryEncoder has some limitations for extension and reuse. Slightly change the APIs and migrate the static methods to an object-based approach.
Re: [Format] Semantics for dictionary batches in streams
I should add that Option #1 above would be my preference, even though it adds some complications (especially for the file format).

On Sat, Aug 10, 2019 at 12:12 AM Micah Kornfield wrote:
> The IPC specification [1] defines behavior when isDelta on a
> DictionaryBatch [2] is "true". I might have missed it in the
> specification, but I couldn't find the interpretation for what the
> expected behavior is when isDelta=false and two dictionary batches with
> the same ID are sent.
>
> It seems like there are two options:
> 1. Interpret the new dictionary batch as replacing the old one.
> 2. Regard this as an error condition.
>
> Based on the fact that in the "file format" dictionaries are allowed to
> be placed in any order relative to the record batches, I assume it is the
> second, but just wanted to make sure.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/ipc.html
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
[Format] Semantics for dictionary batches in streams
The IPC specification [1] defines behavior when isDelta on a DictionaryBatch [2] is "true". I might have missed it in the specification, but I couldn't find the interpretation for what the expected behavior is when isDelta=false and two dictionary batches with the same ID are sent.

It seems like there are two options:
1. Interpret the new dictionary batch as replacing the old one.
2. Regard this as an error condition.

Based on the fact that in the "file format" dictionaries are allowed to be placed in any order relative to the record batches, I assume it is the second, but just wanted to make sure.

Thanks,
Micah

[1] https://arrow.apache.org/docs/ipc.html
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
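The two options above can be made concrete with a small sketch of a stream reader's dictionary bookkeeping. Plain string lists stand in for real Arrow dictionary batches, and the types here are hypothetical, not the Arrow Java API; under option 1, a repeated id with isDelta=false replaces the prior dictionary, whereas option 2 would raise an error at that point (shown as a comment):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-stream dictionary state, keyed by dictionary id.
class DictionaryTracker {
    final Map<Long, List<String>> dictionaries = new HashMap<>();

    void onDictionaryBatch(long id, boolean isDelta, List<String> values) {
        if (isDelta) {
            // isDelta=true (defined by the spec): append to the existing
            // dictionary, creating it if this is the first batch for the id.
            dictionaries.computeIfAbsent(id, k -> new ArrayList<>())
                        .addAll(values);
        } else {
            // isDelta=false with a previously seen id: option 1, replace
            // the dictionary wholesale. Option 2 would instead do:
            //   if (dictionaries.containsKey(id)) throw new IllegalStateException(...);
            dictionaries.put(id, new ArrayList<>(values));
        }
    }
}
```

Record batches arriving between dictionary batches would resolve indices against whatever state the tracker holds at that point, which is why replacement semantics complicate the file format, where dictionary placement relative to record batches is unordered.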