[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-10-11 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-706836000


   > > The logic might warrant its own interface instead of a std::Function.
   > 
   > I believe we already have this functionality in Gandiva. Let's see if we 
can use a consistent pattern (whether that is use what Gandiva has or make sure 
Gandiva uses whatever is proposed here--either way).
   
   Hi @jacques-n , I already noticed that Gandiva has a JavaResizableBuffer 
implementation for resizing Java-allocated buffers from C++ via JNI. Did you 
mean this one by the comment?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-09-28 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-700390598


   Hi @Sebastiaan-Alvarez-Rodriguez ,
   
   First of all, thanks for the detailed explanation about the issue and your 
suggestions. Sorry for the late reply.
   
   Actually I don't mind to change approach from `System.load` to 
`System.loadLibrary` as long as it too works fine or even better. I am just 
worried about why `getResourceAsStream` returns null in your case.
   
   Just saw your practice on compiling Java code:
   
   > cd /java/dataset
   > mvn clean install -Dmaven.test.skip=true # skipping tests to let build 
succeed
   
   Have you checked if arrow_dataset_jni.so was successfully installed within 
the jar `arrow-dataset-XXX.jar`? If not, specifying `arrow.cpp.build.dir` may 
helps.
   
   > Maybe it is a good idea to pick one of the following:
   > 1. Change the 'java.io.FileNotFoundException: libarrow_dataset_jni.so' to 
display the full path (so people know where your library > looks to find the 
shared object)
   > 2. Make a small JNI_dataset_dev_install_note.md somewhere on getting this 
library to work.
   > 3. Use System.loadLibrary(libraryToLoad);
   > 4. Use some other way to find the location of a shared library and copy it 
to a tmpfile
   
   I did a little try and it appears that `System.loadLibrary` requires for 
property `java.library.path` to be set to include library locations. If we 
change to use this way, do users have to do this every time before getting 
started?
   
   I may intend to pick option 1 here. While we should iterate on multiple 
resource path roots and display several possible paths to let users find in. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-08-23 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-678839259


   @emkornfield Sorry for the late reply. And yes I was planing to try adding a 
 `Bits.java` based implementation to this PR, but I may not be able to working 
on it instantly as I am occupied by some other projects these days. Also I 
suppose if we want to consider block level memory reservation then we should 
dive into jemalloc and use some thing like a hook API right? That will appear 
to take me sometime as I am new to jemalloc :)
   
   If we'd like to see the feature in a release quickly maybe we can possibly 
resolve this one in advance and I could continue working on the above in a 
separated JIRA ticket. Either way is fine to me.
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-07-20 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-660927861


   Sorry I am reconsidering whether to use JVM direct memory as a limit of 
native buffer allocation. By Oracle's explanation[1] about direct memory, 
direct memory seems only to limit direct byte buffer allocations. As the 
discussion has lasted for days I'll send a mail for this topic. Hopefully we 
can make a conclusion :)
   
   [1] https://docs.oracle.com/cd/E15289_01/JRCLR/optionxx.htm#BABGCFFB



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-07-15 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-658605439


   > I was suggesting a third approach: C++ memory pool that simply updates the 
direct memory info via Bits in java. It should be done chunk-wise to avoid 
excessive JNI boundary calls.
   
   It looks a little bit more complicated: IIUC Netty doesn't seem to reserve 
from Bits internally, it has a individual counter that is limited to max direct 
mem amount. If we use Bits then we'll be able to exceed the overall usage by 
using up to 2x amount. My suggestion is to maintain our own mechanism to 
reserve Bits for every byte Java allocator managed, then ignore Netty's own 
direct mem counter. This way no JNI overhead will be brought. Would you think 
of this a feasible direction?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-07-15 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-658593410


   @jacques-n 
   
   Addressing a previous comment:
   
   > We need to avoid this entirely. If you need some functionality, let's 
figure out what should be exposed. BaseAllocator should not be referenced.
   
   I've opened [ARROW-9475](https://issues.apache.org/jira/browse/ARROW-9475) 
and #7768 . We can eliminate package-hackings in this PR by solving that one 
first.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-07-05 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-654004147


   @jacques-n 
   
   I think we have several choices: we can try implement a c++ memory pool 
which is backed by a Java Allocator, actually I have a PoC branch for that: 
https://github.com/zhztheplayer/arrow-1/tree/ARROW-7808-wip-bridged-allocator.
   
   Or we can continue hacking Java API to reserve direct memory after buffer is 
registered. The reservation should happen inside Netty API because Netty has 
its own direct memory counter: 
https://github.com/netty/netty/blob/cbe238a95bab238455e3ff1849d450a9836f39ef/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L749-L766
   
   Which one do you suggest or maybe there's a better one than both? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-07-05 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-653997609


   @rymurr You are right, schemas are not frequently transferred over JNI so 
it's more straightforward to use byte arrays in development. I think your 
suggestion makes sense hence we can try optimize a bit on this behaviour (maybe 
in following tickets? not sure).
   
   I'll remove the schema files and put schemas in code.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-06-29 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-651422684


   Thanks for the comments! I've got some stuffs to deal with these days. Will 
address as soon as possible.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-06-15 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-644494295


   @fsaintjacques Rebased now. 
   
   As we have a unresolved 
[discussion](https://github.com/apache/arrow/pull/7030#discussion_r440075959) 
about memory management of native buffers. Should we clarify a bit more before 
writing any fixes?
   
   May first question is what's the major problem now? I list some that I can 
see in comments:
   
   1. We are package-hacking `org.apache.arrow.memory` in module 
`arrow-dataset`.
   Yes `Ownerships.java` uses some hacks, so I've removed the class in latest 
commits (the whole class is not necessary for current use case). Now only 
`NativeUnderlyingMemory.java` is still in the package 
`org.apache.arrow.memory`. 
   2. We lose java native heap caps.
   Can we clarify this? I think the explanation can be either we lose caps from 
JVM direct memory, or we lose caps from BufferAllocator. For the latter one I 
think buffers are already registered to allocator.
   3. OOM-check is proceed after buffer is created.
   Yes. But is this really a big problem? see 
[comment](https://github.com/apache/arrow/pull/7030#discussion_r440075959).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-06-10 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-641936183


   Hi @fsaintjacques, @emkornfield  it would be much appreciated if you would 
like to continue reviewing this, here is a list for modifications since latest 
review comment:
   
   - code rebase
   - removed SingleFileDataset(Factory)
   - added API `FileSystemDatasetFactory::Make(
   std::string uri, std::shared_ptr format,
   FileSystemFactoryOptions options)` 
   - other minor changes (see commits after 6f4a914)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-05-19 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-630624723


   Hi @emkornfield @wesm, comments addressed and filter implementation is 
removed from this PR (I suppose we expect a separate topic about it). Any 
further changes needed please kindly let me know.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-04-28 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-620528808


   Hi @wesm @emkornfield , based on the discussion in the [latest 
commit](https://github.com/apache/arrow/pull/7030/commits/7b16a41d0f9f3265f00c92ce50304b68ee03315f)
 I removed some of the JNI mappings of c++ components. Now only 
Dataset(Factory) and Scanner are there to be bridged. Would you think of this a 
feasible direction?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org