[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-706836000 > > The logic might warrant its own interface instead of a std::Function. > > I believe we already have this functionality in Gandiva. Let's see if we can use a consistent pattern (whether that is use what Gandiva has or make sure Gandiva uses whatever is proposed here--either way). Hi @jacques-n , I already noticed that Gandiva has a JavaResizableBuffer implementation for resizing Java-allocated buffers from C++ via JNI. Did you mean this one by the comment? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-700390598 Hi @Sebastiaan-Alvarez-Rodriguez , First of all, thanks for the detailed explanation about the issue and your suggestions. Sorry for the late reply. Actually I don't mind to change approach from `System.load` to `System.loadLibrary` as long as it too works fine or even better. I am just worried about why `getResourceAsStream` returns null in your case. Just saw your practice on compiling Java code: > cd /java/dataset > mvn clean install -Dmaven.test.skip=true # skipping tests to let build succeed Have you checked if arrow_dataset_jni.so was successfully installed within the jar `arrow-dataset-XXX.jar`? If not, specifying `arrow.cpp.build.dir` may helps. > Maybe it is a good idea to pick one of the following: > 1. Change the 'java.io.FileNotFoundException: libarrow_dataset_jni.so' to display the full path (so people know where your library > looks to find the shared object) > 2. Make a small JNI_dataset_dev_install_note.md somewhere on getting this library to work. > 3. Use System.loadLibrary(libraryToLoad); > 4. Use some other way to find the location of a shared library and copy it to a tmpfile I did a little try and it appears that `System.loadLibrary` requires for property `java.library.path` to be set to include library locations. If we change to use this way, do users have to do this every time before getting started? I may intend to pick option 1 here. While we should iterate on multiple resource path roots and display several possible paths to let users find in. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-678839259 @emkornfield Sorry for the late reply. And yes I was planing to try adding a `Bits.java` based implementation to this PR, but I may not be able to working on it instantly as I am occupied by some other projects these days. Also I suppose if we want to consider block level memory reservation then we should dive into jemalloc and use some thing like a hook API right? That will appear to take me sometime as I am new to jemalloc :) If we'd like to see the feature in a release quickly maybe we can possibly resolve this one in advance and I could continue working on the above in a separated JIRA ticket. Either way is fine to me. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-660927861 Sorry I am reconsidering whether to use JVM direct memory as a limit of native buffer allocation. By Oracle's explanation[1] about direct memory, direct memory seems only to limit direct byte buffer allocations. As the discussion has lasted for days I'll send a mail for this topic. Hopefully we can make a conclusion :) [1] https://docs.oracle.com/cd/E15289_01/JRCLR/optionxx.htm#BABGCFFB This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-658605439 > I was suggesting a third approach: C++ memory pool that simply updates the direct memory info via Bits in java. It should be done chunk-wise to avoid excessive JNI boundary calls. It looks a little bit more complicated: IIUC Netty doesn't seem to reserve from Bits internally, it has a individual counter that is limited to max direct mem amount. If we use Bits then we'll be able to exceed the overall usage by using up to 2x amount. My suggestion is to maintain our own mechanism to reserve Bits for every byte Java allocator managed, then ignore Netty's own direct mem counter. This way no JNI overhead will be brought. Would you think of this a feasible direction? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-658593410 @jacques-n Addressing a previous comment: > We need to avoid this entirely. If you need some functionality, let's figure out what should be exposed. BaseAllocator should not be referenced. I've opened [ARROW-9475](https://issues.apache.org/jira/browse/ARROW-9475) and #7768 . We can eliminate package-hackings in this PR by solving that one first. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-654004147 @jacques-n I think we have several choices: we can try implement a c++ memory pool which is backed by a Java Allocator, actually I have a PoC branch for that: https://github.com/zhztheplayer/arrow-1/tree/ARROW-7808-wip-bridged-allocator. Or we can continue hacking Java API to reserve direct memory after buffer is registered. The reservation should happen inside Netty API because Netty has its own direct memory counter: https://github.com/netty/netty/blob/cbe238a95bab238455e3ff1849d450a9836f39ef/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L749-L766 Which one do you suggest or maybe there's a better one than both? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-653997609 @rymurr You are right, schemas are not frequently transferred over JNI so it's more straightforward to use byte arrays in development. I think your suggestion makes sense hence we can try optimize a bit on this behaviour (maybe in following tickets? not sure). I'll remove the schema files and put schemas in code. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-651422684 Thanks for the comments! I've got some stuffs to deal with these days. Will address as soon as possible. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-644494295 @fsaintjacques Rebased now. As we have a unresolved [discussion](https://github.com/apache/arrow/pull/7030#discussion_r440075959) about memory management of native buffers. Should we clarify a bit more before writing any fixes? May first question is what's the major problem now? I list some that I can see in comments: 1. We are package-hacking `org.apache.arrow.memory` in module `arrow-dataset`. Yes `Ownerships.java` uses some hacks, so I've removed the class in latest commits (the whole class is not necessary for current use case). Now only `NativeUnderlyingMemory.java` is still in the package `org.apache.arrow.memory`. 2. We lose java native heap caps. Can we clarify this? I think the explanation can be either we lose caps from JVM direct memory, or we lose caps from BufferAllocator. For the latter one I think buffers are already registered to allocator. 3. OOM-check is proceed after buffer is created. Yes. But is this really a big problem? see [comment](https://github.com/apache/arrow/pull/7030#discussion_r440075959). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-641936183 Hi @fsaintjacques, @emkornfield it would be much appreciated if you would like to continue reviewing this, here is a list for modifications since latest review comment: - code rebase - removed SingleFileDataset(Factory) - added API `FileSystemDatasetFactory::Make( std::string uri, std::shared_ptr format, FileSystemFactoryOptions options)` - other minor changes (see commits after 6f4a914) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-630624723 Hi @emkornfield @wesm, comments addressed and filter implementation is removed from this PR (I suppose we expect a separate topic about it). Any further changes needed please kindly let me know. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-620528808 Hi @wesm @emkornfield , based on the discussion in the [latest commit](https://github.com/apache/arrow/pull/7030/commits/7b16a41d0f9f3265f00c92ce50304b68ee03315f) I removed some of the JNI mappings of c++ components. Now only Dataset(Factory) and Scanner are there to be bridged. Would you think of this a feasible direction? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org