Will try to get the example data and code out - there's a lot of internal logic at the moment.
The commit we're using is e402f61aceb64f659845bdc5f03cf4f29797277b

- Andy

On Mon, Jun 29, 2020 at 9:29 AM Jon Malkin <jon.mal...@gmail.com> wrote:

> You mean you were calling the java library from python? Our testing
> has generally shown C++ to be faster.
>
> This is still too vague for me to be able to say much. There's no specific
> git version (tag or hash), no code, and no data.
>
> jon
>
> On Mon, Jun 29, 2020 at 9:08 AM Andy Dang <nam...@gmail.com> wrote:
>
>> I was using the Git version and was running with various sketches. I
>> thought the slowness was from Python, but I was able to scan through the
>> same data calculating the same statistics with the Java library in roughly
>> 3 minutes.
>>
>> Any idea why there's such a big difference between the two languages?
>>
>> - Andy
>>
>> On Fri, Jun 26, 2020, 21:02 Jon Malkin <jmal...@apache.org> wrote:
>>
>>> I haven't done long running python tests recently but I haven't seen
>>> that.
>>>
>>> Are you using a release version of the library or did you check out
>>> from git? And which sketch or sketches are you using?
>>>
>>> I've compiled the library in debug mode (gotta modify setup.py to force
>>> that) and run python via gdb but that's not gonna work nicely on 1.6gb of
>>> data. It's sloooooooowwwwwww.
>>>
>>> jon
>>>
>>>
>>> On Fri, Jun 26, 2020, 4:39 PM Andy Dang <nam...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I've been trying to integrate Datasketches into our ecosystem - really
>>>> great work!
>>>>
>>>> However, when I tried to run various sketches with the lending club
>>>> data from Kaggle (1.6GB in size) on the raw CSV data in Python on my MacOS,
>>>> I noticed after a while that the process will crash with a mysterious
>>>> segfault on my Mac OS (Catalina).
>>>>
>>>> My Clang version:
>>>>
>>>> ➜ Workspace c++ --version
>>>> Apple clang version 11.0.0 (clang-1100.0.33.17)
>>>> Target: x86_64-apple-darwin19.5.0
>>>> Thread model: posix
>>>> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
>>>>
>>>> ➜ Workspace gcc --version
>>>> Configured with: --prefix=/Library/Developer/CommandLineTools/usr
>>>> --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
>>>> Apple clang version 11.0.0 (clang-1100.0.33.17)
>>>> Target: x86_64-apple-darwin19.5.0
>>>> Thread model: posix
>>>> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
>>>>
>>>> Replacing this with the Miniconda cxx toolchain solves the problem.
>>>>
>>>> I'll get a script along with the data for reproducibility, but before
>>>> that I wonder if anyone has come across this issue before?
>>>>
>>>> Cheers!
>>>> - Andy