high memory usage with large number of HashSets. 3X more memory than Python

2021-12-25 Thread cblake
With an inverted index, your keys will generally be `string` or similar, not `int`, since a document generally contains search terms or the like, which are strings, not numbers. So, with `seq` rather than `HashSet` for posting lists, there should be no need for an integer-keyed hash at all.
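
A minimal sketch of the layout described above, assuming whitespace tokenization and document ids equal to positions in the input; `buildIndex` is a hypothetical name, not code from the thread:

```nim
import std/[tables, strutils]

# Hypothetical helper: map each term to the ascending list of ids of the
# documents that contain it. `docs[i]` is the text of document `i`.
proc buildIndex(docs: openArray[string]): Table[string, seq[int]] =
  result = initTable[string, seq[int]]()
  for docId, text in docs:
    for term in text.toLowerAscii.splitWhitespace:
      if term notin result:
        result[term] = @[docId]
      elif result[term][^1] != docId:
        # Ids arrive in ascending order, so checking the last entry is
        # enough to avoid duplicates within one document.
        result[term].add docId
```

Because each posting list stays sorted, no integer-keyed hash table is involved at any point.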

2021-12-25 Thread Gtriangle
@Araq's warning was definitely justified: when I compiled with the `nimIntHash1` flag the code was **many** times faster, but I ran into KeyErrors when looking for document ids in the reverse index tables. I'll try changing all my ints to distinct ints and see what happens! > FWIW, there was an

2021-12-24 Thread cblake
Ok..Here you go:

2021-12-24 Thread Pyautogui
I would be interested!

2021-12-24 Thread cblake
FWIW, there was an [article early this year](https://bart.degoe.de/building-a-full-text-search-engine-150-lines-of-code/) whose poor engineering annoyed me. That annoyance inspired me to re-do the impl in Nim (in 114 lines and already faster/more memory conservative as well as briefer) and

2021-12-24 Thread Gtriangle
Wow, amazing. Thank you @Araq, @ElegantBeef and @cblake. With Nim it feels like I got the keys to a Lamborghini :) @cblake: You're right, this part of the code is used to generate an inverted index. I then calculate the Jaccard index between all pairs of 'docs' to find similar documents for a
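
For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. A small sketch using `std/sets`; the `jaccard` name and the convention of returning 1.0 for two empty sets are assumptions, not taken from the poster's code:

```nim
import std/sets

# Jaccard index: |A ∩ B| / |A ∪ B|.
proc jaccard(a, b: HashSet[int]): float =
  let inter = (a * b).len            # `*` on HashSet is set intersection
  let uni = a.len + b.len - inter    # inclusion-exclusion, avoids building the union
  if uni == 0: 1.0                   # assumed convention for two empty sets
  else: inter / uni                  # `/` on ints yields a float in Nim
```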

2021-12-24 Thread cblake
Follow-up: also you should use `--gc:arc` if you are not already (sounds like not). Just with stdlib Table & HashSet (with init size hint of `3`), I get 2.57 sec, 1313 MB with the default gc, 1.73 sec, 843 MB with `--gc:arc`, then adding `-d:nimIntHash1` this falls to 1.08 sec, 780 MB. Then with

2021-12-24 Thread cblake
You can test/implement @Araq's idea more simply by defining `nimIntHash1`, which does the identity hash - perhaps useful in combination with @ElegantBeef's suggestion. (Just `nim c -d:nimIntHash1`, or put the define in a `foo.nim.cfg` or `foo.nims` file.) Also, there is a somewhat more full
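
As a sketch of the config-file route, assuming the main module is `foo.nim` and the NimScript file sits next to it, the define could be set like this:

```nim
# foo.nims (NimScript config picked up when compiling foo.nim)
switch("define", "nimIntHash1")  # same effect as -d:nimIntHash1 on the command line
```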

2021-12-24 Thread xbello
`intsets` uses 2.7 GB. Slightly faster than HashSet(10) on my machine: 3.10 seconds vs 3.4.

2021-12-24 Thread ynfle
Besides @ElegantBeef's suggestions, you could also try `std/intsets`.
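
A rough sketch of what swapping in `std/intsets` could look like for a table of integer sets; `addPosting` is a hypothetical helper, and whether this saves memory depends on the data:

```nim
import std/[intsets, tables]

# Hypothetical helper: add `value` to the IntSet stored under `key`,
# creating the set on first use.
proc addPosting(index: var Table[int, IntSet]; key, value: int) =
  if key notin index:
    index[key] = initIntSet()
  index[key].incl value
```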

2021-12-23 Thread Araq
Ah, don't listen to me, listen to @ElegantBeef.

2021-12-23 Thread ElegantBeef
`initHashSet` takes the initial size as a parameter, as can be seen [here](https://nim-lang.org/docs/sets.html#initHashSet). By default it's `64`, so just by doing `initHashSet[int](10)` we get the memory usage down to 1.6GB and it gets much faster thanks to not allocating as much (at least on my
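
A scaled-down sketch of the size-hint idea, assuming a table of small random sets like the one in the original question; `buildTable`, its parameters, and the fixed seed are illustrative, and the loop assumes `maxValue + 1 >= perSet`:

```nim
import std/[tables, sets, random]

# Hypothetical reconstruction: `numKeys` sets of `perSet` distinct random
# integers in 0..maxValue, each set created with a small initial size hint.
proc buildTable(numKeys, perSet, maxValue: int): Table[int, HashSet[int]] =
  result = initTable[int, HashSet[int]]()
  var rng = initRand(42)                   # fixed seed for repeatable runs
  for key in 0 ..< numKeys:
    var s = initHashSet[int](perSet)       # hint instead of the default size
    while s.len < perSet:
      s.incl rng.rand(maxValue)            # rand(max) returns a value in 0..max
    result[key] = s
```

Calling `buildTable(2_500_000, 10, 115_000)` would approximate the sizes mentioned in the thread; the timings and memory figures quoted by the posters are theirs and are not reproduced here.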

2021-12-23 Thread Araq
Your port has a bug, you need to use `0..<10` and not `0..10` etc. As for the memory consumption ... I don't know, every hash table implementation has its problems. Probably all you need to do is to override the hash for integers, but beware, you need to use `distinct int` for the type. Since
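
Both points can be illustrated with a short sketch. `DocId` is a hypothetical type name, and the identity hash is just one possible override; it can behave poorly when keys are not well spread out:

```nim
import std/[hashes, sets]

type DocId = distinct int                  # hypothetical name for the key type

# A `distinct int` does not inherit `hash` or `==`, so both are supplied here.
proc hash(x: DocId): Hash = Hash(int(x))   # identity hash (an assumption)
proc `==`(a, b: DocId): bool {.borrow.}

var s = initHashSet[DocId](10)
for i in 0 ..< 10:                         # 0 ..< 10 yields ten values; 0 .. 10 yields eleven
  s.incl DocId(i)
```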

2021-12-23 Thread Gtriangle
Hi, I'm really loving Nim but ran into a strange issue today when porting some Python code. I have some simple code that builds a table, mapping an int to a `HashSet[int]`. The table has about 2.5 million entries. Each HashSet contains 10 random integers (max integer value being 115000). I was