Re: [racket-users] Re: appending files

2016-01-29 Thread Brandon Thomas
On Fri, 2016-01-29 at 13:00 -0800, Scotty C wrote:
> ok, had time to run my hash on my one test file
> '(611 1 1 19 24783208 4.19)
> this means, in order:
> # of buckets
> % of buckets empty
> fewest keys in any non-empty bucket
> most keys in any non-empty bucket
> total number of keys
> average number of keys per non-empty bucket
> 
> it took 377 sec.
> original # records is 26570359 so 6.7% dupes.
> processing rate is 70478/sec
> 
> my plan right now is to rework my current hash so that it runs byte
> strings instead of bignums. it will probably be tomorrow afternoon
> before i have more stats.
> 
> question for you all. right now i use modulo on my bignums. i know i
> can't do that to a byte string. i'll figure something out. if any of
> you know how to do this, can you post a method?
> 

I'm not sure what you're asking exactly. Because a byte string is an
array, all you need to do is jump to the byte your information is in
and then pull your bit out of it. So let's say you want to get the nth
bit of an array of 8-bit bytes. First you calculate which byte the bit
is in, which is simply (bytes-ref data (quotient n 8)). Then you get
the bit out of that byte, which would be something (untested) like
(modulo (arithmetic-shift (bytes-ref data (quotient n 8))
(- (modulo n 8))) 2). This is easy enough to modify for elements whose
size evenly divides 8.

Regards,
Brandon Thomas



Re: [racket-users] Re: appending files

2016-01-28 Thread Brandon Thomas
On Thu, 2016-01-28 at 20:32 -0800, Scotty C wrote:
> > I think you understand perfectly.
> i'm coming around
> 
> > You said the keys are 128-bit (16 byte) values.  You can store one
> > key
> > directly in a byte string of length 16.
> yup
> 
> > So instead of using a vector of pointers to individual byte
> > strings,
> > you would allocate a single byte string of length 
> >    {buckets x chain length x 16} 
> > and index it directly as if it were a 3-dimensional array [using
> > offset calculations as you would in C].
> i'm going to actually run my code later (probably in the morning).
> and then grab some stats from the hash. take a look at the stats.
> give me some thoughts on the above implementation after looking at
> the stats.
> 
> > Can you explain why the bignums are important.  For a simple task
> > like
> > filtering a file, they would seem to be both a waste of memory and
> > a
> > performance drain wrt storing the key values as byte strings.
> ok, true story. about 10 years ago i'm a student taking an intro to
> ai class we have been assigned the 3x3 tile puzzle and 2 algorithms,
> ida* and a*. also, try it on a 4x4. so i whip these out and am
> appalled by how slow the fn things are but intrigued by what they can
> do and i'm just a puzzle kind of guy. btw, the 4x4 wasn't solvable by
> my stuff at the time. so i head to the chief's office with my little
> bit of code. tell him it's way too slow. what do you think? he takes 5
> seconds to peruse the item and says "you're probably making too much
> memory". side note, later i graded this class for that prof and the
> same project. not a single student, including the ms and phd types,
> did anything better than a vector of vectors of ints. so back to me,
> i'm thinking, how do i cram 4 bits together so that there is
> absolutely no wasted space. i start digging through the documentation
> and i find the bitwise stuff. i see arithmetic shift and having no
> idea whether it will work or not i type into the interface
> (arithmetic-shift 1 100). if we blow up, well we blow up big but
> we didn't. i flipped everything in my project to bignum at that
> point. the bignum stuff is massively faster than vector of vectors of
> ints and faster than just a single vector of ints. lots of bignum is
> easy to implement, like the hash. making a child state, not so much.
> i'm not married to bignums despite all this.
> 
> > I've been programming for over 30 years, professionally for 23.
> i was a programmer. dot com bought us as a brick and mortar back in
> '99 and the whole shebang blew up 2 years later. idiots. anyway, been
> a stay at home dad for the most part since then.
> 
> > have an MS in Computer Science
> me too. that's the other piece of the most part from up above.
> 
> > (current-memory-use)
> yup, tried that a while back didn't like what i saw. check this out:
> 
> > (current-memory-use)
> 581753864
> > (current-memory-use)
> 586242568
> > (current-memory-use)
> 591181736
> > (current-memory-use)
> 595527064
> 
> does that work for you?
> 

For whatever you're storing, I still suggest using a disk-based
structure (preferably one that's already optimised and built for you).
I've done a bit of work on cache-aware algorithms, where reducing the
memory footprint is a really big win (along with the memory juggling).
Yes, if you try to store something that takes only a few bits and put
each one into its own integer, you'll have wasted space. In theory, you
could use a bignum and only shift it as many bits as you need, which is
what you have done. The issue with that is that bignums carry extra
overhead that's necessary for them to do arithmetic. Obviously, there
has to be a way to match or beat bignums with primitive structures,
since bignums are themselves implemented with primitive structures. So,
if you want to beat a bignum for storage, you'll want some contiguous
memory with fixed-size elements (like byte strings, or arrays of
uint32_t's in C), using bit manipulation on each byte so that multiple
values are packed into each one, directly beside each other bitwise,
just like a bignum has, but without its overhead.
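As a concrete (untested) sketch of that, here is 4-bit packing into a
byte string, two values per byte; the 4-bit element size and the names
are just illustrative assumptions:

    #lang racket

    ;; Allocate storage for n 4-bit slots, all zeroed.
    (define (make-nibbles n)
      (make-bytes (quotient (+ n 1) 2) 0))

    ;; Read the ith 4-bit value: even indices live in the low nibble,
    ;; odd indices in the high nibble of the same byte.
    (define (nibble-ref bs i)
      (define b (bytes-ref bs (quotient i 2)))
      (if (even? i)
          (bitwise-and b #x0F)
          (arithmetic-shift b -4)))

    ;; Write v (0..15) into the ith 4-bit slot, leaving its neighbour
    ;; in the same byte untouched.
    (define (nibble-set! bs i v)
      (define j (quotient i 2))
      (define b (bytes-ref bs j))
      (bytes-set! bs j
                  (if (even? i)
                      (bitwise-ior (bitwise-and b #xF0) v)
                      (bitwise-ior (bitwise-and b #x0F)
                                   (arithmetic-shift v 4)))))

That gives the contiguous, no-wasted-space layout without any bignum
arithmetic in the way.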

Regards,
Brandon Thomas



Re: [racket-users] Re: appending files

2016-01-27 Thread Brandon Thomas
On Wed, 2016-01-27 at 17:49 -0500, George Neuner wrote:
> On 1/27/2016 10:50 AM, Brandon Thomas wrote:
> > On Wed, 2016-01-27 at 04:01 -0500, George Neuner wrote:
> > > On Tue, 26 Jan 2016 23:00:01 -0500, Brandon Thomas
> > > <bthoma...@gmail.com> wrote:
> > > 
> > > > Is there anything stopping you from restructuring
> > > > the data on disk and using the hash directly from there
> > > 
> > > Scotty's hash table is much larger than he thinks it is and very
> > > likely is being paged to disk already.  Deliberately implementing
> > > it
> > > as a disk file is unlikely to improve anything.
> > 
> > That's a good point to keep in mind. But there are advantages,
> > including faster startup time, using less RAM+swap, making it easier
> > to keep the file updated, and making resizing easier to implement.
> > There are probably more, but basically these are the reasons why all
> > the large (key,value) storage solutions I've heard of use an explicit
> > file instead of swap.
> 
> I miscalculated the scale of Scotty's hash structure - it's not as
> bad 
> as I thought initially.  But even so, it is of a scale where it is 
> unwieldy and bound to have virtual memory problems unless the machine
> is 
> dedicated.
> 
> Hashing is latency sensitive - it was designed to be a memory
> resident 
> technique.  Obviously it _can_ be done using file based buckets ...
> the 
> effect is of querying an ISAM database for every access.  The problem
> is 
> that the latency increases by orders of magnitude: even from
> resident 
> blocks, every access involves the file system API and the
> kernel.  You 
> did mention (user space) caching, but that greatly complicates the
> solution.
> Making the hash external I think is not a win - it definitely will 
> handle much bigger files, but it will handle every file more
> slowly.  I 
> think it is better to leverage the file system rather than fight it.
> 
> YMMV,
> George

Yup, the canonical solutions for storing large amounts of data scalably
are rather complicated and are real work to implement (I wouldn't say
unmanageable, though). And they get much harder if you want, say, ACID
properties. This is why that work has been factored into libraries,
Redis for example. If it can fit into RAM efficiently enough, that's
OK. If he's planning to scale, though, maybe it's worth looking at
using Racket's FFI with a library like Redis or something. It might be
the least work and the cleanest solution.

Regards,
Brandon Thomas



Re: [racket-users] Re: appending files

2016-01-27 Thread Brandon Thomas
On Wed, 2016-01-27 at 04:01 -0500, George Neuner wrote:
> On Tue, 26 Jan 2016 23:00:01 -0500, Brandon Thomas
> <bthoma...@gmail.com> wrote:
> 
> > Is there anything stopping you from restructuring
> > the data on disk and using the hash directly from there
> 
> Scotty's hash table is much larger than he thinks it is and very
> likely is being paged to disk already.  Deliberately implementing it
> as a disk file is unlikely to improve anything.
> 
> George
> 

That's a good point to keep in mind. But there are advantages,
including faster startup time, using less RAM+swap, making it easier to
keep the file updated, and making resizing easier to implement. There
are probably more, but basically these are the reasons why all the
large (key,value) storage solutions I've heard of use an explicit file
instead of swap.

Regards,
Brandon Thomas



Re: [racket-users] Re: appending files

2016-01-27 Thread Brandon Thomas
On Tue, 2016-01-26 at 22:48 -0800, Scotty C wrote:
> ok brandon, that's a thought. build the hash on the hard drive at the
> time of data creation. you mention collision resolution. so let me
> build my hash on the hard drive using my 6 million buckets but
> increase the size of each bucket from 5 slots to 20. right?
> i can't exactly recreate my vector/bignum hash on the hard drive
> because i can't dynamically resize the buckets like i can the
> bignums. this gives me a 4 gb file whereas my original was 1 gb. i
> have enough space for that so that's not a problem.

Sure, a bigger file works. Your bignums are probably implemented using
linked lists, if I'm not mistaken. That's why they are O(1) resizable.
You could do the same thing on the hard drive. The problem is that you
might need to seek and read through the file more with linked lists.
Since you're probably paging, your current implementation has this same
problem too. You can get rid of it in both cases by using vectors,
though, as you suggest. And you can have a process in the background
that resizes the file (by creating a new one) when the current one gets
too full, so you don't block operations.

> so as my buckets fill up they head towards the average of 5 data
> items per bucket. so on average here's what happens with each hd hash
> record. i go to my hd hash and read 3.5 (think about it) items and
> 90% of the time i don't find my data so i do a write. in my process i
> do an initial write, then a read, a write, a read, a write. compare:
> 3.5 vs 2 reads; 1 vs 3 writes. the reads are more costly and if i
> exceed 20 items in a bucket the hd hash breaks. what do you think? is
> it worth it?
> 

I think that implementation has room for improvement. Say you look for
a record, it does not exist, and you need to write it. You seek to the
bucket's location and read the bucket into RAM (1 read). If the entry
isn't in the bucket, you seek to the free slot in the bucket and write
your entry (1 write). To do this, you'd want a function for your hash
that searches for an entry and falls back to writing the record when
the search fails, which is a reasonable function to have. You can also
introduce a cache and have a background process update the file (so
your hash operations aren't waiting on the hard drive). Though your
operating system's file cache might already be taking care of this for
you, so you might want to do some benchmarking.
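A rough, untested sketch of that search-then-write-on-miss function,
assuming 16-byte keys, 20 slots per bucket, an all-zero slot meaning
"empty", a pre-allocated file, and separate input/output ports opened
on it (e.g. open-output-file with #:exists 'update). The bucket index
is whatever hash you already compute; all the names here are
illustrative, not your actual layout:

    #lang racket

    (define KEY-BYTES 16)
    (define BUCKET-SLOTS 20)
    (define EMPTY-KEY (make-bytes KEY-BYTES 0))

    ;; Returns #t if key was already in the bucket (a dupe), #f if it
    ;; was written into the first free slot; errors if the bucket is
    ;; full.  One bucket read, at most one key write.
    (define (lookup-or-insert! in out key bucket-index)
      (define base (* bucket-index BUCKET-SLOTS KEY-BYTES))
      (file-position in base)
      (define bucket (read-bytes (* BUCKET-SLOTS KEY-BYTES) in)) ; 1 read
      (let loop ([slot 0])
        (if (= slot BUCKET-SLOTS)
            (error 'lookup-or-insert! "bucket full")
            (let ([candidate (subbytes bucket
                                       (* slot KEY-BYTES)
                                       (* (add1 slot) KEY-BYTES))])
              (cond
                [(bytes=? candidate key) #t]           ; duplicate found
                [(bytes=? candidate EMPTY-KEY)         ; first free slot
                 (file-position out (+ base (* slot KEY-BYTES)))
                 (write-bytes key out)                 ; 1 write
                 (flush-output out)
                 #f]
                [else (loop (add1 slot))])))))

The flush-output matters with buffered ports, otherwise a later read of
the same bucket could miss a write still sitting in the output buffer.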

There are probably several other optimizations you can make, and you
can probably find them by going through the source/docs of other
(key,value) storage solutions. Or, if you want to cut down on work, you
can just use one of those (key,value) storage solutions directly and
talk to it via Racket's FFI.

Regards,
Brandon Thomas



Re: [racket-users] Re: appending files

2016-01-26 Thread Brandon Thomas
On Tue, 2016-01-26 at 18:40 -0800, Scotty C wrote:
> alright george, i'm open to new ideas. here's what i've got going.
> running 64 bit linux mint OS on a 2 core laptop with 2 gb of ram. my
> key is 128 bits with ~256 bits per record. so my 1 gb file contains
> ~63 million records and ~32 million keys. about 8% will be dupes
> leaving me with ~30 million keys. i run a custom built hash. i use
> separate chaining with a vector of bignums. i am willing to let my
> chains run up to 5 keys per chain so i need a vector of 6 million
> pointers. that's 48 mb for the array. another 480 mb for the bignums.
> let's round that sum to .5 gb. i have another rather large bignum in
> memory that i use to reduce but not eliminate record duplication of
> about .5 gb. i'm attempting to get this thing to run in 2 places so i
> need 2 hashes. add this up .5+.5+.5 is 1.5 gb and that gets me to
> about my memory limit. the generated keys are random but i use one of
> the associated fields for sorting during the initial write to the
> hard drive. what goes in each of those files is totally random but
> dupes do not run across files. also, the number of keys is >1e25.
> 

Sorry, I haven't read through the entire conversation, so I hope I'm
not missing anything. Is there anything stopping you from restructuring
the data on disk and using the hash directly from there (possibly with
the help of a cache if speed is important)? For example, let's say each
entry is 256 bits (32 bytes). Use something like "(file-position dbfile
(* 32 (hash-custom key)))" to seek to the appropriate entry on disk and
read just the entry you need (with whatever collision resolution you
like). Then you'll be using no auxiliary memory (unless you're caching,
in which case the cache can just be a smaller in-RAM hash table).
Unless of course I'm just missing something completely.
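For what it's worth, an untested sketch of just that seek-and-read,
where dbfile, hash-custom, and the 32-byte record size come from the
paragraph above and collision handling is left out entirely:

    #lang racket

    ;; dbfile: an input file port on the restructured data file.
    ;; hash-custom: a function mapping a key to its slot index.
    (define (read-entry dbfile hash-custom key)
      (file-position dbfile (* 32 (hash-custom key)))
      (read-bytes 32 dbfile))   ; the 32-byte record at that slot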

Regards,
Brandon Thomas
