[racket-users] Re: appending files

2016-01-31 Thread Scotty C
On Sunday, January 31, 2016 at 12:13:31 AM UTC-6, Scotty C wrote:
> > that's what i did. so new performance data. this is with bytes instead of 
> > strings for data on the hard drive but bignums in the hash still.
> > 
> > as a single large file and a hash with 203 buckets for 26.6 million 
> > records the data rate is 98408/sec.
> > 
> > when i split and go with 11 smaller files and hash with 59 buckets the 
> > data rate is 106281/sec.
> 
> hash is reworked, bytes based. same format though, vector of bytes. so time 
> test results:
> 
> single large file same # buckets as above data rate 175962/sec.
> 
> 11 smaller files same # buckets as above data rate 205971/sec.

throughput update. i had to hand code some of the stuff (places are just not 
working for me) but i just managed to hack my way through running this in 
parallel. i copied the original 26.6 million records to a new file. ran two 
slightly reworked copies of my duplicate removal code at a shell prompt like 
this:
racket ddd-parallel.rkt &
racket ddd-parallel1.rkt &
i'm not messing with the single large file anymore. so for twice the data the 
data rate is up to 356649/sec.
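
in case it helps anyone reading along, here is a rough sketch of launching those same two scripts from racket itself rather than at the shell prompt (the script names are the ones above; this is just a launcher, not the dedup code, and it assumes racket is on the PATH):

(require racket/system)

; launch one script as its own OS process and hand back the thread waiting on it
(define (run script)
  (thread (lambda () (system (string-append "racket " script)))))

; start both copies, then block until both have finished
(define workers (map run '("ddd-parallel.rkt" "ddd-parallel1.rkt")))
(for-each thread-wait workers)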



[racket-users] Re: appending files

2016-01-30 Thread Scotty C
> Yes.   You probably do need to convert the files.  Your original
> coding likely is not [easily] compatible with binary I/O.

that's what i did. so new performance data. this is with bytes instead of 
strings for data on the hard drive but bignums in the hash still.

as a single large file and a hash with 203 buckets for 26.6 million records 
the data rate is 98408/sec.

when i split and go with 11 smaller files and hash with 59 buckets the data 
rate is 106281/sec.

clearly it is quicker to read/write twice than to read/write once. but of 
course my laptop is pretty sad and my hash may end up being sad as well. with 
these revised and quicker rates, i'm ready to migrate the hash to bytes.



[racket-users] Re: appending files

2016-01-30 Thread Scotty C
just found a small mistake in the documentation: can you find it?

(numerator q) → integer?

  q : rational?

Coerces q to an exact number, finds the numerator of the number expressed in 
its simplest fractional form, and returns this number coerced to the exactness 
of q.

(denominator q) → integer?

  q : rational?

Coerces q to an exact number, finds the numerator of the number expressed in 
its simplest fractional form, and returns this number coerced to the exactness 
of q.



[racket-users] Re: appending files

2016-01-30 Thread Scotty C
> that's what i did. so new performance data. this is with bytes instead of 
> strings for data on the hard drive but bignums in the hash still.
> 
> as a single large file and a hash with 203 buckets for 26.6 million 
> records the data rate is 98408/sec.
> 
> when i split and go with 11 smaller files and hash with 59 buckets the 
> data rate is 106281/sec.

hash is reworked, bytes based. same format though, vector of bytes. so time 
test results:

single large file same # buckets as above data rate 175962/sec.

11 smaller files same # buckets as above data rate 205971/sec.

i played around with the # buckets parameter but what worked for bignums worked 
for bytes too. overall speed has nearly doubled. very nice, thanks to all who 
contributed some ideas. and to think, all i wanted to do was paste some files 
together.



Re: [racket-users] Re: appending files

2016-01-29 Thread Scotty C
> > question for you all. right now i use modulo on my bignums. i know i
> > can't do that to a byte string. i'll figure something out. if any of
> > you know how to do this, can you post a method?
> > 
> 
> I'm not sure what you're asking exactly.

i'm talking about getting the hash index of a key. see, my key is a bignum and i 
get the hash index with (modulo key 611). so i either need to turn the key 
(which will be a byte string) into a number and stuff that right in where i have 
key, or i replace modulo.
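
something along these lines is what i have in mind (just a sketch; the 611 bucket count is the one from above, and bytes->key / hash-index are names i made up):

; rebuild the number from the byte-string key, big-endian, then reuse modulo
(define (bytes->key bs)
  (for/fold ([n 0]) ([b (in-bytes bs)])
    (+ (* n 256) b)))

(define (hash-index bs)
  (modulo (bytes->key bs) 611))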



Re: [racket-users] Re: appending files

2016-01-29 Thread Scotty C
> However, if you have implemented your own, you can still call 
> `equal-hash-code` 
yes, my own hash.
i think the equal-hash-code will work.
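i.e. something like this minimal sketch (611 buckets as before; hash-index is just my name for it):

(define (hash-index bs)
  (modulo (equal-hash-code bs) 611))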



[racket-users] Re: appending files

2016-01-29 Thread Scotty C
> i get the feeling that i will need to read the entire file as i used to read 
> it taking each record and doing the following:
> convert the string record to a bignum record
> convert the bignum record into a byte string
> write the byte string to a new data file
> 
> does that seem right?

nevermind. this is indeed what i needed to do. the new file is 438.4 mb. the 
time to read, hash, write is now 317 seconds. processing rate is 83818/sec. the 
hash still uses bignums. the speed change is just from reading and writing 
bytes instead of strings.

the drop in file size is about 30%; the gain in speed is about 15%.



[racket-users] Re: appending files

2016-01-29 Thread Scotty C
> i get the feeling that i will need to read the entire file as i used to read 
> it taking each record and doing the following:
> convert the string record to a bignum record
> convert the bignum record into a byte string
> write the byte string to a new data file
> 
> does that seem right?

nevermind. this is indeed what i needed to do. the new file is 438.4 mb. the 
read, hash, write processing rate is now 83818 rec/sec. the hash still uses 
bignums. the speed change is just from reading and writing bytes instead of 
strings.




[racket-users] Re: appending files

2016-01-29 Thread Scotty C
> my plan right now is to rework my current hash so that it runs byte strings 
> instead of bignums.

i have a new issue. i wrote my data as chars and end records with 'return. i use 
(read-line x 'return) and the first record is 15 chars. when i use 
(read-bytes-line x 'return) i get 23 bytes. i have to assume that my old 
assumption that an 8 bit char would write to disk as 8 bits was incorrect?

from documentation on read-char
Reads a single character from in—which may involve reading several bytes to 
UTF-8-decode them into a character

i get the feeling that i will need to read the entire file as i used to read it 
taking each record and doing the following:
convert the string record to a bignum record
convert the bignum record into a byte string
write the byte string to a new data file

does that seem right?
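
in case it isn't clear, here is roughly the conversion pass i'm describing (a sketch only; number->16bytes and convert-file are names i made up, and i'm assuming 16-byte keys, 'return-terminated string records, and that each record's string form is something string->number can read):

; pack an exact nonnegative integer into a fixed 16-byte big-endian byte string
(define (number->16bytes n)
  (let loop ([n n] [i 15] [bs (make-bytes 16 0)])
    (if (< i 0)
        bs
        (begin (bytes-set! bs i (bitwise-and n 255))
               (loop (arithmetic-shift n -8) (sub1 i) bs)))))

; read each string record, parse it as a number, write it back out as raw bytes
(define (convert-file in-path out-path)
  (call-with-output-file out-path
    #:exists 'replace
    (lambda (out)
      (call-with-input-file in-path
        (lambda (in)
          (for ([line (in-lines in 'return)])
            (write-bytes (number->16bytes (string->number line)) out)))))))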



[racket-users] Re: appending files

2016-01-29 Thread Scotty C
ok, had time to run my hash on my one test file
'(611 1 1 19 24783208 4.19)
this means:
611 = # buckets
1 = % buckets empty
1 = fewest keys in a non-empty bucket
19 = most keys in a non-empty bucket
24783208 = total number of keys
4.19 = average number of keys per non-empty bucket

it took 377 sec.
original # records is 26570359 so 6.7% dupes.
processing rate is 70478/sec

my plan right now is to rework my current hash so that it runs byte strings 
instead of bignums. it will probably be tomorrow afternoon before i have more 
stats.

question for you all. right now i use modulo on my bignums. i know i can't do 
that to a byte string. i'll figure something out. if any of you know how to do 
this, can you post a method?



Re: [racket-users] Re: appending files

2016-01-28 Thread Scotty C
On Thursday, January 28, 2016 at 11:36:50 PM UTC-6, Brandon Thomas wrote:
> On Thu, 2016-01-28 at 20:32 -0800, Scotty C wrote:
> > > I think you understand perfectly.
> > i'm coming around
> > 
> > > You said the keys are 128-bit (16 byte) values.  You can store one
> > > key
> > > directly in a byte string of length 16.
> > yup
> > 
> > > So instead of using a vector of pointers to individual byte
> > > strings,
> > > you would allocate a single byte string of length 
> > >    {buckets x chain length x 16} 
> > > and index it directly as if it were a 3-dimensional array [using
> > > offset calculations as you would in C].
> > i'm going to actually run my code later (probably in the morning).
> > and then grab some stats from the hash. take a look at the stats.
> > give me some thoughts on the above implementation after looking at
> > the stats.
> > 
> > > Can you explain why the bignums are important.  For a simple task
> > > like
> > > filtering a file, they would seem to be both a waste of memory and
> > > a
> > > performance drain wrt storing the key values as byte strings.
> > ok, true story. about 10 years ago i'm a student taking an intro to
> > ai class we have been assigned the 3x3 tile puzzle and 2 algorithms,
> > ida* and a*. also, try it on a 4x4. so i whip these out and am
> > appalled by how slow the fn things are but intrigued by what they can
> > do and i'm just a puzzle kind of guy. btw, the 4x4 wasn't solvable by
> > my stuff at the time. so i head to the chief's office with my little
> > bit of code. tell him it's way too slow. what do you think? he takes 5
> > seconds to peruse the item and says "you're probably making too much
> > memory". side note, later i graded this class for that prof and the
> > same project. not a single student, including the ms and phd types,
> > did anything better than a vector of vectors of ints. so back to me,
> > i'm thinking, how do i cram 4 bits together so that there is
> > absolutely no wasted space. i start digging through the documentation
> > and i find the bitwise stuff. i see arithmetic shift and having no
> > idea whether it will work or not i type into the interface
> > (arithmetic-shift 1 100). if we blow up, well we blow up big but
> > we didn't. i flipped everything in my project to bignum at that
> > point. the bignum stuff is massively faster than vector of vectors of
> > ints and faster than just a single vector of ints. lots of bignum is
> > easy to implement, like the hash. making a child state, not so much.
> > i'm not married to bignums despite all this.
> > 
> > > I've been programming for over 30 years, professionally for 23.
> > i was a programmer. dot com bought us as a brick and mortar back in
> > '99 and the whole shebang blew up 2 years later. idiots. anyway, been
> > a stay at home dad for the most part since then.
> > 
> > > have an MS in Computer Science
> > me too. that's the other piece of the most part from up above.
> > 
> > > (current-memory-use)
> > yup, tried that a while back didn't like what i saw. check this out:
> > 
> > > (current-memory-use)
> > 581753864
> > > (current-memory-use)
> > 586242568
> > > (current-memory-use)
> > 591181736
> > > (current-memory-use)
> > 595527064
> > 
> > does that work for you?
> > 
> 
> For whatever you're storing, I still suggest using a disk based
> structure (preferably using one that's already optimised and built for
> you). I've done a bit of work on cache aware algorithms, where reducing
> memory footprint is really big (along with the memory juggling). Yes,
> if you try to store something that takes only a few bits and store each
> one into an integer, you'll have wasted space. In theory, you could
> use a bignum, and only shift it as many bits as you need, which is what
> you have done. The issue with that is that bignums have extra overhead
> that's necessary for them to do arithmetic. Obviously, there needs to be
> a way to match or beat bignums with primitive structures, since bignums
> are implemented with primitive structures. So, if you want to beat
> bignum for storage, you'll want to use some contiguous memory with
> fixed sized elements (like byte strings, or arrays of uint32_t's in C)
> - but using bit manipulation on each byte, such that you have multiple
> stored values in each one, directly beside each other bitwise, like a
> bignum has, but without its overhead.
> 
> Regards,
> Brandon Thomas



[racket-users] Re: appending files

2016-01-28 Thread Scotty C
> You claim you want filtering to be as fast as possible.  If that were
> so, you would not pack multiple keys (or features thereof) into a
> bignum but rather would store the keys individually.
chasing pointers? no, you're thinking about doing some sort of byte-append and 
subbytes type of thing. that's the only way data in a hash would be small in 
memory and reasonably quick. care to elaborate?

> The 16-byte keys are 1/3 the size of even the _smallest_ bignum, and
where are you getting this? i've been digging all over the documentation and 
can't find a fn thing on how much space is required for any data type or its 
overhead. what i do is open up htop and check memory, then load drracket and a 
huge bignum and recheck. close drracket, check memory, restart drracket, load a 
huge vector and its associated bignums and recheck. so 2 questions for you all:
1) where is this info about data type memory requirements?
2) what is a better way of checking actual memory sucked up by my data 
structures? (the best i've got so far is sketched below)
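
here is the best i've managed so far, for reference: force a couple of collections before each reading so the numbers settle, then difference two measurements around the allocation. the 6 million element vector is just a stand-in for my hash's pointer vector; everything else is generic.

(collect-garbage) (collect-garbage)
(define before (current-memory-use))
(define big-vec (make-vector 6000000 0))  ; stand-in for the hash's pointer vector
(collect-garbage) (collect-garbage)
(define after (current-memory-use))
(- after before)                          ; rough footprint of big-vec, in bytes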

> comparing two small byte strings is faster than anything you can do
> with the bignum.  With the right data structure can put a lot more
> keys into the same space you're using now and use them faster.
you knew this was coming, right? put this into your data structure of choice:
16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14
this is a particular 5x5 tile puzzle (#6 in 
www.aaai.org/Papers/AAAI/1996/AAAI96-178.pdf) with the blank in a position 
where 4 children can be made. make a child while extracting the value of the 
tile being swapped with the blank. compare child to parent for equality. repeat 
that for the other 3 children.

time testing that won't matter because we have different hardware. if you (any 
of you) think you have something that will best what i'm doing (bignums), bring 
it on. show me something cool and fast. let me check out your approach.

i experimented a bit with the byte strings yesterday. what i'm doing with the 
bignums can't be done with byte strings. i'd have to rewrite just about 
everything i've got.



[racket-users] Re: appending files

2016-01-28 Thread Scotty C
what's been bothering me was trying to get the data into 16 bytes, i.e. a byte 
string of that length. i couldn't get that to work so i gave up and just shoved 
the data into 25 bytes. here's a bit of code. i think it's faster than my 
bignum stuff.

(define p
  (bytes 16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14))

(define (c1 p) ; move blank left (the blank, 0, sits at index 16 in p)
  (let ((x (bytes-ref p 15))   ; tile that will slide into the blank's spot
        (c (bytes-copy p)))    ; child state starts as a copy of the parent
    (bytes-set! c 15 0)        ; blank moves to index 15
    (bytes-set! c 16 x)        ; swapped tile lands where the blank was
    (bytes=? p c)              ; parent/child equality check (value ignored here)
    c))                        ; return the child



[racket-users] Re: appending files

2016-01-28 Thread Scotty C
> I think you understand perfectly.
i'm coming around

> You said the keys are 128-bit (16 byte) values.  You can store one key
> directly in a byte string of length 16.
yup

> So instead of using a vector of pointers to individual byte strings,
> you would allocate a single byte string of length 
>{buckets x chain length x 16} 
> and index it directly as if it were a 3-dimensional array [using
> offset calculations as you would in C].
i'm going to actually run my code later (probably in the morning). and then 
grab some stats from the hash. take a look at the stats. give me some thoughts 
on the above implementation after looking at the stats.

> Can you explain why the bignums are important.  For a simple task like
> filtering a file, they would seem to be both a waste of memory and a
> performance drain wrt storing the key values as byte strings.
ok, true story. about 10 years ago i'm a student taking an intro to ai class we 
have been assigned the 3x3 tile puzzle and 2 algorithms, ida* and a*. also, try 
it on a 4x4. so i whip these out and am appalled by how slow the fn things are 
but intrigued by what they can do and i'm just a puzzle kind of guy. btw, the 
4x4 wasn't solvable by my stuff at the time. so i head to the chief's office 
with my little bit of code. tell him it's way too slow. what do you think? he 
takes 5 seconds to peruse the item and says "you're probably making too much 
memory". side note, later i graded this class for that prof and the same 
project. not a single student, including the ms and phd types, did anything 
better than a vector of vectors of ints. so back to me, i'm thinking, how do i 
cram 4 bits together so that there is absolutely no wasted space. i start 
digging through the documentation and i find the bitwise stuff. i see 
arithmetic shift and having no idea whether it will work or not i type into the 
interface (arithmetic-shift 1 100). if we blow up, well we blow up big but 
we didn't. i flipped everything in my project to bignum at that point. the 
bignum stuff is massively faster than vector of vectors of ints and faster than 
just a single vector of ints. lots of bignum is easy to implement, like the 
hash. making a child state, not so much. i'm not married to bignums despite all 
this.

> I've been programming for over 30 years, professionally for 23.
i was a programmer. dot com bought us as a brick and mortar back in '99 and the 
whole shebang blew up 2 years later. idiots. anyway, been a stay at home dad 
for the most part since then.

> have an MS in Computer Science
me too. that's the other piece of the most part from up above.

> (current-memory-use)
yup, tried that a while back; didn't like what i saw. check this out:

> (current-memory-use)
581753864
> (current-memory-use)
586242568
> (current-memory-use)
591181736
> (current-memory-use)
595527064

does that work for you?



[racket-users] Re: appending files

2016-01-28 Thread Scotty C
> Way back in this thread you implied that you had extremely large FILES
> containing FIXED SIZE RECORDS, from which you needed 
> to FILTER DUPLICATE records based on the value of a FIXED SIZE KEY
> field.
this is mostly correct. the data is state and state associated data on the 
fringe. hence the name of the algorithm: fringe search. states (keys) are fixed 
size. associated data (due to the operator sequence) is variable size. i didn't 
post that here. i sent you (george) an email directly wed at 9:49 am according 
to my sent email box. getting this piece of the algorithm to go faster, less 
memory, both? awesome.

my actual test file (just checked) is 633 mb. it is data from perhaps halfway 
through a search. the fringe for a 5x5 grows by about 9x each successive 
fringe. i say about 9x because as the fringes grow, the amount of redundancy 
will increase. when i hit the limits of my hardware and patience with this 
algorithm i was at 90% redundancy but that fringe file was huge. i still hadn't 
produced an answer for the problem and decided i needed to get the code to run 
in parallel. that was about 5 years ago. last spring i started reworking my old 
stuff to work with places and ran out of enthusiasm until about 2 weeks ago.

> Doesn't work for 6x6?  Well 36 6-bit values fit neatly into 216 bits
> (27 bytes).
the guy (korf) who did the paper on the 24 puzzle has attempted the 6x6 and 
failed. notice in the 24 puzzle paper that he was unable to solve one of those 
10 sample problems. 5x5 is what i'm after.



[racket-users] Re: appending files

2016-01-27 Thread Scotty C
On Wednesday, January 27, 2016 at 2:57:42 AM UTC-6, gneuner2 wrote:

> What is this other field on which the file is sorted?
this field is the cost in operators to arrive at the key value

> WRT a set of duplicates: are you throwing away all duplicates? Keeping
> the 1st one encountered?  Something else?
keep first instance, chuck the rest

> This structure uses a lot more space than necessary.  Where did you
> get the idea that a bignum is 10 bytes?
not sure about the 10 bytes. if i shove 5 128 bit keys into a bignum, is that 
about 80 bytes plus some overhead? so 80 bytes times the 6 million buckets is 
480 mb not including overhead.

> In the worst case of every key being a bignum
no, every key is contained within a bignum which can contain many many keys.

> Since you are only comparing the hash entries for equality, you could
> save a lot of space [at the expense of complexity] by defining a
> {bucket x chain_size x 16} byte array and storing the 16-byte keys
> directly.
i must be able to grow the chains. i can't make it fixed size like that.

> > have another rather large bignum in memory that i use to reduce
> >but not eliminate record duplication of about .5 gb. 
> ???

ha, ok. this is what this bignum is for: cycle elimination. a sequence of 
operators (2 bits per operator) when strung together is a number like 4126740, 
which represents the operator sequence (0 0 3 0 0 3 1 1 2 2 1). i set the bit at 
that index in the bignum from 0 to 1. during data generation i look up my 
about-to-be-applied operator sequence in the bignum. if i see a one, i skip data 
generation. i'm not really happy with the volume of memory this takes but it is 
an insanely fast lookup and keeps a ton of data off the hard drive.
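
stripped of everything else, the idea is just this (a tiny sketch; mark! and seen? are names i'm making up for the post):

(define seen 0)  ; the big bignum: one bit per operator-sequence code

; record an operator-sequence code (an exact nonnegative integer) as visited
(define (mark! seq-code)
  (set! seen (bitwise-ior seen (arithmetic-shift 1 seq-code))))

; has this operator sequence been seen before?
(define (seen? seq-code)
  (bitwise-bit-set? seen seq-code))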

> In the meantime, I would suggest you look up "merge sort" and it's
logarithmic? not happening



[racket-users] Re: appending files

2016-01-27 Thread Scotty C
> Is it important to retain that sorting?  Or is it just informational?
it's important

> Then you're not using the hash in a conventional manner ... else the
> filter entries would be unique ... and we really have no clue what
> you're actually doing.  So any suggestions we give you are shots in
> the dark.
using it conventionally? absolutely. it is a hash with separate chaining. will 
a bit of code help?

(define (put p)
  (define kh (hashfn p))  ; bucket index for key p
  ; prepend p to the bucket's chain: shift the existing bignum left by shftl
  ; bits and OR the new key into the low bits
  (vector-set! tble kh
               (bitwise-ior (arithmetic-shift (vector-ref tble kh) shftl) p)))
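
and, roughly, the matching lookup walks the chained bignum shftl bits at a time. this is a from-memory sketch rather than my exact code; mask and contains? are names i'm inventing here, and mask is just the low shftl bits.

(define mask (sub1 (arithmetic-shift 1 shftl)))  ; low shftl bits

(define (contains? p)
  (let loop ([chain (vector-ref tble (hashfn p))])
    (cond [(zero? chain) #f]                                  ; chain exhausted
          [(= (bitwise-and chain mask) p) #t]                 ; low slot matches the key
          [else (loop (arithmetic-shift chain (- shftl)))]))) ; drop to the next slot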



[racket-users] Re: appending files

2016-01-26 Thread Scotty C
ok brandon, that's a thought. build the hash on the hard drive at the time of 
data creation. you mention collision resolution. so let me build my hash on the 
hard drive using my 6 million buckets but increase the size of each bucket from 
5 slots to 20. right? i can't exactly recreate my vector/bignum hash on the 
hard drive because i can't dynamically resize the buckets like i can the 
bignums. this gives me a 4 gb file whereas my original was 1 gb. i have enough 
space for that so that's not a problem. so as my buckets fill up they head 
towards the average of 5 data items per bucket. so on average here's what 
happens with each hd hash record. i go to my hd hash and read 3.5 (think about 
it) items and 90% of the time i don't find my data so i do a write. in my 
process i do an initial write, then a read, a write, a read, a write. compare: 
3.5 vs 2 reads; 1 vs 3 writes. the reads are more costly and if i exceed 20 
items in a bucket the hd hash breaks. what do you think? is it worth it?
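
to make the layout concrete, the hd hash i'm describing would look something like this (sizes are the ones from this thread: 6 million buckets, 20 slots per bucket, ~32 byte records; read-bucket is a made-up name):

(define record-size 32)        ; ~256 bits per record
(define slots-per-bucket 20)
(define bucket-size (* slots-per-bucket record-size))

; fetch one whole bucket from the on-disk hash as a byte string
(define (read-bucket port bucket-index)
  (file-position port (* bucket-index bucket-size))
  (read-bytes bucket-size port))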



[racket-users] Re: appending files

2016-01-26 Thread Scotty C
robby findler, you the man. i like the copy-port idea. i incorporated it and it 
is nice and fast and fit easily into the existing code.
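
for anyone searching the archive later, the shape of what i ended up with is roughly this (a sketch; append-files! is not my actual function name):

(require racket/port)

; append each source file onto the end of dest, one copy-port per file
(define (append-files! dest srcs)
  (call-with-output-file dest
    #:exists 'append
    (lambda (out)
      (for ([src (in-list srcs)])
        (call-with-input-file src
          (lambda (in) (copy-port in out)))))))

called with the original file name and the list of temp file names.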



[racket-users] Re: appending files

2016-01-26 Thread Scotty C
neil van dyke, i have used the system function before but had forgotten what it 
was called, so i couldn't find it in the documentation. my problem with using 
the system function is that i need 2 versions of it: windows and linux. the 
copy-port function is a write-once, use-across-multiple-os solution. sweet.



[racket-users] Re: appending files

2016-01-26 Thread Scotty C
gneuner2 (george), you are over thinking this thing. my test data of 1 gb is 
but a small sample file. i can't even hash that small 1 gb at the time of data 
creation. the hashed data won't fit in ram. at the time i put the redundant 
data on the hard drive, i do some constant time sorting so that the redundant 
data on the hard drive is contained in roughly 200 usefully sorted files. some 
of these files will be small and can be hashed with a single read, hash and 
write. some will be massive (data won't fit in ram) and must be split further. 
this produces another type of single read, hash and write. these split 
files can now be fully hashed which means a second read, hash and write. 
recombining the second level files is virtually instantaneous (copy-port) 
relative to the effort spent to get to that point. all of these operations are 
constant time. it would be nice to cut into that big fat hard drive induced C 
but i can't do it with a single read and write on the larger files.



[racket-users] Re: appending files

2016-01-26 Thread Scotty C
alright george, i'm open to new ideas. here's what i've got going. running 64 
bit linux mint OS on a 2 core laptop with 2 gb of ram. my key is 128 bits with 
~256 bits per record. so my 1 gb file contains ~63 million records and ~32 
million keys. about 8% will be dupes leaving me with ~30 million keys. i run a 
custom built hash. i use separate chaining with a vector of bignums. i am 
willing to let my chains run up to 5 keys per chain so i need a vector of 6 
million pointers. that's 48 mb for the array. another 480 mb for the bignums. 
let's round that sum to .5 gb. i have another rather large bignum in memory 
that i use to reduce but not eliminate record duplication of about .5 gb. i'm 
attempting to get this thing to run in 2 places so i need 2 hashes. add this up 
.5+.5+.5 is 1.5 gb and that gets me to about my memory limit. the generated 
keys are random but i use one of the associated fields for sorting during the 
initial write to the hard drive. what goes in each of those files is totally 
random but dupes do not run across files. also, the number of keys is >1e25.



[racket-users] appending files

2016-01-25 Thread Scotty C
here's what i'm doing. i make a large, say 1 gb file with small records and 
there is some redundancy in the records. i will use a hash to identify 
duplicates by reading the file back in a record at a time but the file is too 
large to hash so i split it. the resultant files (10) are about 100 mb and are 
easily hashed. the resultant files need to be appended and then renamed back to 
the original. i know that i can do this in a linux terminal window with the 
following: cat mytmp*.dat >> myoriginal.dat. i'd like to accomplish this from 
within the program by shelling out. can't figure it out. other methodologies 
that are super fast will be entertained. thanks, scott
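
in case the answer is just "use system", here is the kind of thing i was picturing (linux only, since it leans on the shell for the glob and the append; the file names are the ones above):

(require racket/system)
; hand the whole append to the shell, exactly like typing it at the prompt
(system "cat mytmp*.dat >> myoriginal.dat")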
