Mark Freeze wrote:
> You guys are way ahead of me on some of the hardware questions...
> However, to try and answer some of them:
> I have a script that controls the following actions:
> 1. Runs a C++ program that I wrote that opens a text file (the 50 - 100 MB
> file that I mentioned), reads each line sequentially, and splits the data
> into two output files after performing numerous tasks on the data (e.g.
> checking the validity of the zip code, making sure it matches the state,
> calculating amounts due, etc.).
> 2. Makes the second file into a dbase file.
> 3. Runs another C++ program on the first file that examines each record in
> the file and compares it to another database (using proprietary code
> libraries supplied by our software vendor), correcting any bad info in the
> address, adding a zip+4, adding carrier route info, etc.
> 4. Looks for another text file to process.
> 5. Appends all processed text files together.
> 6. Appends all dbase files into one.
> As I said in my previous post, each 100MB text file takes about 1 hr to
> run. Most of this time is spent on step 3.
> So, would clustering speed up this sometimes 3 - 4 hr process?
> Thanks,
> Mark.
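For reference, the driver script Mark describes might look roughly like the sketch below. The loop follows steps 1-6 above, but every path is hypothetical: a trivial `awk` split stands in for the real C++ split program, and the C++/vendor stages appear only as comments.

```shell
#!/bin/sh
# Rough sketch of the driver loop for steps 1-6 above. The /data
# paths are hypothetical, and the awk split is a stand-in for the
# real C++ programs.
IN=${IN:-/data/incoming}
OUT=${OUT:-/data/out}

for f in "$IN"/*.txt; do                   # step 4: pick up each text file
    # step 1 stand-in: split each file's lines into two output files
    awk '{ print > (NR % 2 ? main : aux) }' main="$f.main" aux="$f.aux" "$f"
    # step 2 would convert "$f.aux" into a dbase file here
    # step 3 would run the vendor address-correction on "$f.main" here
done

cat "$IN"/*.main > "$OUT/combined.txt"     # step 5: append processed text files
# step 6 would append the dbase files into one here
```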
How much file space does your whole process use while in motion? It sounds
like it might be only a couple of hundred megabytes. In that case, try
building a RAM disk and running the whole process from it. I've done a lot
of database-intensive work this way and seen speed increases of about 10x
by moving everything directly into RAM. At that rate, a 3-hour job would
take about 18 minutes. That is a much larger speedup than you will get from
parallelization.

Good luck!

Jon Carnes
--
TriLUG mailing list : http://www.trilug.org/mailman/listinfo/trilug
TriLUG Organizational FAQ : http://trilug.org/faq/
TriLUG Member Services FAQ : http://members.trilug.org/services_faq/
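A minimal sketch of the RAM-disk approach on Linux follows. A dedicated ramdisk would be mounted with `mount -t tmpfs`; here `/dev/shm` (tmpfs on most Linux systems) stands in for it so no root is needed, and a trivial `tr` command stands in for the real processing programs. All names below are assumptions, not Mark's actual setup.

```shell
#!/bin/sh
# Minimal sketch of staging a batch job on a RAM-backed filesystem.
# A dedicated ramdisk would be created with something like:
#   mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk    (needs root)
# Here /dev/shm stands in for it, and `tr` stands in for the real
# processing steps.
WORK=/dev/shm/batch.$$
mkdir -p "$WORK"

echo "sample record" > "$WORK/input.txt"                 # stage input in RAM
tr 'a-z' 'A-Z' < "$WORK/input.txt" > "$WORK/output.txt"  # process entirely in RAM
cp "$WORK/output.txt" ./results.txt                      # copy results back to disk
rm -rf "$WORK"                                           # free the RAM
```

The important points are copying the working set in before the run and copying results back out afterward, since anything left on the RAM disk disappears on unmount or reboot.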
