Thanks for the link. My expectation is that the jobs will be CPU (and, to a lesser extent, RAM) bound, and less IO bound. It's not searching so much as doing fancy pattern matching. I'm working on algorithms to allow automatic coding of qualitative data -- essentially, a sociological use for document classification -- which will eventually reach very large numbers of documents and categories. Software running on it will be:
- Linux (probably debian)
- PostgreSQL
- R
- Perl
- Apache (to allow research staff to code texts)
- MAYBE ClaraOCR, to try to turn scanned documents into text

I'm going to go ahead and propose some fancy, brand-name solution (the Sun
or IBM probably) with the expectation that I'll probably end up cutting it
down to a cheaper solution.

----------------------------------------------------------------------
Andrew J Perrin - [EMAIL PROTECTED] - http://www.unc.edu/~aperrin
Assistant Professor of Sociology, U of North Carolina, Chapel Hill
269 Hamilton Hall, CB#3210, Chapel Hill, NC 27599-3210 USA

On 31 Jan 2002, Ed Hill wrote:

> On Thu, 2002-01-31 at 12:24, Andrew Perrin wrote:
> > Okay, this will be fun :)
> >
> > I'm putting together a research grant for some fairly heavy text crunching
> > (categorizing thousands of documents using statistical methods). At the
> > moment the grant is in the "reach for the sky" phase, meaning look for the
> > best-possible technical solution. Eventually, of course, we will probably
> > have to cut down.
> >
> > But for now, I'd like advice on hardware, potentially costing as much as
> > $25,000 for this project. I'm open to clustered solutions as well as
> > single-machine solutions, although I don't want to spend much time keeping
> > the cluster going. Things I've thought of include:
> >
> > - IBM IntelliStation Z line
> > - Sun Enterprise 450 or 420R
> > - SGI 2200 or something like that (don't know SGI's line well)
> > - Building a standard Intel-based system (dual fast processors, 4G RAM,
> >   fast SCSI disks, etc.)
> >
> > So, what would you do?
> >
> I assume by text processing you mean mostly integer work with some
> floating point. In either case, you should be aware of the SPEC
> benchmarks:
>
>   http://www.spec.org/osg/cpu2000/results/
>
> and read how the benchmark scores are calculated before browsing.
>
> You'll notice that, at the moment, the AMD Athlons are the best in terms
> of operations (either floating point or integer) per second per dollar.
> You can get dual Athlon systems for very competitive prices online. Or
> pick up a recent copy of Linux Journal and you'll see multiple ads for
> companies selling dual-Athlon systems that come with Linux pre-loaded
> and pre-configured. For $25K you could build a small cluster.
>
> But getting back to the original question: how do you know whether your
> application(s) will be CPU-bound? If you're doing a lot of searching,
> your work is more likely to be IO-bound and in that case you're better
> off getting relatively cheaper/slower CPUs and putting your grant money
> into a large/fast SCSI array.
>
> hth,
> Ed
>
>
> --
> Edward H. Hill III, PhD
> Post-Doctoral Researcher | Email: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Division of ESE          | URL: http://www.eh3.com
> Colorado School of Mines | Phone: 303-273-3483
> Golden, CO 80401         | Fax: 303-273-3311
> Key fingerprint = 5BDE 4DA1 66BE 4F7B BC17 3A0C 932B 7266 1E76 F123
>
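
On the question of how to tell whether a job is CPU-bound or IO-bound, one
rough check is to compare the CPU time a run consumes against its wall-clock
time. This is only a sketch and not anything proposed in the thread: the
function names and the two stand-in workloads are hypothetical, and a real
test would run the actual classification code on a sample of documents. A
ratio near 1 suggests the job is mostly computing; a ratio well below 1
suggests it spends much of its time waiting (on disk, the database, etc.).

```python
import time


def cpu_fraction(job, *args, **kwargs):
    """Run job once; return (cpu_seconds, wall_seconds, cpu/wall ratio)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # user + system CPU time of this process
    job(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    ratio = cpu / wall if wall > 0 else 0.0
    return cpu, wall, ratio


def compute_stand_in(n=300_000):
    """Hypothetical CPU-heavy workload (stands in for pattern matching)."""
    total = 0
    for i in range(n):
        total += (i * i) % 7
    return total


def waiting_stand_in(seconds=0.2):
    """Hypothetical workload that mostly waits (stands in for slow IO)."""
    time.sleep(seconds)


if __name__ == "__main__":
    for name, job in [("compute", compute_stand_in), ("waiting", waiting_stand_in)]:
        cpu, wall, ratio = cpu_fraction(job)
        print(f"{name}: cpu={cpu:.3f}s wall={wall:.3f}s ratio={ratio:.2f}")
```

The shell's `time` command gives the same comparison for a whole program:
user plus sys against real.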
