Adam Jenkins wrote:
Just so you know the complexity of the situation:
we have between 30 and 40 index xml files (changes daily). We are
only given the first one, which tells us two things:
1) where 20 data files are
2) where the next 'index' file is, which in turn contains references to
the next 20 files, and so on
Each of the 20 data files contains a reference (url) to another (big)
expanded data file with more information.
Finally, each expanded data file has a reference to either a PDF or
Word document, which gets translated into XML and processed.
All up, about 1000 data files and individual Word/PDF documents get
processed.
But maybe they don't all have to be processed in one pass?
You could possibly assemble a list of tasks by processing only the index
files (plus any cross-referenced data you may need), thus building a
meta index. Java code could then drive the process by reading that meta
index and spawning off individual transforms, processing each of the
huge files in turn; each individual transform should be pretty moderate
as far as memory consumption goes.
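
Something along these lines, as a rough sketch in plain JAXP. The
element names (dataFile, nextIndex), their href attributes and the
process.xsl stylesheet are only placeholders for whatever your real
index format and stylesheet look like:

  import java.io.File;
  import java.util.ArrayList;
  import java.util.List;

  import javax.xml.parsers.DocumentBuilder;
  import javax.xml.parsers.DocumentBuilderFactory;
  import javax.xml.transform.Templates;
  import javax.xml.transform.Transformer;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.stream.StreamResult;
  import javax.xml.transform.stream.StreamSource;

  import org.w3c.dom.Document;
  import org.w3c.dom.Element;
  import org.w3c.dom.NodeList;

  public class MetaIndexDriver {

      public static void main(String[] args) throws Exception {
          DocumentBuilder builder =
                  DocumentBuilderFactory.newInstance().newDocumentBuilder();

          // Phase 1: follow the chain of index files and collect the
          // data file URLs into an in-memory "meta index".
          List<String> dataFileUrls = new ArrayList<String>();
          String indexUrl = args[0];                    // URL of the first index file
          while (indexUrl != null) {
              Document index = builder.parse(indexUrl); // index files are small
              NodeList refs = index.getElementsByTagName("dataFile");   // assumed element name
              for (int i = 0; i < refs.getLength(); i++) {
                  dataFileUrls.add(((Element) refs.item(i)).getAttribute("href"));
              }
              NodeList next = index.getElementsByTagName("nextIndex");  // assumed element name
              indexUrl = next.getLength() > 0
                      ? ((Element) next.item(0)).getAttribute("href")
                      : null;                           // last index file in the chain
          }

          // Phase 2: one transform per data file. Only one (possibly big)
          // document is handled at a time, so memory use stays moderate.
          Templates templates = TransformerFactory.newInstance()
                  .newTemplates(new StreamSource(new File("process.xsl"))); // hypothetical stylesheet
          int n = 0;
          for (String url : dataFileUrls) {
              Transformer t = templates.newTransformer();
              t.transform(new StreamSource(url),
                          new StreamResult(new File("out-" + (n++) + ".xml")));
          }
      }
  }

The expanded data files and the PDF/Word conversion are left out here;
the point is just that only the small index files get parsed up front,
and each big file is then pushed through its own transform.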
Michael Ludwig