Disclaimer: I'm just an Xindice user, not a developer. Perhaps it would be helpful to look in more detail at why Xindice is better suited to large numbers of relatively small documents.
Essentially, the issue is indexing for queries. Indexes (at least when I last looked at the source code) are on collections, not on individual documents, and map from values of elements or attributes to the subset of documents containing those elements or attributes. When you issue a query for certain subtrees of the documents satisfying certain conditions, the appropriate subset of documents is first found using whatever indexes you have provided, and then XPath is used to extract the subtrees from the selected documents.

The index, being some kind of B-tree or hash, is reasonably efficient for large collections. XPath is reasonably efficient for small documents, unless the expression forces a scan of the entire document, e.g. //x (all the elements named x, at whatever position in the document). A query which retrieves an entire document by id will be quite fast regardless of how large the collection is (probably logarithmic in collection size at worst); there's a rough sketch of such a query in the P.S. below.

I don't believe there is any inherent limit on the size of a collection, except that (at least in the current implementation, I think) the internal compressed form of the collection and its indexes must fit in a single file, so you'll need a big disk and an appropriate file system.

All that being said, 400 million non-trivial documents is likely larger than anything Xindice has been used for before. If you attempted to use it for this project, you would probably hit some problems and end up driving improvements in the software. If I were you, before embarking on such an experiment I'd want a good picture of the pace of Xindice development, the responsiveness of the developers to reported problems (since you're likely to hit some showstoppers), and the plan for future development (to see whether its aims overlap largely with yours, or whether the focus is elsewhere).

Jeff
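P.S. In case it helps, here is a rough sketch of the kind of "retrieve by id" query I mean, going through the standard XML:DB API that Xindice exposes. The collection name, element name, and the assumption that you'd have an index on person@id are all made-up for illustration, not taken from your setup:

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.base.Resource;
import org.xmldb.api.base.ResourceIterator;
import org.xmldb.api.base.ResourceSet;
import org.xmldb.api.modules.XPathQueryService;

public class QueryById {
    public static void main(String[] args) throws Exception {
        // Register the Xindice XML:DB driver (class name as documented for Xindice 1.x).
        Database db = (Database) Class.forName(
                "org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
        DatabaseManager.registerDatabase(db);

        // "/db/addressbook" is a hypothetical collection.
        Collection col = DatabaseManager.getCollection(
                "xmldb:xindice:///db/addressbook");
        try {
            XPathQueryService service =
                    (XPathQueryService) col.getService("XPathQueryService", "1.0");

            // A selective predicate like [@id='12345'] is what a collection
            // index can exploit: Xindice first narrows the collection to the
            // matching documents, then runs XPath only on those.
            ResourceSet results = service.query("/person[@id='12345']");
            ResourceIterator it = results.getIterator();
            while (it.hasMoreResources()) {
                Resource r = it.nextResource();
                System.out.println(r.getContent());
            }
        } finally {
            if (col != null) {
                col.close();
            }
        }
    }
}

The point being: a query whose predicate matches an index stays fast as the collection grows, whereas something like //x with no indexed condition ends up handing every document to the XPath engine.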
