My current theory is that collection pays the overhead of creating a file handler 40,000 times, while index creates a handler only once to read the index results. I don't know whether that alone is enough to explain such a large difference, though.
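
One rough way to check this theory in isolation would be a micro-benchmark along the lines below. This is only a sketch and not the actual collection or index code: the helper names (write_test_files, read_many, read_one, timed), the file count of 2000, and the payload size are all made up for illustration. It reads the same bytes once through one handle per file and once through a single handle on a combined file.

```python
import os
import tempfile
import time


def write_test_files(directory, count, payload):
    """Create `count` small files plus one combined file holding the same bytes.

    Hypothetical helper: file names and layout are illustrative only.
    """
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"doc_{i}.xml")
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    combined = os.path.join(directory, "combined.bin")
    with open(combined, "wb") as f:
        for _ in range(count):
            f.write(payload)
    return paths, combined


def read_many(paths):
    """Open and read every file separately (one handle per file)."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            total += len(f.read())
    return total


def read_one(path):
    """Open a single handle and read all the data through it."""
    with open(path, "rb") as f:
        return len(f.read())


def timed(fn, arg):
    """Return (result, elapsed seconds) for a single call of fn(arg)."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Assumed sizes; raise the count toward 40,000 to mirror the real data set.
    payload = b"<r>" + b"x" * 1000 + b"</r>"
    with tempfile.TemporaryDirectory() as tmp:
        paths, combined = write_test_files(tmp, 2000, payload)
        bytes_many, t_many = timed(read_many, paths)
        bytes_one, t_one = timed(read_one, combined)
        print(f"many handles: {bytes_many} bytes in {t_many:.4f}s")
        print(f"one handle:   {bytes_one} bytes in {t_one:.4f}s")
```

Because the files are read right after being written, they will most likely be served from the OS cache, so this would mostly measure the per-open overhead rather than disk access. If the gap here is small compared to what the graphs show, the handler theory probably does not explain the whole difference.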

Steven

On Tue, Sep 17, 2013 at 11:28 AM, Michael Carey <[email protected]> wrote:

> Interesting! I haven't followed enough yet, but now you have my
> interest. :-)
> Do you have an explanation for why your index wins even in the case of
> 100%?
> (Not intuitive - maybe I am missing some details that would fill my
> intuition gap.)
>
> On 9/17/13 10:59 AM, Steven Jacobs wrote:
>
>> I ran a test on one of Preston's real-world data sets (Weather
>> collection) that had around 40,000 files. I am attaching the results. There
>> are three graphs.
>>
>> The first shows the time for returning the entire XML for all 40000
>> files. My index algorithm has huge gains over collection, no matter how
>> much of the data is returned.
>>
>> The second shows how the two algorithms perform as the number of files
>> increases. Both linearly increase, but collection has a much higher slope.
>>
>> The last is just a one-point comparison for returning paths that
>> exist in only 100 out of the 40000 files. Once again, index has a huge
>> advantage.
>>
>> Steven
