For the bulk load of one file, shouldn't it be roughly O(log(n) * log(P) * p), where n is the size of the file, P is the total number of tablets (proportional to tablet servers), and p is the number of tablets that get assigned that file?
For the BatchWriter case, there's a client-side lookup/binning that takes O(log(p)) per entry, so the latter would be O(n/p * (log(n/p) + log(p))) for each of p partitions. So, O(n*log(n)) in aggregate. Yes/no? Adam On Wed, Oct 24, 2012 at 3:57 PM, Jeff Kubina <[email protected]> wrote: > @eric, assuming the records are evenly distributed and network bandwidth > is not an issue, shouldn't that be O(n/p)+O(p) and O(n/p * log (n/p))? > > > On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <[email protected]>wrote: > >> Adding a sorted file to accumulo (bulk loading) is essentially >> constant in the normal case. It is O(n) + O(p) for the worst case >> where the index must be read, and the file assigned to every tablet >> server. In this case, the (slow) RPCs will dominate over the (fast) >> read of the index, except for very small clusters or very large >> indexes. >> >> Inserting with the BatchWriter is eventually dominated by compactions, >> which is a merge sort, or O(n log n). >> >> -Eric >> >> On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[email protected]> >> wrote: >> > BatchWriter, but I would be interested in the answer assuming a >> > pre-sorted rfile. >> > >> > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[email protected]> >> wrote: >> >> Are you referring to "bulk inserts" as importing a pre-sorted rfile of >> >> Key/Values or usinga BatchWriter? >> >> >> >> On 10/18/12 10:49 AM, Jeff Kubina wrote: >> >>> >> >>> I am deriving the time complexities for an algorithm I implemented in >> >>> Hadoop using Accumulo and need to know the time complexity of bulk >> >>> inserting m records evenly distributed across p nodes into an empty >> >>> table with p tablet servers. Assuming B is the bandwidth of the >> >>> network, would the communication complexity be O(m/B) and the >> >>> computation complexity O(m/p * log(m/p))? If the table contained n >> >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)? >> > >
