@eric, assuming the records are evenly distributed and network bandwidth is not an issue, shouldn't that be O(n/p) + O(p) and O(n/p * log(n/p))?
On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <[email protected]> wrote:
> Adding a sorted file to accumulo (bulk loading) is essentially
> constant in the normal case. It is O(n) + O(p) for the worst case
> where the index must be read, and the file assigned to every tablet
> server. In this case, the (slow) RPCs will dominate over the (fast)
> read of the index, except for very small clusters or very large
> indexes.
>
> Inserting with the BatchWriter is eventually dominated by compactions,
> which is a merge sort, or O(n log n).
>
> -Eric
>
> On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[email protected]> wrote:
> > BatchWriter, but I would be interested in the answer assuming a
> > pre-sorted rfile.
> >
> > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[email protected]> wrote:
> >> Are you referring to "bulk inserts" as importing a pre-sorted rfile of
> >> Key/Values or using a BatchWriter?
> >>
> >> On 10/18/12 10:49 AM, Jeff Kubina wrote:
> >>>
> >>> I am deriving the time complexities for an algorithm I implemented in
> >>> Hadoop using Accumulo and need to know the time complexity of bulk
> >>> inserting m records evenly distributed across p nodes into an empty
> >>> table with p tablet servers. Assuming B is the bandwidth of the
> >>> network, would the communication complexity be O(m/B) and the
> >>> computation complexity O(m/p * log(m/p))? If the table contained n
> >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
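
For a back-of-the-envelope comparison of the two cost models in this thread, here is a minimal Python sketch. The unit costs (c_index, c_rpc, c_cmp) are invented placeholders to show the shape of the curves, not measured Accumulo constants:

```python
import math

def bulk_load_cost(n, p, c_index=1.0, c_rpc=100.0):
    # Worst case for bulk import: read the index, O(n/p) per server,
    # plus one (slow) RPC per tablet server, O(p).
    return c_index * (n / p) + c_rpc * p

def batch_writer_cost(n, p, c_cmp=1.0):
    # BatchWriter ingest is eventually dominated by compactions,
    # a merge sort over each server's n/p records: O(n/p * log(n/p)).
    per_server = n / p
    return c_cmp * per_server * math.log2(per_server)

if __name__ == "__main__":
    n, p = 10_000_000, 100
    print("bulk load:   ", bulk_load_cost(n, p))
    print("batch writer:", batch_writer_cost(n, p))
```

With these toy constants the O(p) RPC term only matters when p is large relative to n/p, matching Eric's point that RPCs dominate only for very small clusters or very large indexes.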
