@eric, assuming the records are evenly distributed and network bandwidth is not an issue, shouldn't that be O(n/p) + O(p) and O(n/p * log(n/p))?
On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <[email protected]> wrote:
> Adding a sorted file to accumulo (bulk loading) is essentially
> constant in the normal case. It is O(n) + O(p) for the worst case
> where the index must be read, and the file assigned to every tablet
> server. In this case, the (slow) RPCs will dominate over the (fast)
> read of the index, except for very small clusters or very large
> indexes.
>
> Inserting with the BatchWriter is eventually dominated by compactions,
> which is a merge sort, or O(n log n).
>
> -Eric
>
> On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[email protected]> wrote:
> > BatchWriter, but I would be interested in the answer assuming a
> > pre-sorted rfile.
> >
> > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[email protected]> wrote:
> >> Are you referring to "bulk inserts" as importing a pre-sorted rfile of
> >> Key/Values or using a BatchWriter?
> >>
> >> On 10/18/12 10:49 AM, Jeff Kubina wrote:
> >>>
> >>> I am deriving the time complexities for an algorithm I implemented in
> >>> Hadoop using Accumulo and need to know the time complexity of bulk
> >>> inserting m records evenly distributed across p nodes into an empty
> >>> table with p tablet servers. Assuming B is the bandwidth of the
> >>> network, would the communication complexity be O(m/B) and the
> >>> computation complexity O(m/p * log(m/p))? If the table contained n
> >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
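
For a back-of-the-envelope comparison of the two cost models in this thread, here is a minimal Python sketch. The unit costs (c_index, c_rpc, c_cmp) are invented placeholders to show the shape of the curves, not measured Accumulo constants:

```python
import math

def bulk_load_cost(n, p, c_index=1.0, c_rpc=100.0):
    # Worst case for bulk import: read the index, O(n/p) per server,
    # plus one (slow) RPC per tablet server, O(p).
    return c_index * (n / p) + c_rpc * p

def batch_writer_cost(n, p, c_cmp=1.0):
    # BatchWriter ingest is eventually dominated by compactions,
    # a merge sort over each server's n/p records: O(n/p * log(n/p)).
    per_server = n / p
    return c_cmp * per_server * math.log2(per_server)

if __name__ == "__main__":
    n, p = 10_000_000, 100
    print("bulk load:   ", bulk_load_cost(n, p))
    print("batch writer:", batch_writer_cost(n, p))
```

With these toy constants the O(p) RPC term only matters when p is large relative to n/p, matching Eric's point that RPCs dominate only for very small clusters or very large indexes.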
