Re: What is the fastest way to get data into Cassandra 2 from a Java application?

Sylvain Lebresne Wed, 11 Dec 2013 05:54:39 -0800

On Wed, Dec 11, 2013 at 1:52 PM, Robert Wille <rwi...@fold3.com> wrote:


> Network latency is the reason why the batched query is fastest. One trip
> to Cassandra versus 1000. If you execute the inserts in parallel, then that
> eliminates the latency issue.
>

While it is true that a batch means only one client-server round trip, I'll
note that, provided you use the TokenAware load balancing policy, doing the
parallelization client side will save you some intra-cluster round trips
between nodes, which using a big batch won't. So it might not be all that
clear which one is faster. And very large batches have the disadvantage that
you are more likely to get a timeout (and if you do, you have to retry the
whole batch, even though most of it has probably been inserted correctly).
Overall, the best option is probably to parallelize the inserts of reasonably
sized batches, but the right batch size is likely very use case dependent, so
you'll have to test.
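
As a rough illustration of "parallelizing the inserts of reasonably sized
batches", here is a minimal sketch that uses only the JDK. Everything named in
it is an assumption made for the example: the class ParallelBatchWriter, the
batchSize and maxInFlight parameters, and in particular the sendBatch hook,
which stands in for whatever sends one batch asynchronously (for instance a
wrapper around the driver's executeAsync). It is not the driver's API, and the
right values for both parameters still have to be found by testing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class ParallelBatchWriter {

    /** Splits rows into consecutive chunks of at most batchSize elements. */
    static <T> List<List<T>> chunk(List<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            chunks.add(new ArrayList<>(rows.subList(start, end)));
        }
        return chunks;
    }

    /**
     * Sends every chunk through sendBatch, keeping at most maxInFlight
     * batches outstanding, and blocks until all of them have completed.
     * sendBatch is a hypothetical hook supplied by the caller.
     */
    static <T> void writeAll(List<T> rows, int batchSize, int maxInFlight,
                             Function<List<T>, CompletableFuture<Void>> sendBatch) {
        if (maxInFlight <= 0) {
            throw new IllegalArgumentException("maxInFlight must be positive");
        }
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (List<T> batch : chunk(rows, batchSize)) {
            // Wait here while maxInFlight batches are still outstanding.
            permits.acquireUninterruptibly();
            CompletableFuture<Void> future = sendBatch.apply(batch);
            // Free the slot whether the batch succeeded or failed.
            future.whenComplete((result, error) -> permits.release());
            pending.add(future);
        }
        // Rethrows (wrapped in CompletionException) if any batch failed.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        AtomicInteger written = new AtomicInteger();
        // Stand-in sender that only counts rows; a real one would talk to the cluster.
        writeAll(rows, 50, 8, batch ->
                CompletableFuture.runAsync(() -> written.addAndGet(batch.size())));
        System.out.println("rows written: " + written.get());
    }
}
```

Because each batch is small, a timeout only forces a retry of that one batch
rather than of the whole data set.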

--
Sylvain
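
The methodology points in the quoted message below (warm up first, and keep
one-time work such as preparing the statement out of the timed region) can be
sketched with a small JDK-only helper. The names InsertTimer and timeNanos are
made up for this example, and the operation being timed is a stand-in: in a
real test it would be the insert call, with the statement prepared once before
timeNanos is invoked.

```java
import java.util.function.IntConsumer;

final class InsertTimer {

    /**
     * Runs op untimed for warmupIterations calls, then returns the elapsed
     * nanoseconds for measuredIterations further calls. Any one-time setup
     * must be done by the caller before this method, so it is never timed.
     */
    static long timeNanos(int warmupIterations, int measuredIterations, IntConsumer op) {
        for (int i = 0; i < warmupIterations; i++) {
            op.accept(i); // untimed: lets the JIT and caches settle
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredIterations; i++) {
            op.accept(warmupIterations + i); // timed region only
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with the insert under test.
        StringBuilder sink = new StringBuilder();
        int measured = 100_000;
        long nanos = timeNanos(100_000, measured, i -> sink.append(i).setLength(0));
        System.out.println("average ns per call: " + (nanos / measured));
    }
}
```

The iteration counts here are arbitrary; the point is only that they should be
far larger than 1000 and that the warm-up calls are excluded from the result.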



>
> From: Sylvain Lebresne <sylv...@datastax.com>
> Reply-To: <user@cassandra.apache.org>
> Date: Wednesday, December 11, 2013 at 5:40 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: What is the fastest way to get data into Cassandra 2 from a
> Java application?
>
> Then I suspect that this is an artifact of your test methodology. Prepared
> statements *are* faster than non-prepared ones in general. They save some
> parsing and some bytes on the wire. The savings will tend to be bigger for
> bigger queries, and it's possible that for very small queries (like the one
> you are testing) the performance difference is somewhat negligible, but
> seeing non-prepared statements being significantly faster than prepared
> ones almost surely means you're doing something wrong (of course, a bug in
> either the driver or C* is always possible, and always make sure to test
> recent versions, but I'm not aware of any such bug).
>
> Are you sure you are warming up the JVMs (client and drivers) properly, for
> instance? 1000 iterations is *really small*; if you're not warming things
> up properly, you're not measuring anything relevant. Also, are you including
> the preparation of the query itself in the timing? Preparing a query is not
> particularly fast, but it's meant to be done just once at the beginning of
> the application lifetime. But with only 1000 iterations, if you include the
> preparation in the timing, it's entirely possible it's eating a good chunk
> of the whole time.
>
> But prepared versus non-prepared aside, you won't get proper performance
> unless you parallelize your inserts. Unlogged batches are one way to do it
> (parallelizing is really all Cassandra does with an unlogged batch). But as
> John Sanda mentioned, another option is to do the parallelization client
> side, with executeAsync.
>
> --
> Sylvain
>
>
>
> On Wed, Dec 11, 2013 at 11:37 AM, David Tinker <david.tin...@gmail.com> wrote:
>
>> Yes, that's what I found.
>>
>> This is faster:
>>
>> for (int i = 0; i < 1000; i++) session.execute("INSERT INTO test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')")
>>
>> Than this:
>>
>> def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)")
>> for (int i = 0; i < 1000; i++) session.execute(ps.bind(["" + i, "aa" + i] as Object[]))
>>
>> This is the fastest option of all (hand rolled batch):
>>
>> StringBuilder b = new StringBuilder()
>> b.append("BEGIN UNLOGGED BATCH\n")
>> for (int i = 0; i < 1000; i++) {
>>     b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('")
>>             .append(i).append("','").append("aa").append(i).append("')\n")
>> }
>> b.append("APPLY BATCH\n")
>> session.execute(b.toString())
>>
>>
>> On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne <sylv...@datastax.com>
>> wrote:
>> >
>> >> This loop takes 2500ms or so on my test cluster:
>> >>
>> >> PreparedStatement ps = session.prepare(
>> >>     "INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)")
>> >> for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));
>> >>
>> >> The same loop with the parameters inline is about 1300ms. It gets
>> >> worse if there are many parameters.
>> >
>> >
>> > Do you mean that:
>> >   for (int i = 0; i < 1000; i++)
>> >       session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('"
>> >           + i + "', 'aa" + i + "')");
>> > is twice as fast as using a prepared statement? And that the difference
>> > is even greater if you add more columns than "id" and "info"?
>> >
>> > That would certainly be unexpected. Are you sure you're not re-preparing
>> > the statement every time in the loop?
>> >
>> > --
>> > Sylvain
>> >
>> >> I know I can use batching to
>> >> insert all the rows at once but that's not the purpose of this test. I
>> >> also tried using session.execute(cql, params) and it is faster but
>> >> still doesn't match inline values.
>> >>
>> >> Composing CQL strings is certainly convenient and simple but is there
>> >> a much faster way?
>> >>
>> >> Thanks
>> >> David
>> >>
>> >> I have also posted this on Stackoverflow if anyone wants the points:
>> >>
>> >>
>> >> http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application
>> >
>> >
>>
>>
>>
>> --
>> http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ
>> Integration
>>
>
>
