Re: What is the fastest way to get data into Cassandra 2 from a Java application?

David Tinker Fri, 13 Dec 2013 03:08:09 -0800

I wrote some scripts to test this: https://github.com/davidtinker/cassandra-perf


3 node cluster, each node: Intel® Xeon® E3-1270 v3 Quadcore Haswell
32GB RAM, 1 x 2TB commit log disk, 2 x 4TB data disks (RAID0)

Using a batch of prepared statements is about 6% faster than a batch with inline parameters:

InsertBatchOfPreparedStatements: Inserted 2551704 rows in 100000
batches using 256 concurrent operations in 15.785 secs, 161653 rows/s,
6335 batches/s

InsertInlineBatch: Inserted 2551704 rows in 100000 batches using 256
concurrent operations in 16.712 secs, 152686 rows/s, 5983 batches/s
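
The "256 concurrent operations" in the results above presumably means the client caps the number of requests in flight at any one time. Here is a minimal sketch of that idea in plain Java. It is not the code from the repository linked above. The names BoundedAsyncInserts, runAll and op are hypothetical, and op stands in for a driver call such as session.executeAsync(...) adapted to return a CompletableFuture.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

public class BoundedAsyncInserts {

    /**
     * Starts `total` asynchronous operations while keeping at most
     * `maxInFlight` of them outstanding at any time.
     *
     * `op` is a hypothetical stand-in for an asynchronous driver call
     * (for example an executeAsync on a bound statement or a batch).
     * Returns the number of operations that completed without error.
     */
    public static int runAll(int total, int maxInFlight,
                             IntFunction<CompletableFuture<Void>> op) {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger succeeded = new AtomicInteger();

        for (int i = 0; i < total; i++) {
            // Blocks while maxInFlight operations are still pending.
            permits.acquireUninterruptibly();
            CompletableFuture<Void> future;
            try {
                future = op.apply(i);
            } catch (RuntimeException e) {
                permits.release(); // the operation never started
                throw e;
            }
            future.whenComplete((result, error) -> {
                if (error == null) {
                    succeeded.incrementAndGet();
                }
                permits.release(); // free the slot on success or failure
            });
        }

        // Wait for the tail: all permits are back only when nothing is pending.
        permits.acquireUninterruptibly(maxInFlight);
        permits.release(maxInFlight);
        return succeeded.get();
    }

    public static void main(String[] args) {
        // Simulated insert that completes on another thread.
        int done = runAll(10_000, 256, i -> CompletableFuture.runAsync(() -> { }));
        System.out.println("completed " + done + " simulated inserts");
    }
}
```

The semaphore is what keeps the client from queueing an unbounded number of requests. The right value for maxInFlight depends on the cluster and is best found by measurement.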

On Wed, Dec 11, 2013 at 2:40 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> Then I suspect that this is an artifact of your test methodology. Prepared
> statements *are* faster than non-prepared ones in general. They save some
> parsing and some bytes on the wire. The savings will tend to be bigger for
> bigger queries, and it's possible that for very small queries (like the one
> you are testing) the performance difference is somewhat negligible, but
> seeing non-prepared statements being significantly faster than prepared ones
> almost surely means you're doing something wrong (of course, a bug in either
> the driver or C* is always possible, and always make sure to test recent
> versions, but I'm not aware of any such bug).
>
> Are you sure you are warming up the JVMs (client and drivers) properly, for
> instance? 1000 iterations is *really small*; if you're not warming things
> up properly, you're not measuring anything relevant. Also, are you including
> the preparation of the query itself in the timing? Preparing a query is not
> particularly fast, but it's meant to be done just once at the beginning of
> the application lifetime. But with only 1000 iterations, if you include the
> preparation in the timing, it's entirely possible it's eating a good chunk
> of the whole time.
>
> But other than prepared versus non-prepared, you won't get proper
> performance unless you parallelize your inserts. Unlogged batches are one
> way to do it (it's really all Cassandra does with an unlogged batch:
> parallelizing). But as John Sanda mentioned, another option is to do the
> parallelization client side, with executeAsync.
>
> --
> Sylvain
>
>
>
> On Wed, Dec 11, 2013 at 11:37 AM, David Tinker <david.tin...@gmail.com>
> wrote:
>>
>> Yes, that's what I found.
>>
>> This is faster:
>>
>> for (int i = 0; i < 1000; i++)
>>     session.execute("INSERT INTO test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')")
>>
>> Than this:
>>
>> def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)")
>> for (int i = 0; i < 1000; i++)
>>     session.execute(ps.bind(["" + i, "aa" + i] as Object[]))
>>
>> This is the fastest option of all (hand rolled batch):
>>
>> StringBuilder b = new StringBuilder()
>> b.append("BEGIN UNLOGGED BATCH\n")
>> for (int i = 0; i < 1000; i++) {
>>     b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('")
>>             .append(i).append("','").append("aa").append(i).append("')\n")
>> }
>> b.append("APPLY BATCH\n")
>> session.execute(b.toString())
>>
>>
>> On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne <sylv...@datastax.com>
>> wrote:
>> >
>> >> This loop takes 2500ms or so on my test cluster:
>> >>
>> >> PreparedStatement ps = session.prepare(
>> >>     "INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)")
>> >> for (int i = 0; i < 1000; i++)
>> >>     session.execute(ps.bind("" + i, "aa" + i));
>> >>
>> >> The same loop with the parameters inline is about 1300ms. It gets
>> >> worse if there are many parameters.
>> >
>> >
>> > Do you mean that:
>> >   for (int i = 0; i < 1000; i++)
>> >       session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('"
>> >           + i + "', 'aa" + i + "')");
>> > is twice as fast as using a prepared statement? And that the difference
>> > is even greater if you add more columns than "id" and "info"?
>> >
>> > That would certainly be unexpected. Are you sure you're not re-preparing
>> > the statement every time in the loop?
>> >
>> > --
>> > Sylvain
>> >
>> >> I know I can use batching to
>> >> insert all the rows at once but that's not the purpose of this test. I
>> >> also tried using session.execute(cql, params) and it is faster but
>> >> still doesn't match inline values.
>> >>
>> >> Composing CQL strings is certainly convenient and simple but is there
>> >> a much faster way?
>> >>
>> >> Thanks
>> >> David
>> >>
>> >> I have also posted this on Stack Overflow if anyone wants the points:
>> >>
>> >>
>> >> http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application
>> >
>> >
>>
>>
>>
>> --
>> http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ
>> Integration
>
>



-- 
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
