Re: BulkOutputFormat
Here you go: http://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html

On Fri, Dec 13, 2013 at 7:19 AM, varun allampalli vshoori.off...@gmail.com wrote:

Hi Aaron,

It seems like you answered the question here: https://groups.google.com/forum/#!topic/nosql-databases/vjZA5vdycWA
Can you give me the link to the blog you mentioned? http://thelastpickle.com/2013/01/11/primary-keys-in-cql/

Thanks in advance,
Varun

On Thu, Dec 12, 2013 at 3:36 PM, varun allampalli vshoori.off...@gmail.com wrote:

Thanks Aaron. I was able to generate sstables and load them using sstableloader. But after loading, when I run a select query against the table (which has only one record), I get: "Request did not complete within rpc_timeout." Is there anything I am missing, or any logs I can look at?

On Wed, Dec 11, 2013 at 7:58 PM, Aaron Morton aa...@thelastpickle.com wrote:

If you don't need to use Hadoop, then try SSTableSimpleWriter and sstableloader. This post is a little old but still relevant: http://www.datastax.com/dev/blog/bulk-loading
Otherwise, AFAIK BulkOutputFormat is what you want from Hadoop: http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration

Cheers
-
Aaron Morton
New Zealand
@aaronmorton
Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 12/12/2013, at 11:27 am, varun allampalli vshoori.off...@gmail.com wrote:

Hi All,

I want to bulk insert data into Cassandra, and I was wondering about using BulkOutputFormat in Hadoop. Is that the best way, or is using the driver and doing batch inserts better? Are there any disadvantages to using BulkOutputFormat?

Thanks for helping,
Varun
Re: What is the fastest way to get data into Cassandra 2 from a Java application?
I wrote some scripts to test this: https://github.com/davidtinker/cassandra-perf

3 node cluster, each node: Intel Xeon E3-1270 v3 quad-core Haswell, 32GB RAM, 1 x 2TB commit log disk, 2 x 4TB data disks (RAID0).

Using a batch of prepared statements is about 5% faster than inline parameters:

InsertBatchOfPreparedStatements: Inserted 2551704 rows in 10 batches using 256 concurrent operations in 15.785 secs, 161653 rows/s, 6335 batches/s
InsertInlineBatch: Inserted 2551704 rows in 10 batches using 256 concurrent operations in 16.712 secs, 152686 rows/s, 5983 batches/s

On Wed, Dec 11, 2013 at 2:40 PM, Sylvain Lebresne sylv...@datastax.com wrote:

Then I suspect that this is an artifact of your test methodology. Prepared statements *are* faster than non-prepared ones in general: they save some parsing and some bytes on the wire. The savings tend to be bigger for bigger queries, and it's possible that for very small queries (like the one you are testing) the performance difference is somewhat negligible, but seeing non-prepared statements significantly faster than prepared ones almost surely means you're doing something wrong (of course, a bug in either the driver or C* is always possible, and always make sure to test recent versions, but I'm not aware of any such bug).

Are you sure you are warming up the JVMs (client and driver) properly, for instance? 1000 iterations is *really* small; if you're not warming things up properly, you're not measuring anything relevant. Also, are you including the preparation of the query itself in the timing? Preparing a query is not particularly fast, but it's meant to be done just once at the beginning of the application's lifetime. With only 1000 iterations, if you include the preparation in the timing, it's entirely possible it's eating a good chunk of the whole time.

But other than prepared versus non-prepared, you won't get proper performance unless you parallelize your inserts.
Unlogged batches are one way to do it (that's really all Cassandra does with an unlogged batch: parallelize). But as John Sanda mentioned, another option is to do the parallelization client side, with executeAsync.

--
Sylvain

On Wed, Dec 11, 2013 at 11:37 AM, David Tinker david.tin...@gmail.com wrote:

Yes, that's what I found. This is faster:

for (int i = 0; i < 1000; i++)
    session.execute("INSERT INTO test.wibble (id, info) VALUES ('" + i + "', 'aa" + i + "')")

Than this:

def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)")
for (int i = 0; i < 1000; i++)
    session.execute(ps.bind(["" + i, "aa" + i] as Object[]))

This is the fastest option of all (hand-rolled batch):

StringBuilder b = new StringBuilder()
b.append("BEGIN UNLOGGED BATCH\n")
for (int i = 0; i < 1000; i++) {
    b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('")
     .append(i).append("','aa").append(i).append("')\n")
}
b.append("APPLY BATCH\n")
session.execute(b.toString())

On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne sylv...@datastax.com wrote:

> This loop takes 2500ms or so on my test cluster:
>
> PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)")
> for (int i = 0; i < 1000; i++)
>     session.execute(ps.bind("" + i, "aa" + i));
>
> The same loop with the parameters inline is about 1300ms. It gets worse if there are many parameters.

Do you mean that:

for (int i = 0; i < 1000; i++)
    session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('" + i + "', 'aa" + i + "')");

is twice as fast as using a prepared statement? And that the difference is even greater if you add more columns than id and info? That would certainly be unexpected. Are you sure you're not re-preparing the statement every time in the loop?

--
Sylvain

> I know I can use batching to insert all the rows at once, but that's not the purpose of this test. I also tried using session.execute(cql, params) and it is faster, but it still doesn't match inline values.
Composing CQL strings is certainly convenient and simple, but is there a much faster way?

Thanks
David

I have also posted this on Stackoverflow if anyone wants the points: http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application

--
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
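For readers following the thread: the "hand-rolled batch" David describes is plain string assembly. Below is a minimal sketch of the same idea in Python; the keyspace, table, and values are illustrative only, and in real code you should escape or parameterize values rather than splice raw strings into CQL.

```python
def build_unlogged_batch(ks, rows):
    """Assemble an UNLOGGED BATCH statement from (id, info) pairs.

    Mirrors the StringBuilder approach from the thread: one big CQL
    string submitted in a single execute() call.
    """
    parts = ["BEGIN UNLOGGED BATCH"]
    for id_, info in rows:
        parts.append(
            "INSERT INTO %s.wibble (id, info) VALUES ('%s', '%s')" % (ks, id_, info)
        )
    parts.append("APPLY BATCH")
    return "\n".join(parts)

# Build a small batch, analogous to the 1000-row loop in the thread.
batch = build_unlogged_batch("test", [(i, "aa%d" % i) for i in range(3)])
print(batch)
```

Passing the assembled string to a single session.execute() is what turns a thousand round trips into one, which is where the speedup in the thread comes from.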
Re: Try to configure commitlog_archiving.properties
Hello,

As I told you, I began to explore restore operations. See my config for archiving commit logs:

archive_command=/bin/bash /produits/cassandra/scripts/cassandra-archive.sh %path %name
restore_command=/bin/bash /produits/cassandra/scripts/cassandra-restore.sh %from %to
restore_directories=/produits/cassandra/cassandra_data/archived_commit
restore_point_in_time=2013:12:11 17:00:00

My 2 scripts:

cassandra-archive.sh:
bzip2 --best -k $1
mv $1.bz2 /produits/cassandra/cassandra_data/archived_commit/$2.bz2

cassandra-restore.sh:
cp -f $1 $2
bzip2 -d $2

For example, at 2013:12:11 17:30:00 I truncated a table belonging to a keyspace with no replication, on one node, and after that I ran a nodetool flush. So when I restore to 2013:12:11 17:00:00, I expect my table to be filled up again.

The node restarts correctly with this config, and I see my archived commit logs come back to my commitlog directory. It seems bizarre to me that they end in *.out, like CommitLog-3-1386927339271.log.out, and not just .log. Is that normal? When I query my table now, it is still empty, so my restore doesn't work and I wonder why. Do I have to run the restore on all nodes? My keyspace has no replication, but perhaps the restore needs the same operation on every node. I'm missing something, I don't know what.

Thanks for your help.
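A plausible explanation for the *.log.out names (an assumption worth checking against `man bzip2`): cassandra-restore.sh copies the archived segment to the %to path, which ends in .log, and `bzip2 -d` on a file without a recognised .bz2 suffix cannot infer an output name, so it writes the result to <name>.out. Below is a sketch of a restore helper that avoids the suffix problem by decompressing directly to the %to path with Python's bz2 module. The script name and invocation are hypothetical, and it assumes the archive script stores segments as <name>.bz2 as above.

```python
import bz2
import shutil
import sys

def restore_commitlog(archived_path, target_path):
    """Decompress an archived commit log segment straight to the exact
    path Cassandra supplies as %to, so the restored file keeps its
    .log name instead of gaining a .out suffix from the bzip2 CLI."""
    with bz2.BZ2File(archived_path, "rb") as src, open(target_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Hypothetical invocation from commitlog_archiving.properties:
#   restore_command=/usr/bin/python /produits/cassandra/scripts/cassandra-restore.py %from %to
if __name__ == "__main__" and len(sys.argv) == 3:
    restore_commitlog(sys.argv[1], sys.argv[2])
```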
Re: Try to configure commitlog_archiving.properties
Bonnet Jonathan. jonathan.bonnet at externe.bnpparibas.com writes:

Thanks Artur, you're right, I must comment out the restore directory too. Now I'll try to practice around restore.

Regards,
Bonnet Jonathan.
Re: Try to configure commitlog_archiving.properties
Artur Kronenberg artur.kronenberg at openmarket.com writes:

So, looking at the code:

public void maybeRestoreArchive()
{
    if (Strings.isNullOrEmpty(restoreDirectories))
        return;

    for (String dir : restoreDirectories.split(","))
    {
        File[] files = new File(dir).listFiles();
        if (files == null)
        {
            throw new RuntimeException("Unable to list directory " + dir);
        }
        for (File fromFile : files)
        {
            File toFile = new File(DatabaseDescriptor.getCommitLogLocation(),
                                   new CommitLogDescriptor(CommitLogSegment.getNextId()).fileName());
            String command = restoreCommand.replace("%from", fromFile.getPath());
            command = command.replace("%to", toFile.getPath());
            try
            {
                exec(command);
            }
            catch (IOException e)
            {
                throw new RuntimeException(e);
            }
        }
    }
}

I would like someone to confirm this, but it might potentially be a bug. The method does the right thing for an empty restore directory; however, it ignores the fact that the restore command could be empty. So for you, Jonathan, I reckon you have the restore directory set? You don't need that to be set in order to archive (only if you want to restore). Set your restore_directories property to empty and you should get rid of those errors; the directory only needs to be set when you enable the restore command.

On a second look, I am almost certain this is a bug, as maybeArchive does correctly check that the command is not empty or null. maybeRestoreArchive needs to do the same check for restoreCommand. If someone confirms, I am happy to raise a bug.

cheers,
artur

On 11/12/13 14:09, Bonnet Jonathan wrote:
> Artur Kronenberg artur.kronenberg at openmarket.com writes:
>> hi Bonnet, that doesn't seem to be a problem with your archiving, rather with the restoring. What is your restore command?
>> --
>> artur
Re: Try to configure commitlog_archiving.properties
It's been a while since I tried that, but here are some things I can think of:

* The .log.out seems wrong, unless your Cassandra commit logs do end in .log.out. I tried this locally with your script and my commit logs get extracted to .log files.

* I never tried the restore procedure on a cluster with multiple nodes. I imagine that if you have a quorum defined, the replayed commit log may be ignored, because the commit-log write operation is older than the deletion, in which case the latter (nothing, in your case) will be returned.
Restore with archive commitlog
Hello,

As I told you, I began to explore restore operations. See my config for archiving commit logs:

archive_command=/bin/bash /produits/cassandra/scripts/cassandra-archive.sh %path %name
restore_command=/bin/bash /produits/cassandra/scripts/cassandra-restore.sh %from %to
restore_directories=/produits/cassandra/cassandra_data/archived_commit
restore_point_in_time=2013:12:11 17:00:00

My 2 scripts:

cassandra-archive.sh:
bzip2 --best -k $1
mv $1.bz2 /produits/cassandra/cassandra_data/archived_commit/$2.bz2

cassandra-restore.sh:
cp -f $1 $2
bzip2 -d $2

For example, at 2013:12:11 17:30:00 I truncated a table belonging to a keyspace with no replication, on one node, and after that I ran a nodetool flush. So when I restore to 2013:12:11 17:00:00, I expect my table to be filled up again.

The node restarts correctly with this config, and I see my archived commit logs come back to my commitlog directory. It seems bizarre to me that they end in *.out, like CommitLog-3-1386927339271.log.out, and not just .log. Is that normal? When I query my table now, it is still empty, so my restore doesn't work and I wonder why. Do I have to run the restore on all nodes? My keyspace has no replication, but perhaps the restore needs the same operation on every node. I'm missing something, I don't know what.

Thanks for your help.
Fwd: {kundera-discuss} RE: Kundera 2.9 released
fyi.

---------- Forwarded message ----------
From: Vivek Mishra vivek.mis...@impetus.co.in
Date: Fri, Dec 13, 2013 at 8:54 PM
Subject: {kundera-discuss} RE: Kundera 2.9 released
To: kundera-disc...@googlegroups.com, u...@hbase.apache.org

Hi All,

We are happy to announce the release of Kundera 2.9. Kundera is a JPA 2.0 compliant object-datastore mapping library for NoSQL datastores. The idea behind Kundera is to make working with NoSQL databases drop-dead simple and fun. It currently supports Cassandra, HBase, MongoDB, Redis, OracleNoSQL, Neo4j, ElasticSearch, CouchDB and relational databases.

Major Changes:
==============
1) Support for secondary tables.
2) Support for abstract entities.

Github Bug Fixes:
=================
https://github.com/impetus-opensource/Kundera/issues/455
https://github.com/impetus-opensource/Kundera/issues/448
https://github.com/impetus-opensource/Kundera/issues/447
https://github.com/impetus-opensource/Kundera/issues/443
https://github.com/impetus-opensource/Kundera/pull/442
https://github.com/impetus-opensource/Kundera/issues/404
https://github.com/impetus-opensource/Kundera/issues/388
https://github.com/impetus-opensource/Kundera/issues/283
https://github.com/impetus-opensource/Kundera/issues/263
https://github.com/impetus-opensource/Kundera/issues/120
https://github.com/impetus-opensource/Kundera/issues/103

How to Download:
To download, use or contribute to Kundera, visit: http://github.com/impetus-opensource/Kundera
The latest released tag version is 2.9. Kundera maven libraries are available at: https://oss.sonatype.org/content/repositories/releases/com/impetus
Sample code and examples for using Kundera can be found here: https://github.com/impetus-opensource/Kundera/tree/trunk/src/kundera-tests

Survey/Feedback: http://www.surveymonkey.com/s/BMB9PWG

Thank you all for your contributions and for using Kundera!
Sincerely,
Kundera Team
Re: Pig 0.12.0 and Cassandra 2.0.2
I need to update those to be current with the Cassandra source download. You're right, you would just use what's in the examples directory now for Pig.

You should be able to run the examples, but generally you need to specify the partitioner of the cluster, the host name of a node in the cluster, and the port. Those are the required things to set, either in environment variables or in Hadoop properties. There are other properties you can set as well, such as read/write consistency level. A good place to look for a lot of those properties is the ConfigHelper.java file under the org.apache.cassandra.hadoop package.

Jeremy

On 29 Nov 2013, at 21:05, Jason Lewis jle...@packetnexus.com wrote:

I sent this to the Pig list, but didn't get a response...

I'm trying to get Pig running with Cassandra 2.0.2. The instructions I've been using are here: https://github.com/jeromatron/pygmalion/wiki/Getting-Started

The Cassandra 2.0.2 src does not have a contrib directory. Am I missing something? Should I just be able to use the pig_cassandra in the examples/pig/bin directory? If so, what environment variables do I need to make sure exist? I can't seem to find solid instructions on using Pig with Cassandra. Is there a doc somewhere that I've overlooked?

jas
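For what it's worth, the pig_cassandra wrapper in examples/pig reads its required settings (host, port, partitioner, as Jeremy lists) from environment variables. A sketch of the usual ones; the variable names and values are assumptions taken from the examples/pig README of that era, so verify them against the README shipped with your Cassandra version:

```
export PIG_HOME=/path/to/pig                 # hypothetical Pig install path
export PIG_INITIAL_ADDRESS=localhost         # host name of a node in the cluster
export PIG_RPC_PORT=9160                     # Thrift port
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner
```

The same three cluster settings can alternatively be passed as Hadoop properties; ConfigHelper.java lists the property names.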
Re: Restore with archive commitlog
As someone told you, this feature was added by Netflix to work with Priam (a Cassandra management tool). Priam itself has only been using it for several months, so I doubt anybody uses this feature in production. Anyway, you can ping the guys working on Priam; that's your best bet: https://github.com/Netflix/Priam

Let us know if you can figure out how to use it.

Thank you,
Andrey
Issues while fetching data with pycassa get for super columns
Hi Folks,

I am having an issue fetching data using pycassa's get() function. The CF schema and my code are below. The query returns just this:

Results: {u'narrativebuddieswin': ['609548930995445799_752368319', '609549303525138481_752368319', '610162034020180814_752368319', '610162805856002905_752368319', '610163571417146213_752368319', '610165900312830861_752368319']}

None of the subcolumns are returned for the above super column. Please help.

CODE:

if start:
    res_rows = col_fam.get(key, column_count=count, column_start=start,
                           include_timestamp=True, include_ttl=True)
else:
    res_rows = col_fam.get(key, column_count=count,
                           include_timestamp=True, include_ttl=True)
return res_rows

CF Schema:

'Twitter_Instagram': CfDef(
    comment='',
    key_validation_class='org.apache.cassandra.db.marshal.UTF8Type',
    min_compaction_threshold=4,
    key_cache_save_period_in_seconds=None,
    gc_grace_seconds=864000,
    default_validation_class='org.apache.cassandra.db.marshal.UTF8Type',
    max_compaction_threshold=32,
    read_repair_chance=0.10001,
    compression_options={'sstable_compression': 'org.apache.cassandra.io.compress.SnappyCompressor'},
    bloom_filter_fp_chance=None,
    id=None,
    keyspace='Narrative',
    key_cache_size=None,
    replicate_on_write=True,
    subcomparator_type='org.apache.cassandra.db.marshal.BytesType',
    merge_shards_chance=None,
    row_cache_provider=None,
    row_cache_save_period_in_seconds=None,
    column_type='Super',
    memtable_throughput_in_mb=None,
    memtable_flush_after_mins=None,
    column_metadata={
        'expanded_url': ColumnDef(index_type=None, index_name=None, validation_class='org.apache.cassandra.db.marshal.UTF8Type', name='expanded_url', index_options=None),
        'favorite': ColumnDef(index_type=None, index_name=None, validation_class='org.apache.cassandra.db.marshal.IntegerType', name='favorite', index_options=None),
        'retweet': ColumnDef(index_type=None, index_name=None, validation_class='org.apache.cassandra.db.marshal.IntegerType', name='retweet', index_options=None),
        'iid': ColumnDef(index_type=None, index_name=None, validation_class='org.apache.cassandra.db.marshal.UTF8Type', name='iid', index_options=None),
        'screen_name': ColumnDef(index_type=None, index_name=None, validation_class='org.apache.cassandra.db.marshal.UTF8Type', name='screen_name', index_options=None),
        'embedly_data': ColumnDef(index_type=None, index_name=None, validation_class='org.apache.cassandra.db.marshal.BytesType', name='embedly_data', index_options=None),
        'created_date': ColumnDef(index_type=None, index_name=None, validation_class='org.apache.cassandra.db.marshal.UTF8Type', name='created_date', index_options=None),
        'tid': ColumnDef(index_type=None, index_name=None, validation_class='org.apache.cassandra.db.marshal.UTF8Type', name='tid', index_options=None),
        'score': ColumnDef(index_type=None, index_name=None, validation_class='org.apache.cassandra.db.marshal.IntegerType', name='score', index_options=None),
        'approved': ColumnDef(index_type=None, index_name=None, validation_class='org.apache.cassandra.db.marshal.UTF8Type', name='approved', index_options=None),
        'likes': ColumnDef(index_type=None, index_name=None, validation_class='org.apache.cassandra.db.marshal.IntegerType', name='likes', index_options=None)},
    key_alias=None,
    dclocal_read_repair_chance=0.0,
    name='Twitter_Instagram',
    compaction_strategy_options={},
    row_cache_keys_to_save=None,
    compaction_strategy='org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
    memtable_operations_in_millions=None,
    caching='KEYS_ONLY',
    comparator_type='org.apache.cassandra.db.marshal.BytesType',
    row_cache_size=None)
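A hedged note for anyone hitting this: pycassa's ColumnFamily.get() accepts a super_column parameter, and passing it (e.g. col_fam.get(key, super_column='narrativebuddieswin')) is the documented way to get the subcolumns of one super column back as a mapping rather than a list of names. The stdlib-only stand-in below illustrates the nested {super_column: {subcolumn: value}} shape such a read is expected to produce; the subcolumn keys are taken from the post, but every value is invented for illustration.

```python
# Stand-in for the nested structure a super-column read should return:
# {super_column: {subcolumn: value}}. Keys come from the post above;
# the values are invented for illustration only.
result = {
    "narrativebuddieswin": {
        "609548930995445799_752368319": {"screen_name": "alice", "score": 1},
        "609549303525138481_752368319": {"screen_name": "bob", "score": 2},
    }
}

# Reading the subcolumns of one super column, analogous to
# col_fam.get(key, super_column="narrativebuddieswin") in pycassa:
subcolumns = result["narrativebuddieswin"]
print(sorted(subcolumns))
```

If get() returns only subcolumn names (as in the post), comparing against this expected shape makes it easier to see what the query is actually missing.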
Re: BulkOutputFormat
Thanks Rahul, the article was insightful.

On Fri, Dec 13, 2013 at 12:25 AM, Rahul Menon ra...@apigee.com wrote:
> Here you go: http://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html
Re: One big table/cf or many small ones?
Hi Tinus,

On 12 Dec 2013, at 6:59 pm, Tinus Sky tinus...@gmail.com wrote:

> My service has users who can add a message to a list. The list of messages is sorted by date and displayed. When a user changes a message, its date changes and the message moves to the top of the list. A possible solution is to remove the row and insert it again, but I suspect this might not be the best solution. Is there an alternative?

Just insert a new version of the message, with a version field, into the table/column family. When outputting query results to the user, filter/hide older versions. And bam, you have a new feature in your web site called "show old versions" or "show previous edits".

> My second question is regarding the number of tables/column families in a keyspace. I can create one table that contains all messages from all users. But I can also create a table per user, named something like messages_[userid], where [userid] is the id of the user. Or I can shard: messages_a (messages from users whose names start with a), messages_b (users whose names start with b), and so on. My user count is around 100,000, and there are roughly 20,000 messages per user. What would be the choice: put everything in one big table, or go with many small tables?

I'm pretty sure you don't want to put 100,000 tables in a Cassandra keyspace. Go with one.

Regards,
Jacob
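The versioned-message approach Jacob describes can be sketched in CQL. All table and column names below are invented for illustration; clustering by version DESC keeps the newest version of each message first, so older versions can be filtered out at display time (or shown for a "previous edits" feature).

```sql
-- Hypothetical schema: one table for every user's messages.
-- Editing a message inserts a row with a higher version number
-- instead of deleting and re-inserting the old row.
CREATE TABLE messages (
    user_id    text,
    message_id timeuuid,
    version    int,
    updated_at timestamp,
    body       text,
    PRIMARY KEY (user_id, message_id, version)
) WITH CLUSTERING ORDER BY (message_id DESC, version DESC);
```

One caveat: clustering here orders by message and version, not by last-update date, so displaying the list sorted by updated_at would still need client-side sorting or a companion table keyed by date.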