Re: Bulkoutputformat

2013-12-13 Thread Rahul Menon
Here you go

http://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html


On Fri, Dec 13, 2013 at 7:19 AM, varun allampalli
vshoori.off...@gmail.com wrote:

 Hi Aaron,

 It seems like you answered the question here.

 https://groups.google.com/forum/#!topic/nosql-databases/vjZA5vdycWA

 Can you give me the link to the blog which you mentioned?

 http://thelastpickle.com/2013/01/11/primary-keys-in-cql/

 Thanks in advance
 Varun


 On Thu, Dec 12, 2013 at 3:36 PM, varun allampalli 
 vshoori.off...@gmail.com wrote:

 Thanks Aaron, I was able to generate sstables and load them using
 sstableloader. But after loading, when I run a select query I get the error
 below; the table has only one record. Is there anything I am missing, or any
 logs I can look at?

 Request did not complete within rpc_timeout.


 On Wed, Dec 11, 2013 at 7:58 PM, Aaron Morton aa...@thelastpickle.com wrote:

 If you don’t need to use Hadoop then try SSTableSimpleWriter and
 sstableloader; this post is a little old but still relevant:
 http://www.datastax.com/dev/blog/bulk-loading
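 The gist of that post, for reference (a sketch of the 1.1/1.2-era API it
 describes; the keyspace, column family and column names here are
 placeholders):

    import java.io.File;
    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.dht.RandomPartitioner;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class SSTableWriterSketch {
        public static void main(String[] args) throws Exception {
            // Writes sstables under /tmp/Keyspace1/Standard1; afterwards,
            // point sstableloader at that directory to stream them in.
            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    new File("/tmp/Keyspace1/Standard1"),
                    new RandomPartitioner(),
                    "Keyspace1", "Standard1",
                    AsciiType.instance, null,
                    64);                                   // buffer size in MB
            long timestamp = System.currentTimeMillis() * 1000; // microseconds
            for (int i = 0; i < 1000; i++) {
                writer.newRow(ByteBufferUtil.bytes("key" + i));
                writer.addColumn(ByteBufferUtil.bytes("col"),
                                 ByteBufferUtil.bytes("value" + i), timestamp);
            }
            writer.close();
        }
    }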

 Otherwise, AFAIK, BulkOutputFormat is what you want from Hadoop:
 http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration
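 For the Hadoop route, the job wiring is roughly the following (a sketch
 against the 1.1/1.2-era org.apache.cassandra.hadoop classes; keyspace, host
 and partitioner values are placeholders, and the reducer itself, which emits
 (ByteBuffer key, List<Mutation>) pairs, is omitted):

    import org.apache.cassandra.hadoop.BulkOutputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class BulkLoadJobSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            ConfigHelper.setOutputColumnFamily(conf, "Keyspace1", "Standard1");
            ConfigHelper.setOutputInitialAddress(conf, "127.0.0.1");
            ConfigHelper.setOutputRpcPort(conf, "9160");
            ConfigHelper.setOutputPartitioner(conf,
                    "org.apache.cassandra.dht.RandomPartitioner");
            Job job = new Job(conf, "bulk-load-sketch");
            // BulkOutputFormat builds sstables locally and streams them to
            // the cluster instead of sending row-by-row mutations.
            job.setOutputFormatClass(BulkOutputFormat.class);
        }
    }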

 Cheers

  -
 Aaron Morton
 New Zealand
 @aaronmorton

 Co-Founder & Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com

 On 12/12/2013, at 11:27 am, varun allampalli vshoori.off...@gmail.com
 wrote:

 Hi All,

 I want to bulk insert data into Cassandra. I was wondering about using
 BulkOutputFormat in Hadoop. Is that the best way, or is using the driver and
 doing batch inserts the better way?

 Are there any disadvantages to using BulkOutputFormat?

 Thanks for helping

 Varun







Re: What is the fastest way to get data into Cassandra 2 from a Java application?

2013-12-13 Thread David Tinker
I wrote some scripts to test this: https://github.com/davidtinker/cassandra-perf

3 node cluster, each node: Intel® Xeon® E3-1270 v3 Quadcore Haswell
32GB RAM, 1 x 2TB commit log disk, 2 x 4TB data disks (RAID0)

Using a batch of prepared statements is about 5% faster than inline parameters:

InsertBatchOfPreparedStatements: Inserted 2551704 rows in 10
batches using 256 concurrent operations in 15.785 secs, 161653 rows/s,
6335 batches/s

InsertInlineBatch: Inserted 2551704 rows in 10 batches using 256
concurrent operations in 16.712 secs, 152686 rows/s, 5983 batches/s
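The batch-of-prepared-statements case boils down to roughly the sketch below
(assuming the DataStax Java driver 2.0 BatchStatement API; this is not the
exact benchmark code, see the github repo for that):

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class PreparedBatchSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            PreparedStatement ps = session.prepare(
                    "INSERT INTO test.wibble (id, info) VALUES (?, ?)");
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            for (int i = 0; i < 1000; i++)
                batch.add(ps.bind("" + i, "aa" + i));
            session.execute(batch);   // one round trip for the whole batch
            cluster.close();
        }
    }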

On Wed, Dec 11, 2013 at 2:40 PM, Sylvain Lebresne sylv...@datastax.com wrote:
 Then I suspect that this is an artifact of your test methodology. Prepared
 statements *are* faster than non-prepared ones in general. They save some
 parsing and some bytes on the wire. The savings tend to be bigger for
 bigger queries, and it's possible that for very small queries (like the one
 you are testing) the performance difference is somewhat negligible, but
 seeing non-prepared statements being significantly faster than prepared ones
 almost surely means you're doing something wrong (of course, a bug in either
 the driver or C* is always possible, and always make sure to test recent
 versions, but I'm not aware of any such bug).

 Are you sure you are warming up the JVMs (client and drivers) properly, for
 instance? 1000 iterations is *really small*; if you're not warming things
 up properly, you're not measuring anything relevant. Also, are you including
 the preparation of the query itself in the timing? Preparing a query is not
 particularly fast, but it's meant to be done just once at the beginning of
 the application's lifetime. With only 1000 iterations, if you include the
 preparation in the timing, it's entirely possible it's eating a good chunk
 of the whole time.

 But prepared versus non-prepared aside, you won't get proper performance
 unless you parallelize your inserts. Unlogged batches are one way to do it
 (that's really all Cassandra does with an unlogged batch: parallelizing). But
 as John Sanda mentioned, another option is to do the parallelization client
 side, with executeAsync.
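 A sketch of that executeAsync approach (driver 2.0-era API plus Guava; the
 semaphore capping in-flight requests at 256 and the table are illustrative):

    import java.util.concurrent.Semaphore;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;

    public class AsyncInsertSketch {
        public static void main(String[] args) throws InterruptedException {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            PreparedStatement ps = session.prepare(
                    "INSERT INTO test.wibble (id, info) VALUES (?, ?)");
            final Semaphore inFlight = new Semaphore(256);
            for (int i = 0; i < 100000; i++) {
                inFlight.acquire();          // block while 256 writes are pending
                ResultSetFuture f = session.executeAsync(ps.bind("" + i, "aa" + i));
                Futures.addCallback(f, new FutureCallback<ResultSet>() {
                    public void onSuccess(ResultSet rs) { inFlight.release(); }
                    public void onFailure(Throwable t) { inFlight.release(); t.printStackTrace(); }
                });
            }
            inFlight.acquire(256);           // wait for the tail to drain
            cluster.close();
        }
    }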

 --
 Sylvain



 On Wed, Dec 11, 2013 at 11:37 AM, David Tinker david.tin...@gmail.com
 wrote:

 Yes, that's what I found.

 This is faster:

 for (int i = 0; i < 1000; i++) session.execute("INSERT INTO
 test.wibble (id, info) VALUES ('${"" + i}', '${"aa" + i}')")

 Than this:

 def ps = session.prepare("INSERT INTO test.wibble (id, info) VALUES (?, ?)")
 for (int i = 0; i < 1000; i++) session.execute(ps.bind(["" + i, "aa" + i] as Object[]))

 This is the fastest option of all (hand rolled batch):

 StringBuilder b = new StringBuilder()
 b.append("BEGIN UNLOGGED BATCH\n")
 for (int i = 0; i < 1000; i++) {
     b.append("INSERT INTO ").append(ks).append(".wibble (id, info) VALUES ('")
      .append(i).append("','")
      .append("aa").append(i).append("')\n")
 }
 b.append("APPLY BATCH\n")
 session.execute(b.toString())


 On Wed, Dec 11, 2013 at 10:56 AM, Sylvain Lebresne sylv...@datastax.com
 wrote:
 
  This loop takes 2500ms or so on my test cluster:
 
  PreparedStatement ps = session.prepare("INSERT INTO perf_test.wibble (id, info) VALUES (?, ?)")
  for (int i = 0; i < 1000; i++) session.execute(ps.bind("" + i, "aa" + i));
 
  The same loop with the parameters inline is about 1300ms. It gets
  worse if there are many parameters.
 
 
  Do you mean that:
    for (int i = 0; i < 1000; i++)
        session.execute("INSERT INTO perf_test.wibble (id, info) VALUES ('" + i + "', 'aa" + i + "')");
  is twice as fast as using a prepared statement? And that the difference
  is even greater if you add more columns than id and info?

  That would certainly be unexpected; are you sure you're not re-preparing
  the statement every time in the loop?
 
  --
  Sylvain
 
  I know I can use batching to
  insert all the rows at once but that's not the purpose of this test. I
  also tried using session.execute(cql, params) and it is faster but
  still doesn't match inline values.
 
  Composing CQL strings is certainly convenient and simple but is there
  a much faster way?
 
  Thanks
  David
 
  I have also posted this on Stackoverflow if anyone wants the points:
 
 
  http://stackoverflow.com/questions/20491090/what-is-the-fastest-way-to-get-data-into-cassandra-2-from-a-java-application
 
 



 --
 http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ
 Integration





-- 
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration


Re: Try to configure commitlog_archiving.properties

2013-12-13 Thread Bonnet Jonathan .
Hello,

  As I told you, I began to explore restore operations; here is my config for
archiving commit logs:

archive_command=/bin/bash /produits/cassandra/scripts/cassandra-archive.sh %path %name

restore_command=/bin/bash /produits/cassandra/scripts/cassandra-restore.sh %from %to

restore_directories=/produits/cassandra/cassandra_data/archived_commit

restore_point_in_time=2013:12:11 17:00:00

My 2 scripts:

cassandra-archive.sh:

bzip2 --best -k $1
mv $1.bz2 /produits/cassandra/cassandra_data/archived_commit/$2.bz2


cassandra-restore.sh:

cp -f $1 $2
bzip2 -d $2


For example, at 2013:12:11 17:30:00 I truncated a table which belongs to a
keyspace with no replication, on one node; after that I ran a nodetool flush.
So when I restore to 2013:12:11 17:00:00, I expect my table to be filled up
again.

The node restarts correctly with this config, and I see my archived commit
logs come back to my commitlog directory. It seems bizarre to me that they
end in *.out, like CommitLog-3-1386927339271.log.out, and not just .log.
Is that normal?

When I query my table now, it is still empty. So my restore doesn't work,
and I wonder why.

Do I have to run the restore on all nodes? My keyspace has no replication,
but perhaps the restore needs the same operation on all nodes.

I'm missing something, I don't know what.

Thanks for your help.







Re: Try to configure commitlog_archiving.properties

2013-12-13 Thread Bonnet Jonathan .

Bonnet Jonathan. jonathan.bonnet at externe.bnpparibas.com writes:

 
 Thanks Artur,
 
  You're right, I must comment out the restore directory too.
 
  Now I'll try to practice restoring.
 
 Regards,
 
 Bonnet Jonathan.
 
 








Re: Try to configure commitlog_archiving.properties

2013-12-13 Thread Bonnet Jonathan .
Artur Kronenberg artur.kronenberg at openmarket.com writes:

 
 So, looking at the code:
 
     public void maybeRestoreArchive()
     {
         if (Strings.isNullOrEmpty(restoreDirectories))
             return;
 
         for (String dir : restoreDirectories.split(","))
         {
             File[] files = new File(dir).listFiles();
             if (files == null)
             {
                 throw new RuntimeException("Unable to list director " + dir);
             }
             for (File fromFile : files)
             {
                 File toFile = new File(DatabaseDescriptor.getCommitLogLocation(),
                                        new CommitLogDescriptor(CommitLogSegment.getNextId()).fileName());
                 String command = restoreCommand.replace("%from", fromFile.getPath());
                 command = command.replace("%to", toFile.getPath());
                 try
                 {
                     exec(command);
                 }
                 catch (IOException e)
                 {
                     throw new RuntimeException(e);
                 }
             }
         }
     }
 
 I would like someone to confirm this, but it might be a bug.
 It does the right thing for an empty restore directory, but it
 ignores the fact that the restore command could be empty.
 So for you, Jonathan, I reckon you have the restore directory set? You
 don't need that to be set in order to archive (only if you want to
 restore). So set your restore_directories property to empty and you
 should get rid of those errors. The directory only needs to be set when
 you enable the restore command.
 
 On a second look, I am almost certain this is a bug, as the maybeArchive
 method does correctly check for the command to not be empty or null.
 The maybeRestoreArchive method needs to do the same thing for the
 restoreCommand. If someone confirms, I am happy to raise a bug.
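 If confirmed, the fix would presumably mirror that check; a sketch of the
 guard (illustrative, not a committed patch):
 
     public void maybeRestoreArchive()
     {
         // bail out when either property is unset, as maybeArchive
         // already does for the archive command
         if (Strings.isNullOrEmpty(restoreDirectories) || Strings.isNullOrEmpty(restoreCommand))
             return;
         // ... rest of the method unchanged
     }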
 
 cheers,
 
 artur
 
 On 11/12/13 14:09, Bonnet Jonathan. wrote:
  Artur Kronenberg artur.kronenberg at openmarket.com writes:
 
 
   hi Bonnet,
   that doesn't seem to be a problem with your archiving, rather with
   the restoring. What is your restore command?
   -- artur








Re: Try to configure commitlog_archiving.properties

2013-12-13 Thread Artur Kronenberg
It's been a while since I tried that but here are some things I can 
think of:


* The .log.out extension seems wrong, unless your Cassandra commitlogs
normally end in .log.out. I tried this locally with your script and my
commitlogs get extracted to .log files.
* I never tried the restore procedure on a cluster with multiple nodes.
I imagine if you have a quorum defined, the replayed commitlog may be
ignored because the commitlog write operation is older than the deletion,
in which case the latter will be returned (nothing, in your case).











Restore with archive commitlog

2013-12-13 Thread Bonnet Jonathan .
Hello,

  As I told you, I began to explore restore operations; here is my config for
archiving commit logs:

archive_command=/bin/bash /produits/cassandra/scripts/cassandra-archive.sh %path %name

restore_command=/bin/bash /produits/cassandra/scripts/cassandra-restore.sh %from %to

restore_directories=/produits/cassandra/cassandra_data/archived_commit

restore_point_in_time=2013:12:11 17:00:00

My 2 scripts:

cassandra-archive.sh:

bzip2 --best -k $1
mv $1.bz2 /produits/cassandra/cassandra_data/archived_commit/$2.bz2


cassandra-restore.sh:

cp -f $1 $2
bzip2 -d $2


For example, at 2013:12:11 17:30:00 I truncated a table which belongs to a
keyspace with no replication, on one node; after that I ran a nodetool flush.
So when I restore to 2013:12:11 17:00:00, I expect my table to be filled up
again.

The node restarts correctly with this config, and I see my archived commit
logs come back to my commitlog directory. It seems bizarre to me that they
end in *.out, like CommitLog-3-1386927339271.log.out, and not just .log.
Is that normal?

When I query my table now, it is still empty. So my restore doesn't work,
and I wonder why.

Do I have to run the restore on all nodes? My keyspace has no replication,
but perhaps the restore needs the same operation on all nodes.

I'm missing something, I don't know what.

Thanks for your help.




Fwd: {kundera-discuss} RE: Kundera 2.9 released

2013-12-13 Thread Vivek Mishra
fyi.

-- Forwarded message --
From: Vivek Mishra vivek.mis...@impetus.co.in
Date: Fri, Dec 13, 2013 at 8:54 PM
Subject: {kundera-discuss} RE: Kundera 2.9 released
To: kundera-disc...@googlegroups.com kundera-disc...@googlegroups.com, 
u...@hbase.apache.org u...@hbase.apache.org


Hi All,

We are happy to announce the release of Kundera 2.9.

Kundera is a JPA 2.0 compliant object-datastore mapping library for NoSQL
datastores. The idea behind Kundera is to make working with NoSQL databases
drop-dead simple and fun. It currently supports Cassandra, HBase, MongoDB,
Redis, OracleNoSQL, Neo4j, ElasticSearch, CouchDB and relational databases.

Major Changes:
==
1) Support for secondary tables (see the JPA sketch below).
2) Support for abstract entities.
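
For example, a JPA entity mapped across a primary and a secondary table looks
roughly like this (standard javax.persistence annotations; the entity and
table names are made up, and Kundera's exact attribute support is
version-dependent):

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.SecondaryTable;
    import javax.persistence.Table;

    @Entity
    @Table(name = "USER_CORE")
    @SecondaryTable(name = "USER_DETAIL")
    public class User {
        @Id
        private String userId;

        @Column(name = "NAME")                            // lives in USER_CORE
        private String name;

        @Column(name = "ADDRESS", table = "USER_DETAIL")  // lives in the secondary table
        private String address;

        // getters and setters omitted for brevity
    }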

Github Bug Fixes:
===
https://github.com/impetus-opensource/Kundera/issues/455
https://github.com/impetus-opensource/Kundera/issues/448
https://github.com/impetus-opensource/Kundera/issues/447
https://github.com/impetus-opensource/Kundera/issues/443
https://github.com/impetus-opensource/Kundera/pull/442
https://github.com/impetus-opensource/Kundera/issues/404
https://github.com/impetus-opensource/Kundera/issues/388
https://github.com/impetus-opensource/Kundera/issues/283
https://github.com/impetus-opensource/Kundera/issues/263
https://github.com/impetus-opensource/Kundera/issues/120
https://github.com/impetus-opensource/Kundera/issues/103

How to Download:
To download, use or contribute to Kundera, visit:
http://github.com/impetus-opensource/Kundera

The latest released tag version is 2.9. Kundera maven libraries are now
available at:
https://oss.sonatype.org/content/repositories/releases/com/impetus

Sample codes and examples for using Kundera can be found here:
https://github.com/impetus-opensource/Kundera/tree/trunk/src/kundera-tests

Survey/Feedback:
http://www.surveymonkey.com/s/BMB9PWG

Thank you all for your contributions and using Kundera!

Sincerely,
Kundera Team










Re: Pig 0.12.0 and Cassandra 2.0.2

2013-12-13 Thread Jeremy Hanna
I need to update those to be current with the Cassandra source download.
You’re right, you would just use what’s in the examples directory now for Pig.
You should be able to run the examples, but generally you need to specify the
partitioner of the cluster, the host name of a node in the cluster, and the
port. Those are the required settings, made either through environment
variables or through Hadoop properties. There are other properties you can
set as well, such as read/write consistency level. A good place to look for a
lot of those properties is the ConfigHelper.java file under the
org.apache.cassandra.hadoop package; a sketch follows below.
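
For example, a client-side Hadoop job would set those required properties
roughly like this (a sketch; these ConfigHelper setters are from the 1.2-era
org.apache.cassandra.hadoop package, and the host/partitioner values are
placeholders):

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;

    public class PigCassandraConfSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // the three required settings: node address, port, partitioner
            ConfigHelper.setInputInitialAddress(conf, "10.0.0.1");
            ConfigHelper.setInputRpcPort(conf, "9160");
            ConfigHelper.setInputPartitioner(conf,
                    "org.apache.cassandra.dht.Murmur3Partitioner");
            // optional tunables, e.g. read consistency level
            ConfigHelper.setReadConsistencyLevel(conf, "ONE");
        }
    }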

Jeremy

On 29 Nov 2013, at 21:05, Jason Lewis jle...@packetnexus.com wrote:

 I sent this to the Pig list, but didn't get a response...
 
 I'm trying to get Pig running with Cassandra 2.0.2.  The instructions
 I've been using are here:
 
 https://github.com/jeromatron/pygmalion/wiki/Getting-Started
 
 The cassandra 2.0.2 src does not have a contrib directory.  Am I
 missing something?  Should I just be able to use the pig_cassandra
 in the examples/pig/bin directory?  If so, what environment variables
 do I need to make sure exist?
 
 I can't seem to find solid instructions on using pig with cassandra,
 is there a doc somewhere that I've overlooked?
 
 jas



Re: Restore with archive commitlog

2013-12-13 Thread Andrey Ilinykh
As someone told you, this feature was added by Netflix to work with Priam
(a Cassandra management tool). Priam itself has only been using it for a few
months, so I doubt anybody uses this feature in production. Anyway, you can
ping the guys working on Priam; that is your best bet.
https://github.com/Netflix/Priam

Let us know if you figure out how to use it.

Thank you,
  Andrey







Issues while fetching data with pycassa get for super columns

2013-12-13 Thread Kumar Ranjan
Hi Folks - I am having an issue fetching data using the pycassa get()
function. I have copied the CF schema, and my code is below. The query
returns me just this:

Results: {u'narrativebuddieswin': ['609548930995445799_752368319',
'609549303525138481_752368319', '610162034020180814_752368319',
'610162805856002905_752368319', '610163571417146213_752368319',
'610165900312830861_752368319']}

None of the subcolumns are returned for the above super column. Please help.


CODE: -

if start:
    res_rows = col_fam.get(key, column_count=count, column_start=start,
                           include_timestamp=True, include_ttl=True)
else:
    res_rows = col_fam.get(key, column_count=count,
                           include_timestamp=True, include_ttl=True)

return res_rows




CF Schema: 

'Twitter_Instagram':

CfDef(comment='',

  key_validation_class='org.apache.cassandra.db.marshal.UTF8Type',

  min_compaction_threshold=4,

  key_cache_save_period_in_seconds=None,

  gc_grace_seconds=864000,

  default_validation_class='org.apache.cassandra.db.marshal.UTF8Type',

  max_compaction_threshold=32,

  read_repair_chance=0.10001,

  compression_options={'sstable_compression':
'org.apache.cassandra.io.compress.SnappyCompressor'},

  bloom_filter_fp_chance=None,

  id=None,

  keyspace='Narrative',

  key_cache_size=None,

  replicate_on_write=True,

  subcomparator_type='org.apache.cassandra.db.marshal.BytesType',

  merge_shards_chance=None,

  row_cache_provider=None,

  row_cache_save_period_in_seconds=None,

  column_type='Super',

  memtable_throughput_in_mb=None,

  memtable_flush_after_mins=None,


column_metadata={

'expanded_url': ColumnDef(index_type=None, index_name=None,
validation_class='org.apache.cassandra.db.marshal.UTF8Type',
name='expanded_url', index_options=None),

'favorite': ColumnDef(index_type=None, index_name=None,
validation_class='org.apache.cassandra.db.marshal.IntegerType',
name='favorite', index_options=None),

'retweet': ColumnDef(index_type=None, index_name=None,
validation_class='org.apache.cassandra.db.marshal.IntegerType',
name='retweet', index_options=None),

'iid': ColumnDef(index_type=None, index_name=None,
validation_class='org.apache.cassandra.db.marshal.UTF8Type', name='iid',
index_options=None),

'screen_name': ColumnDef(index_type=None, index_name=None,
validation_class='org.apache.cassandra.db.marshal.UTF8Type',
name='screen_name', index_options=None),

'embedly_data': ColumnDef(index_type=None, index_name=None,
validation_class='org.apache.cassandra.db.marshal.BytesType',
name='embedly_data', index_options=None),

'created_date': ColumnDef(index_type=None, index_name=None,
validation_class='org.apache.cassandra.db.marshal.UTF8Type',
name='created_date', index_options=None),

'tid': ColumnDef(index_type=None, index_name=None,
validation_class='org.apache.cassandra.db.marshal.UTF8Type', name='tid',
index_options=None),

'score': ColumnDef(index_type=None, index_name=None,
validation_class='org.apache.cassandra.db.marshal.IntegerType',
name='score', index_options=None),

'approved': ColumnDef(index_type=None, index_name=None,
validation_class='org.apache.cassandra.db.marshal.UTF8Type',
name='approved', index_options=None),

'likes': ColumnDef(index_type=None, index_name=None,
validation_class='org.apache.cassandra.db.marshal.IntegerType',
name='likes', index_options=None)},


key_alias=None,

dclocal_read_repair_chance=0.0,

name='Twitter_Instagram',

compaction_strategy_options={},

row_cache_keys_to_save=None,

compaction_strategy='org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',

memtable_operations_in_millions=None,

caching='KEYS_ONLY',

comparator_type='org.apache.cassandra.db.marshal.BytesType',

row_cache_size=None),


Re: Bulkoutputformat

2013-12-13 Thread varun allampalli
Thanks Rahul, the article was insightful.










Re: One big table/cf or many small ones?

2013-12-13 Thread Jacob Rhoden
Hi Tinus,

 On 12 Dec 2013, at 6:59 pm, Tinus Sky tinus...@gmail.com wrote:
 
 My service has users who can add a message to a list. The list of
 messages is sorted by date and displayed. When a user changes a message, the
 date is changed and the message moves to the top of the list.
 
 A possible solution is to remove the row and insert it again, but I suspect
 this might not be the best solution. Is there an alternative solution?

Just insert a new version of the message, with a version field, into the
table/column family. When outputting query results to the user, filter/hide
older versions. And bam, you have a new feature on your web site: "show
old versions" or "show previous edits". A sketch of the idea follows.
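
A sketch of one way to model that (placeholder keyspace/table names; CQL 3
executed through the Java driver used elsewhere in this digest):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class VersionedMessagesSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            // Newest version of each message sorts first within a user's row.
            session.execute("CREATE TABLE IF NOT EXISTS ks.messages (" +
                    " user_id text, msg_id timeuuid, version int, body text," +
                    " PRIMARY KEY ((user_id), msg_id, version))" +
                    " WITH CLUSTERING ORDER BY (msg_id DESC, version DESC)");
            // An edit is just the next version under the same msg_id; readers
            // keep the first (highest) version per msg_id and skip the rest,
            // which also gives "show previous edits" for free.
            session.execute("INSERT INTO ks.messages (user_id, msg_id, version, body)" +
                    " VALUES ('tinus', 99b4be30-63e4-11e3-8000-000000000000, 2, 'edited text')");
            cluster.close();
        }
    }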

 My second question is regarding the number of tables/column families in a
 keyspace.
 
 I can create a table which contains all messages from all users. But I can
 also create one table for every user, with a name like
 messages_[userid], where [userid] is the id of the user. Or I can shard:
 messages_a (contains messages from users whose name starts with a), messages_b
 (contains messages from users whose name starts with b).
 
 My user count is around 100,000, and the messages per user are approximately
 20,000.
 
 What would be the choice: put everything in 1 big table, or go with the many
 small tables option?

I'm pretty sure you don't want to put 100,000 tables in a Cassandra keyspace.
Go with one.

Regards,
Jacob