Cassandra Database using too much space

2014-12-14 Thread Chamila Wijayarathna
Hello all,

We are trying to develop a language corpus by using Cassandra as its
storage medium.

https://gist.github.com/cdwijayarathna/7550176443ad2229fae0 shows the types
of information we need to extract from the corpus interface.
So we designed the schema at
https://gist.github.com/cdwijayarathna/6491122063152669839f to use as the
database. Our target is to develop a corpus with 100+ million words.

So far we have inserted about 1.5 million words, and the database has used about
14 GB of space. Is this normal, or are we doing something wrong? Is
there any issue in our data model?

Thank You!
-- 
*Chamila Dilshan Wijayarathna,*
SMIEEE, SMIESL,
Undergraduate,
Department of Computer Science and Engineering,
University of Moratuwa.


Hinted handoff not working

2014-12-14 Thread Robert Wille
I have a cluster with RF=3. If I shut down one node and add a bunch of data to the
cluster, I don’t see records added to system.hints. Also, du of
/var/lib/cassandra/data/system/hints on the nodes that are up shows that hints
aren’t being stored. When I start the down node, its data doesn’t grow until I
run repair, which then takes a really long time because it is significantly out
of date. Is there some magic setting I cannot find in the documentation to
enable hinted handoff? I’m running 2.0.11. Any insights would be greatly
appreciated.

Thanks

Robert
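
A quick way to confirm whether hints are actually being written, independent of
du, is to query the system.hints table from cqlsh on one of the live nodes while
the other node is down. This is only a sanity check and assumes the stock 2.0
system tables:

   cqlsh> SELECT count(*) FROM system.hints;

If the count stays at zero while writes are flowing, hints really are not being
recorded. Besides hinted_handoff_enabled, max_hint_window_in_ms in cassandra.yaml
is worth checking: hints stop being collected once a node has been down longer
than that window (three hours by default).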



Re: Hinted handoff not working

2014-12-14 Thread Rahul Neelakantan
http://www.datastax.com/documentation/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__hinted_handoff_enabled

Rahul

 On Dec 14, 2014, at 9:46 AM, Robert Wille rwi...@fold3.com wrote:
 
 I have a cluster with RF=3. If I shut down one node and add a bunch of data to
 the cluster, I don’t see records added to system.hints. Also, du
 of /var/lib/cassandra/data/system/hints on the nodes that are up shows that
 hints aren’t being stored. When I start the down node, its data doesn’t grow
 until I run repair, which then takes a really long time because it is
 significantly out of date. Is there some magic setting I cannot find in the
 documentation to enable hinted handoff? I’m running 2.0.11. Any insights
 would be greatly appreciated.
 
 Thanks
 
 Robert
 


Re: Cassandra Database using too much space

2014-12-14 Thread Ryan Svihla
Well, your data model looks fine at a glance: a lot of tables, but they
appear to map to logically obvious query paths. This denormalization
will make your queries fast but eat up more disk, and if disk is really a
pain point, I'd suggest looking at your economics a bit and weighing your
tradeoffs.


   1. If you want less disk usage and can afford longer query times, switch
   from denormalized views to indexes instead; you'll get better disk space
   savings, at the cost of more round trips on a read (read the index value,
   get the partition key, then do another read). See the CQL sketch below.
   2. If you really need queries to be as fast as possible, then you're on
   the right path, but you'll have to realize this is the cost of scale. Even
   with relational databases in the past, I've had to use a similar strategy
   to speed up lookups (fewer distinct query parameters in that case, and more
   queries that would normally require lots of joins).

Hope this helps explain tradeoffs and costs.
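
As a rough sketch of option 1, the two shapes look something like this (the
table and column names below are made up for illustration, not taken from the
linked schema):

   -- denormalized query table: one table per query path; fast reads, more disk
   CREATE TABLE word_frequency_by_year (
       year int,
       frequency int,
       word text,
       PRIMARY KEY (year, frequency, word)
   );

   -- index-based alternative: one base table plus a secondary index;
   -- less disk, but each read does an index lookup and then a second read
   CREATE TABLE word_frequency (
       word text PRIMARY KEY,
       frequency int,
       year int
   );
   CREATE INDEX word_frequency_year_idx ON word_frequency (year);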

On Sun, Dec 14, 2014 at 6:01 AM, Chamila Wijayarathna 
cdwijayarat...@gmail.com wrote:

 Hello all,

 We are trying to develop a language corpus by using Cassandra as its
 storage medium.

 https://gist.github.com/cdwijayarathna/7550176443ad2229fae0 shows the
 types of information we need to extract from the corpus interface.
 So we designed the schema at
 https://gist.github.com/cdwijayarathna/6491122063152669839f to use as the
 database. Our target is to develop a corpus with 100+ million words.

 So far we have inserted about 1.5 million words, and the database has used
 about 14 GB of space. Is this normal, or are we doing something wrong?
 Is there any issue in our data model?

 Thank You!
 --
 *Chamila Dilshan Wijayarathna,*
 SMIEEE, SMIESL,
 Undergraduate,
 Department of Computer Science and Engineering,
 University of Moratuwa.



-- 

http://www.datastax.com/

Ryan Svihla

Solution Architect

https://twitter.com/foundev
http://www.linkedin.com/pub/ryan-svihla/12/621/727/

DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
DataStax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the world’s
most innovative companies, such as Netflix, Adobe, Intuit, and eBay.


Re: Cassandra Database using too much space

2014-12-14 Thread Chamila Wijayarathna
Hi Ryan,

Thank you very much. This helps a lot.

On Sun, Dec 14, 2014 at 9:14 PM, Ryan Svihla rsvi...@datastax.com wrote:

 Well, your data model looks fine at a glance: a lot of tables, but they
 appear to map to logically obvious query paths. This denormalization
 will make your queries fast but eat up more disk, and if disk is really a
 pain point, I'd suggest looking at your economics a bit and weighing your
 tradeoffs.


1. If you want less disk usage and can afford longer query times, switch
from denormalized views to indexes instead; you'll get better disk space
savings, at the cost of more round trips on a read (read the index value,
get the partition key, then do another read).
2. If you really need queries to be as fast as possible, then you're on
the right path, but you'll have to realize this is the cost of scale. Even
with relational databases in the past, I've had to use a similar strategy
to speed up lookups (fewer distinct query parameters in that case, and more
queries that would normally require lots of joins).

 Hope this helps explain tradeoffs and costs.

 On Sun, Dec 14, 2014 at 6:01 AM, Chamila Wijayarathna 
 cdwijayarat...@gmail.com wrote:

 Hello all,

 We are trying to develop a language corpus by using Cassandra as its
 storage medium.

 https://gist.github.com/cdwijayarathna/7550176443ad2229fae0 shows the
 types of information we need to extract from the corpus interface.
 So we designed the schema at
 https://gist.github.com/cdwijayarathna/6491122063152669839f to use as
 the database. Our target is to develop a corpus with 100+ million words.

 So far we have inserted about 1.5 million words, and the database has used
 about 14 GB of space. Is this normal, or are we doing something wrong?
 Is there any issue in our data model?

 Thank You!
 --
 *Chamila Dilshan Wijayarathna,*
 SMIEEE, SMIESL,
 Undergraduate,
 Department of Computer Science and Engineering,
 University of Moratuwa.



 --

 http://www.datastax.com/

 Ryan Svihla

 Solution Architect

 https://twitter.com/foundev
 http://www.linkedin.com/pub/ryan-svihla/12/621/727/

 DataStax is the fastest, most scalable distributed database technology,
 delivering Apache Cassandra to the world’s most innovative enterprises.
 DataStax is built to be agile, always-on, and predictably scalable to any
 size. With more than 500 customers in 45 countries, DataStax is the
 database technology and transactional backbone of choice for the world’s
 most innovative companies, such as Netflix, Adobe, Intuit, and eBay.



-- 
*Chamila Dilshan Wijayarathna,*
SMIEEE, SMIESL,
Undergraduate,
Department of Computer Science and Engineering,
University of Moratuwa.


Re: Cassandra Database using too much space

2014-12-14 Thread Jack Krupansky
It looks like you will have quite a few “combinatorial explosions” to cope with.
In addition to 1.5M words, you have bigrams and trigrams – combinations of two
and three words. You need to get a handle on the cardinality of each of your
tables. Bigrams and trigrams could give you who knows how many millions more
rows than the 1.5M word frequency rows.

And then you have word, bigram, and trigram frequencies by year as well, 
meaning take the counts from above and multiply by the number of years in your 
corpus!

And then you have word, bigram, and trigram “usage” – and by year as well. Is
that every unique sentence from the corpus? Either way, this is an incredible
combinatorial explosion.

And then there is category and position, which I didn’t look at since you 
didn’t specify what exactly they are. Once again, start with a focus on 
cardinality of the data.

In short, just as a thought experiment, say that your 1.5M words expanded into
15M rows; divide that into 15 GB and that would give you 1,000 bytes per row,
which may be a bit more than desired, but not totally unreasonable. And maybe
the explosion is more like 30 to 1, which would give about 333 bytes per row,
which seems quite reasonable.
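
Spelled out, that back-of-the-envelope arithmetic is roughly:

   15 GB / 15,000,000 rows              ≈ 1,000 bytes per row
   15 GB / 45,000,000 rows (1.5M x 30)  ≈ 333 bytes per row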

Also, are you doing heavy updates, for each word (and bigram and trigram) as 
each occurrence is encountered in the corpus or are you counting things in 
memory and then only writing each row once after the full corpus has been read?

Also, what is the corpus size – total word instances, both for the full corpus 
and for the subset containing your 1.5 million words?

-- Jack Krupansky

From: Chamila Wijayarathna 
Sent: Sunday, December 14, 2014 7:01 AM
To: user@cassandra.apache.org 
Subject: Cassandra Database using too much space

Hello all, 

We are trying to develop a language corpus by using Cassandra as its storage 
medium.

https://gist.github.com/cdwijayarathna/7550176443ad2229fae0 shows the types of 
information we need to extract from the corpus interface.

So we designed the schema at
https://gist.github.com/cdwijayarathna/6491122063152669839f to use as the
database. Our target is to develop a corpus with 100+ million words.

So far we have inserted about 1.5 million words, and the database has used about
14 GB of space. Is this normal, or are we doing something wrong? Is there
any issue in our data model?

Thank You!
-- 

Chamila Dilshan Wijayarathna,
SMIEEE, SMIESL,
Undergraduate,
Department of Computer Science and Engineering,
University of Moratuwa.


Access to locally partitioned data

2014-12-14 Thread Jason Kania
Hello,

I am wondering if there is a way to query a table so that only the results from
the locally stored partitions are returned?

To give some background, my application requires millions of timers and since 
queue-like implementations are a bad fit/anti-pattern for Cassandra, I am 
moving to an in-memory system to manage these timers. However, I would like to 
partition the timers such that:

1) related DB queries using the same partitioning key are most likely handled 
locally to minimize traffic as these timers are short duration in nature
2) there is no need to manage multiple partitioning schemes for the same data 
as the cluster grows

In all other respects Cassandra is one of the best databases for my needs as I 
am using it for time series data.

Thanks,

Jason



Re: Cassandra Database using too much space

2014-12-14 Thread Chamila Wijayarathna
Hi Jack,

Thanks for replying.

What I meant by 1.5M words is not 1.5M distinct words; it is the count
of all words we added to the corpus (total word instances). Then, in the
word_frequency and word_ordered_frequency CFs, we have a row for each
distinct word with its frequency (the two CFs have the same data with different
indexing). We also keep frequencies by year, by category (newspaper,
magazine, fiction, etc.), and by the position where the word occurs in a
sentence. So the distinct word count is probably about 0.2M. We don't keep any
rows in the frequency tables where the frequency is 0, so the word 'abc' may
only have rows for the years 2014 and 2010 if it is only used in those years.

In the bigram and trigram tables, we do not store all possible combinations of
words; we only store the bigrams/trigrams that occur in the resources we have
considered. In the word_usage table we have an entry for each word instance,
which means 1.5M rows with the context details of where the word has been used.
The same happens with the bigram and trigram tables as well.

Here we used separate column families word_usage, word_year_usage, and
word_Category_usage with the same details, since we have to search in 4
scenarios, using

   1. year,
   2. category,
   3. year and category,
   4. none

inside the WHERE clause, and also order the results by date. They contain the
same data but with different indexing. The same goes for the bigram and trigram
CFs.

We update frequencies while entering words into the database, so for every word
instance we add, we either insert a new row or update an existing row. In
some cases where we use frequency as a clustering column, since we can't
update the frequency in place, we delete the entire row and add a new row with
the updated frequency. [1] is the client we used for inserting data.
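
For illustration, the delete-and-reinsert step described above would look
roughly like the following; the table layout and values here are guesses for
illustration only, not the actual schema:

   BEGIN BATCH
     DELETE FROM word_ordered_frequency
       WHERE year = 2014 AND frequency = 41 AND word = 'abc';
     INSERT INTO word_ordered_frequency (year, frequency, word)
       VALUES (2014, 42, 'abc');
   APPLY BATCH;

Note that each such delete leaves a tombstone, and the old row stays on disk
until compaction removes it, which also contributes to the space usage being
discussed in this thread.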

I am very new to Cassandra and I may have done a lot of things wrong in
modeling and implementing this database. Please let me know if there is
anything wrong here.

Thank You!

1.
https://github.com/DImuthuUpe/DBFeederMvn/blob/master/src/main/java/com/sinmin/corpus/cassandra/CassandraClient.java

On Mon, Dec 15, 2014 at 1:46 AM, Jack Krupansky j...@basetechnology.com
wrote:

   It looks like you will have quite a few “combinatorial explosions” to
 cope with. In addition to 1.5M words, you have bigrams and trigrams –
 combinations of two and three words. You need to get a handle on the
 cardinality of each of your tables. Bigrams and trigrams could give you who
 knows how many millions more rows than the 1.5M word frequency rows.

 And then you have word, bigram, and trigram frequencies by year as well,
 meaning take the counts from above and multiply by the number of years in
 your corpus!

 And then you have word, bigram, and trigram “usage” – and by year as
 well. Is that every unique sentence from the corpus? Either way, this is an
 incredible combinatorial explosion.

 And then there is category and position, which I didn’t look at since you
 didn’t specify what exactly they are. Once again, start with a focus on
 cardinality of the data.

 In short, just as a thought experiment, say that your 1.5M words expanded
 into 15M rows; divide that into 15 GB and that would give you 1,000 bytes
 per row, which may be a bit more than desired, but not totally
 unreasonable. And maybe the explosion is more like 30 to 1, which would
 give about 333 bytes per row, which seems quite reasonable.

 Also, are you doing heavy updates, for each word (and bigram and trigram)
 as each occurrence is encountered in the corpus or are you counting things
 in memory and then only writing each row once after the full corpus has
 been read?

 Also, what is the corpus size – total word instances, both for the full
 corpus and for the subset containing your 1.5 million words?

 -- Jack Krupansky

  *From:* Chamila Wijayarathna cdwijayarat...@gmail.com
 *Sent:* Sunday, December 14, 2014 7:01 AM
 *To:* user@cassandra.apache.org
 *Subject:* Cassandra Database using too much space

  Hello all,

 We are trying to develop a language corpus by using Cassandra as its
 storage medium.

 https://gist.github.com/cdwijayarathna/7550176443ad2229fae0 shows the
 types of information we need to extract from the corpus interface.
 So we designed the schema at
 https://gist.github.com/cdwijayarathna/6491122063152669839f to use as the
 database. Our target is to develop a corpus with 100+ million words.

 So far we have inserted about 1.5 million words, and the database has used
 about 14 GB of space. Is this normal, or are we doing something wrong?
 Is there any issue in our data model?

 Thank You!
 --
 *Chamila Dilshan Wijayarathna,*
 SMIEEE, SMIESL,
 Undergraduate,
 Department of Computer Science and Engineering,
 University of Moratuwa.



-- 
*Chamila Dilshan Wijayarathna,*
SMIEEE, SMIESL,
Undergraduate,
Department of Computer Science and Engineering,
University of Moratuwa.


Re: Hinted handoff not working

2014-12-14 Thread Robert Wille
I’ve got hinted_handoff_enabled: true in cassandra.yaml. My settings are all 
default except for the DC, listen addresses and snitch. I should have mentioned 
this in my original post.

On Dec 14, 2014, at 8:02 AM, Rahul Neelakantan ra...@rahul.be wrote:

 http://www.datastax.com/documentation/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__hinted_handoff_enabled
 
 Rahul
 
 On Dec 14, 2014, at 9:46 AM, Robert Wille rwi...@fold3.com wrote:
 
 I have a cluster with RF=3. If I shut down one node and add a bunch of data to
 the cluster, I don’t see records added to system.hints. Also, du
 of /var/lib/cassandra/data/system/hints on the nodes that are up shows that
 hints aren’t being stored. When I start the down node, its data doesn’t grow
 until I run repair, which then takes a really long time because it is
 significantly out of date. Is there some magic setting I cannot find in the
 documentation to enable hinted handoff? I’m running 2.0.11. Any insights
 would be greatly appreciated.
 
 Thanks
 
 Robert
 



Re: Hinted handoff not working

2014-12-14 Thread Jens Rantil
Hi Robert,

Maybe you need to flush your memtables to actually see the disk usage increase? 
This applies to both hosts.
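
Concretely, that check might look like this on each of the live nodes, using the
same path as in the original post and assuming the stock data directory layout:

   $ nodetool flush system hints
   $ du -sh /var/lib/cassandra/data/system/hints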

Cheers,
Jens

On Sun, Dec 14, 2014 at 3:52 PM, Robert Wille rwi...@fold3.com wrote:

 I have a cluster with RF=3. If I shut down one node and add a bunch of data to
 the cluster, I don’t see records added to system.hints. Also, du
 of /var/lib/cassandra/data/system/hints on the nodes that are up shows that
 hints aren’t being stored. When I start the down node, its data doesn’t grow
 until I run repair, which then takes a really long time because it is
 significantly out of date. Is there some magic setting I cannot find in the
 documentation to enable hinted handoff? I’m running 2.0.11. Any insights
 would be greatly appreciated.
 Thanks
 Robert

What does the -node argument mean in Cassandra stress tool?

2014-12-14 Thread 孔嘉林
Hi,
I am using the Cassandra stress tool provided in the 2.1.2 distribution. I
wonder what the -node argument means. Does it specify the cluster
server nodes or the stress client nodes?

In the documentation, it says:
“Splitting up a load over multiple cassandra-stress instances on different
nodes: This is useful for loading into large clusters, where a single
cassandra-stress load generator node cannot saturate the cluster. In this
example, $NODES is a variable whose value is a comma-delimited list of IP
addresses such as 10.0.0.1,10.0.0.2, and so on.”

# On Node1
$ cassandra-stress write -node $NODES
# On Node2
$ cassandra-stress write -node $NODES

It seems that in the example -node specifies all the client nodes
sending stress requests. But if I run the client and server on different nodes,
how do I specify the server nodes' IPs on the stress command line?

Thanks,
Joy
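
For what it is worth, the -node option of cassandra-stress appears to take the
contact points of the Cassandra cluster being tested (the server nodes), which
matches the comma-delimited IP list in the quoted documentation. Assuming that,
running the tool from a separate client machine against a two-node cluster
would look something like:

   $ cassandra-stress write -node 10.0.0.1,10.0.0.2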