Re: Cassandra at Amazon AWS

2013-01-21 Thread Roland Gude
On a side note:
If you are going for Priam AND you are using LeveledCompaction, think carefully
about whether you need incremental backups. The S3 upload cost can be very high,
because leveled compaction tends to create a lot of files and each PUT request
to S3 costs money. We had this setup in a relatively small cluster of 4 nodes,
where the switch to leveled compaction increased the backup cost by 800 Euro a month.

Greetings
Roland

From: Roland Gude [mailto:roland.g...@ez.no]
Sent: Friday, 18 January 2013 09:23
To: user@cassandra.apache.org
Subject: Re: Cassandra at Amazon AWS

Priam is good for backups, but it is another complex (though very good) part of a
software stack.
A simple solution is to take regular snapshots (via cron), compress them, and
put them into S3.
In S3 you can simply choose how many days the files are kept.

This can be done with a couple of lines of shell script and a simple crontab
entry.
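The snapshot-compress-upload loop described above can be sketched in a short script (Python, to match the pycassa snippets elsewhere in this digest). Everything concrete here is an assumption for illustration, not part of the original advice: the keyspace name, data directory, bucket, and the use of s3cmd for the upload.

```python
import datetime
import socket
import subprocess
import tarfile

KEYSPACE = "my_keyspace"                 # assumption: your keyspace
DATA_DIR = "/var/lib/cassandra/data"     # assumption: default data directory
BUCKET = "my-cassandra-backups"          # assumption: your S3 bucket

def snapshot_tag(day=None):
    """Date-stamped snapshot tag, e.g. 'backup_20130118'."""
    day = day or
    return "backup_" + day.strftime("%Y%m%d")

def run_backup():
    tag = snapshot_tag()
    # 1. Flush memtables and hard-link the current SSTables under the tag.
    subprocess.check_call(["nodetool", "snapshot", "-t", tag, KEYSPACE])
    # 2. Compress the keyspace data (including the snapshot links) into a
    #    single archive: one object per day keeps the S3 PUT count low.
    archive = "/tmp/%s.tar.gz" % tag
    with, "w:gz") as tar:
        tar.add("%s/%s" % (DATA_DIR, KEYSPACE), arcname=tag)
    # 3. Upload; S3 lifecycle rules on the bucket handle how many days to keep.
    subprocess.check_call(
        ["s3cmd", "put", archive,
         "s3://%s/%s/%s.tar.gz" % (BUCKET, socket.gethostname(), tag)])
    # 4. Drop the local snapshot links; the daily archive now lives in S3.
    subprocess.check_call(["nodetool", "clearsnapshot", KEYSPACE])

if __name__ == "__main__":
    run_backup()   # run once a day from a crontab entry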

From: Marcelo Elias Del Valle [mailto:mvall...@gmail.com]
Sent: Friday, 18 January 2013 04:53
To: user@cassandra.apache.org
Subject: Re: Cassandra at Amazon AWS

Everyone, thanks a lot for the answers; they helped me a lot.

2013/1/17 Andrey Ilinykh ailin...@gmail.com
I'd recommend Priam.

http://techblog.netflix.com/2012/02/announcing-priam.html

Andrey

On Thu, Jan 17, 2013 at 5:44 AM, Adam Venturella
aventure...@gmail.com wrote:
Jared, how do you guys handle data backups for your ephemeral-based cluster?

I'm trying to move to ephemeral drives myself, and that was my last sticking 
point; asking how others in the community deal with backup in case the VM 
explodes.


On Wed, Jan 16, 2013 at 1:21 PM, Jared Biel
jared.b...@bolderthinking.com wrote:
We're currently using Cassandra on EC2 at very low scale (a 2-node
cluster on m1.large instances in two regions). I don't believe that
EBS is recommended, for performance reasons. Also, it has proven to be
very unreliable in the past (most of the big/notable AWS outages were
due to EBS issues). We've moved 99% of our instances off of EBS.

As others have said, if you require more space in the future it's easy
to add more nodes to the cluster. I've found this page
(http://www.ec2instances.info/) very useful in determining the amount
of space each instance type has. Note that by default only one
ephemeral drive is attached, and you must specify all ephemeral drives
that you want to use at launch time. Also, you can create a RAID 0 of
all local disks to provide maximum speed and space.


On 16 January 2013 20:42, Marcelo Elias Del Valle
mvall...@gmail.com wrote:
 Hello,

 I am currently using Hadoop + Cassandra at Amazon AWS. Cassandra runs on
 EC2 and my Hadoop process runs at EMR. For Cassandra storage, I am using
 EC2 EBS disks.
 My system is running fine for my tests, but to me it's not a good setup
 for production. I need my system to perform well especially for writes on
 Cassandra, but the amount of data could grow really big, taking several TB
 of total storage.
 My first guess was using S3 as storage, and I saw this can be done by
 using the Cloudian package, but I wouldn't like to become dependent on a
 pre-packaged solution, and I found it's kind of expensive for more than 100TB:
 http://www.cloudian.com/pricing.html
 I saw some discussion on the internet about using EBS or ephemeral disks for
 storage at Amazon too.

 My question is: does someone on this list have the same problem as me?
 What are you using as a solution for Cassandra's storage when running it at
 Amazon AWS?

 Any thoughts would be highly appreciated.

 Best regards,
 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr





--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr



Re: TTL on SecondaryIndex Columns. A bug?

2012-12-19 Thread Roland Gude
I think this might be https://issues.apache.org/jira/browse/CASSANDRA-4670
Unfortunately, apart from me, no one has yet been able to reproduce it.

Check if the data is available before/after compaction.
If you use leveled compaction it is hard to test, because you cannot trigger
compaction manually.

-Original Message-
From: Alexei Bakanov [mailto:russ...@gmail.com]
Sent: Wednesday, 19 December 2012 09:35
To: user@cassandra.apache.org
Subject: Re: TTL on SecondaryIndex Columns. A bug?

I'm running on a single node on my laptop.
It looks like the point at which rows disappear from the index depends on the
JVM memory settings. With more memory, it needs more data fed in before things
start disappearing.
Please try to run Cassandra with -Xms1927M -Xmx1927M -Xmn400M

To be sure, try to get rows for 'indexedColumn'='1':

[default@ks123] get cf1 where 'indexedColumn'='1';

0 Row Returned.

Thanks


On 19 December 2012 05:15, aaron morton aa...@thelastpickle.com wrote:
 Thanks for the nice steps to reproduce.

 I ran this on my MBP using C* 1.1.7 and got the expected results; both
 gets returned a row.

 Were you running against a single node or a cluster ? If a cluster did 
 you change the CL, cassandra-cli defaults to ONE.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 18/12/2012, at 9:44 PM, Alexei Bakanov russ...@gmail.com wrote:

 Hi,

 We are having an issue with TTL on secondary index columns. We get 0
 rows in return when running queries on indexed columns that have a TTL.
 Everything works fine with small amounts of data, but when we get over
 a certain threshold it looks like older rows disappear from the index.
 In the example below we create 70 rows with 45k columns each, plus one
 indexed column with just the row key as value, so we have one row per
 indexed value. When the script is finished the index contains rows
 66-69. Rows 0-65 are gone from the index.
 Using 'indexedColumn' without TTL fixes the problem.


 - SCHEMA START -
 create keyspace ks123
   with placement_strategy = 'NetworkTopologyStrategy'
   and strategy_options = {datacenter1 : 1}
   and durable_writes = true;

 use ks123;

 create column family cf1
   with column_type = 'Standard'
   and comparator = 'AsciiType'
   and default_validation_class = 'AsciiType'
   and key_validation_class = 'AsciiType'
   and read_repair_chance = 0.1
   and dclocal_read_repair_chance = 0.0
   and gc_grace = 864000
   and min_compaction_threshold = 4
   and max_compaction_threshold = 32
   and replicate_on_write = true
   and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
   and caching = 'KEYS_ONLY'
   and column_metadata = [
     {column_name : 'indexedColumn',
      validation_class : AsciiType,
      index_name : 'INDEX1',
      index_type : 0}]
   and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};
 - SCHEMA FINISH -

 - POPULATE START -
 from pycassa.batch import Mutator
 import pycassa

 pool = pycassa.ConnectionPool('ks123')
 cf = pycassa.ColumnFamily(pool, 'cf1')

 for rowKey in xrange(70):
     b = Mutator(pool)
     for datapoint in xrange(1, 45001):
         b.insert(cf, str(rowKey), {str(datapoint): 'val'}, ttl=7884000)
     b.insert(cf, str(rowKey), {'indexedColumn': str(rowKey)}, ttl=7887600)
     print 'row %d' % rowKey
     b.send()

 pool.dispose()
 - POPULATE FINISH -

 - QUERY START -
 [default@ks123] get cf1 where 'indexedColumn'='65';

 0 Row Returned.
 Elapsed time: 2.38 msec(s).

 [default@ks123] get cf1 where 'indexedColumn'='66';
 ---
 RowKey: 66
 => (column=1, value=val, timestamp=1355818765548964, ttl=7884000)
 ...
 => (column=10087, value=val, timestamp=1355818766075538, ttl=7884000)
 => (column=indexedColumn, value=66, timestamp=1355818768119334, ttl=7887600)

 1 Row Returned.
 Elapsed time: 31 msec(s).
 - QUERY FINISH -

 This is all using Cassandra 1.1.7 with default settings.

 Best regards,

 Alexei Bakanov






Re: Replication Factor and Consistency Level Confusion

2012-12-19 Thread Roland Gude
Hi

RF 2 means that 2 nodes are responsible for any given row (no matter how many
nodes are in the cluster).
For your cluster with three nodes, let's just assume the following
responsibilities:

Node            A       B       C
Primary keys    0-5     6-10    11-15
Replica keys    11-15   0-5     6-10

Assume node C is down.
Writing any key in range 0-5 at consistency TWO is possible (A and B are up).
Writing any key in range 11-15 at consistency TWO will fail (C is down, and
11-15 is its primary range).
Writing any key in range 6-10 at consistency TWO will fail (C is down, and it
is the replica for this range).

I hope this explains it.
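The ranges above can be checked with a toy model. This is a sketch only: the node names, the up/down map, and the SimpleStrategy-like "next node on the ring" placement are illustrative assumptions, not Cassandra APIs.

```python
# Toy model: 3 nodes, RF 2, replica on the next ring node, node C down.
# A write at consistency level TWO needs acknowledgements from 2 live replicas.
NODES = ["A", "B", "C"]                    # ring order
UP = {"A": True, "B": True, "C": False}    # node C is down
RF = 2

def replicas(primary_index):
    """RF consecutive nodes on the ring, starting at the primary."""
    return [NODES[(primary_index + i) % len(NODES)] for i in range(RF)]

def write_ok(primary_index, consistency):
    """A write succeeds if at least `consistency` replicas are up."""
    live = sum(1 for node in replicas(primary_index) if UP[node])
    return live >= consistency

# Keys 0-5   (primary A, replica B): both up        -> TWO succeeds
# Keys 6-10  (primary B, replica C): C is a replica -> TWO fails
# Keys 11-15 (primary C, replica A): C is primary   -> TWO fails
```

At consistency ONE all three ranges still accept writes in this model, since every range keeps at least one live replica.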

-Original Message-
From: Vasileios Vlachos [mailto:vasileiosvlac...@gmail.com]
Sent: Wednesday, 19 December 2012 17:07
To: user@cassandra.apache.org
Subject: Replication Factor and Consistency Level Confusion

Hello All,

We have a 3-node cluster and we created a keyspace (say Test_1) with the
Replication Factor set to 3. I know it is not great, but we wanted to test
different behaviours. So, we created a Column Family (say cf_1) and we tried
writing something with Consistency Level ANY, ONE, TWO, THREE, QUORUM and ALL.
We did that while all nodes were in the UP state, so we had no problems at all.
No matter what the Consistency Level was, we were able to insert a value.

Same cluster, different keyspace (say Test_2) with Replication Factor set to 2 
this time and one of the 3 nodes deliberately DOWN. Again, we created a Column 
Family (say cf_1) and we tried writing something with different Consistency 
Levels. Here is what we got:
ANY: worked (expected...)
ONE: worked (expected...)
TWO: did not work (what???)
THREE: did not work (expected...)
QUORUM: worked (expected...)
ALL: did not work (expected I guess...)

Now, we know that QUORUM derives from (RF/2)+1, so we were expecting that to 
work, after all only 1 node was DOWN. Why did Consistency Level TWO not work 
then???

Third test... Same cluster again, different keyspace (say Test_3) with 
Replication Factor set to 3 this time and 1 of the 3 nodes deliberately DOWN 
again. Same approach again, created different Column Family (say cf_1) and 
different Consistency Level settings resulted in the following:
ANY: worked (what???)
ONE: worked (what???)
TWO: did not work (what???)
THREE: did not work (expected...)
QUORUM: worked (what???)
ALL: worked (what???)

We thought that if the Replication Factor is greater than the number of nodes 
in the cluster, writes are blocked.

Apparently we are completely missing a level of understanding here, so we
would appreciate any help!

Thank you in advance!

Vasilis




Re: secondary indexes TTL - strange issues

2012-09-17 Thread Roland Gude
Issue created.

Will attach debug logs ASAP:
CASSANDRA-4670: https://issues.apache.org/jira/browse/CASSANDRA-4670

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Monday, 17 September 2012 03:46
To: user@cassandra.apache.org
Subject: Re: secondary indexes TTL - strange issues

 Data gets inserted and is accessible via index query for some time. At some
point in time the indexes are completely empty and start filling again (while
new data enters the system).
If you can reproduce this please create a ticket on 
https://issues.apache.org/jira/browse/CASSANDRA .

If you can include DEBUG level logs that would be helpful.

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 10:08 PM, Roland Gude
roland.g...@ez.no wrote:


I am not sure it is compacting an old file: the same thing happens every time I
rebuild the index. New files appear, get compacted, and vanish.

We have set up a new, smaller cluster with fresh data. The same thing happens
here as well. Data gets inserted and is accessible via index query for some
time. At some point in time the indexes are completely empty and start filling
again (while new data enters the system).

I am currently testing with SizeTiered on both the fresh set and the imported
set.

For the fresh set (which is significantly smaller), first results imply that
the issue does not happen with SizeTieredCompaction. I have not yet tested
everything that comes to my mind and will update if something new comes up.

As for the failing query it is from the cli:
get EventsByItem where 0003--1000--=utf8('someValue');
0003--1000-- is a TUUID we use as a marker for a 
TimeSeries.
(and equivalent queries with astyanax and hector as well)

This is a cf with the issue:

create column family EventsByItem
  with column_type = 'Standard'
  and comparator = 'TimeUUIDType'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'BytesType'
  and read_repair_chance = 0.5
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy = 
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
  and caching = 'NONE'
  and column_metadata = [
{column_name : '--1000--',
validation_class : BytesType,
index_name : 'ebi_mandatorIndex',
index_type : 0},
{column_name : '0002--1000--',
validation_class : BytesType,
index_name : 'ebi_itemidIndex',
index_type : 0},
{column_name : '0003--1000--',
validation_class : BytesType,
index_name : 'ebi_eventtypeIndex',
index_type : 0}]
  and compression_options={sstable_compression:SnappyCompressor, 
chunk_length_kb:64};

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Friday, 14 September 2012 10:46
To: user@cassandra.apache.org
Subject: Re: secondary indexes TTL - strange issues

INFO [CompactionExecutor:181] 2012-09-13 12:58:37,443 CompactionTask.java (line 221) Compacted to [/var/lib/cassandra/data/Eventstore/EventsByItem/Eventstore-EventsByItem.ebi_eventtypeIndex-he-10-Data.db,].  78,623,000 to 373,348 (~0% of original) bytes for 83 keys at 0.000280MB/s.  Time: 1,272,883ms.
There are a lot of weird things here.
It could be leveled compaction compacting an older file for the first time,
but that would be a guess.

Rebuilding the index gives us back the data for a couple of minutes - then it 
vanishes again.
Are you able to do a test with SizeTieredCompaction?

Are you able to replicate the problem with a fresh testing CF and some test
data?

If it's only a problem with imported data, can you provide a sample of the
failing query? And maybe the CF definition?

Cheers


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 2:46 AM, Roland Gude
roland.g...@ez.no wrote:



Hi,

we have been running a system on Cassandra 0.7, relying heavily on secondary
indexes for columns with TTL.
This has been working like a charm, but we are trying hard to move forward with
Cassandra and are struggling at this point:

When we put our data into a new cluster (any 1.1.x version, currently 1.1.5),
rebuild indexes and run our system, everything seems to work well, until at
some point in time index queries do not return any data at all anymore (note
that the TTL will not expire for several months).
Rebuilding the index gives us back the data for a couple of minutes, then it
vanishes again.

What seems strange is that compaction apparently is very aggressive:

INFO [CompactionExecutor:181] 2012-09-13 12:58:37,443 CompactionTask.java (line
221) Compacted to [/var/lib/cassandra/data/Eventstore

secondary indexes TTL - strange issues

2012-09-13 Thread Roland Gude
Hi,

we have been running a system on Cassandra 0.7, relying heavily on secondary
indexes for columns with TTL.
This has been working like a charm, but we are trying hard to move forward with
Cassandra and are struggling at this point:

When we put our data into a new cluster (any 1.1.x version, currently 1.1.5),
rebuild indexes and run our system, everything seems to work well, until at
some point in time index queries do not return any data at all anymore (note
that the TTL will not expire for several months).
Rebuilding the index gives us back the data for a couple of minutes, then it
vanishes again.

What seems strange is that compaction apparently is very aggressive:

INFO [CompactionExecutor:181] 2012-09-13 12:58:37,443 CompactionTask.java (line 221) Compacted to [/var/lib/cassandra/data/Eventstore/EventsByItem/Eventstore-EventsByItem.ebi_eventtypeIndex-he-10-Data.db,].  78,623,000 to 373,348 (~0% of original) bytes for 83 keys at 0.000280MB/s.  Time: 1,272,883ms.


Actually, we have switched to LeveledCompaction. Could it be that leveled
compaction does not play nicely with indexes?




Re: How to control location of data?

2012-01-10 Thread Roland Gude
Hi,

I think everything is called a replica, so if data is on 3 nodes you have 3
replicas. There is no such thing as an original.

A partitioner decides into which partition a piece of data belongs.
A replica placement strategy decides which partition goes on which node.

You cannot suppress the partitioner.

You can select different placement strategies and partitioners for different
keyspaces, thereby choosing known data to be stored on known hosts.
This is, however, discouraged for various reasons, e.g. you need a lot of
knowledge about your data to keep the cluster balanced. What is your use case
for this requirement? There is probably a more suitable solution.

From: Andreas Rudolph [mailto:andreas.rudo...@spontech-spine.com]
Sent: Tuesday, 10 January 2012 09:53
To: user@cassandra.apache.org
Subject: How to control location of data?

Hi!

We're evaluating Cassandra for our storage needs. One of the key benefits we
see is the online replication of the data, that is, an easy way to share data
across nodes. But we need to precisely control on which node group
specific parts of a key space (columns/column families) are stored. Now
we're having trouble understanding the documentation. Could anyone help us
find some answers to our questions?

*  What does the term replica mean: If a key is stored on exactly three nodes 
in a cluster, is it correct then to say that there are three replicas of that 
key or are there just two replicas (copies) and one original?

*  What is the relation between the Cassandra concepts Partitioner and 
Replica Placement Strategy? According to documentation found on DataStax web 
site and architecture internals from the Cassandra Wiki the first storage 
location of a key (and its associated data) is determined by the Partitioner 
whereas additional storage locations are defined by Replica Placement 
Strategy. I'm wondering if I could completely redefine the way how nodes are 
selected to store a key by just implementing my own subclass of 
AbstractReplicationStrategy and configuring that subclass into the key space.

*  How can I prevent the Partitioner from being consulted at all to determine
which node stores a key first?

*  Is a key space always distributed across the whole cluster? Is it possible 
to configure Cassandra in such a way that more or less freely chosen parts of a 
key space (columns) are stored on arbitrarily chosen nodes?

Any tips would be very appreciated :-)


Re: Re: How to control location of data?

2012-01-10 Thread Roland Gude

Each node in the cluster is assigned a token (this can be done automatically,
but usually should not be).
The token of a node is the start token of the partition it is responsible for
(and the token of the next node is the end token of the current token's
partition).

Assume you have the following nodes/tokens (which are usually numbers but for 
the example I will use letters)

N1/A
N2/D
N3/M
N4/X

This means that N1 is responsible (primary) for [A-D),
N2 for [D-M),
N3 for [M-X),
and N4 for [X-A).

If you have a replication factor of 1, data will go on the nodes like this:

B -> N1
E -> N2
X -> N4

And so on.
If you have a higher replication factor, the placement strategy decides which
node will take replicas of which partition (becoming a secondary node for that
partition).
The simple strategy will just put the replica on the next node in the ring.
So, same example as above but with an RF of 2 and the simple strategy:

B -> N1 and N2
E -> N2 and N3
X -> N4 and N1

Other strategies can factor in things like "put data in another datacenter" or
"put data in another rack".

Even though the terms primary and secondary imply some difference in quality or
consistency, this is not the case. If a node is responsible for a piece of
data, it will store it.
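The letter-token example can be written out as a small sketch (illustrative only: real tokens are numbers, and real SimpleStrategy additionally skips duplicate endpoints and knows nothing about letters):

```python
from bisect import bisect_right

# Toy ring from the example: start token -> node, in ring order.
RING = [("A", "N1"), ("D", "N2"), ("M", "N3"), ("X", "N4")]
TOKENS = [token for token, _ in RING]

def primary(key_token):
    """Node owning the range the key's token falls into (the ring wraps)."""
    i = bisect_right(TOKENS, key_token) - 1
    return RING[i % len(RING)][1]

def placement(key_token, rf):
    """SimpleStrategy-like placement: primary plus the next rf-1 ring nodes."""
    i = bisect_right(TOKENS, key_token) - 1
    return [RING[(i + j) % len(RING)][1] for j in range(rf)]
```

With RF 2 this reproduces the mail's example: B lands on N1 and N2, E on N2 and N3, X on N4 and N1; a token sorting before "A" wraps around into N4's [X-A) range.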


But placement of the replicas is usually only relevant for availability reasons
(i.e. disaster recovery etc.).
The actual location should mean nothing to most applications, as you can ask
any node for the data you want and it will provide it to you (fetching it from
the responsible nodes).
This should be sufficient in almost all cases.

So, in the above example again, you can ask N3 what data is available and it
will tell you: B, E and X. Or you could ask it "give me X", and it will fetch
it from N4 or N1 or both of them, depending on the consistency configuration,
and return the data to you.


So actually, if you use Cassandra, the actual storage location of the data
should not matter to the application. The data will be available anywhere in
the cluster as long as it is stored on any reachable node.

From: Andreas Rudolph [mailto:andreas.rudo...@spontech-spine.com]
Sent: Tuesday, 10 January 2012 15:06
To: user@cassandra.apache.org
Subject: Re: Re: How to control location of data?

Hi!

Thank you for your last reply. I'm still wondering if I got you right...

...
A partitioner decides into which partition a piece of data belongs
Does your statement imply that the partitioner does not make any decisions at
all about the (physical) storage location? Or, put another way: what do you
mean by "partition"?

To quote http://wiki.apache.org/cassandra/ArchitectureInternals: ... 
AbstractReplicationStrategy controls what nodes get secondary, tertiary, etc. 
replicas of each key range. Primary replica is always determined by the token 
ring (...)


...
You can select different placement strategies and partitioners for different 
keyspaces, thereby choosing known data to be stored on known hosts.
This is however discouraged for various reasons - i.e.  you need a lot of 
knowledge about your data to keep the cluster balanced. What is your usecase 
for this requirement? there is probably a more suitable solution.

What we want is to partition the cluster with respect to key spaces.
That is we want to establish an association between nodes and key spaces so 
that a node of the cluster holds data from a key space if and only if that node 
is a *member* of that key space.

To our knowledge Cassandra has no built-in way to specify such a
membership relation. Therefore we thought of implementing our own replica
placement strategy, until we started to assume that the partitioner had to be
replaced too to accomplish the task.

Do you have any ideas?




Re: Garbage collection freezes Cassandra node

2011-12-19 Thread Roland Gude
Tuning garbage collection is really hard, especially if you do not know why
garbage collection stalls.
In general, I must say I have never seen software which shipped with as good a
garbage collection configuration as Cassandra.

The thing that looks suspicious is that the major collections appear regularly
at 1-hour intervals.
The only cause I know of for that pattern is RMI's explicit collections (which
I am not certain Cassandra uses).

You could avoid those with
-XX:+DisableExplicitGC
in the Cassandra start script.

But I assume that Cassandra itself makes use of explicit GC as well, so this
might have some nasty side effects.
Unless someone tells you never to do that on Cassandra, I would give it a try
and see what happens. Keep flushes and deletions of sstables in mind though;
they are somehow tied to GC, I think.

Another option (if it is RMI) is to make the RMI full-GC interval larger, so
that timeouts occur less often:
-Dsun.rmi.dgc.client.gcInterval=60
-Dsun.rmi.dgc.server.gcInterval=60

The number is the interval in ms between triggered GCs.

Anyway, the same warning as before applies.

Cheers.



From: Rene Kochen [mailto:rene.koc...@emea.schange.com]
Sent: Monday, 19 December 2011 16:35
To: user@cassandra.apache.org
Subject: Garbage collection freezes Cassandra node

I recently see the following garbage collect behavior in our performance tests 
(the attached chart shows the heap-size in MB):


During the garbage collections, Cassandra freezes for about ten seconds. I 
observe the following log entries:

GC for ConcurrentMarkSweep: 11597 ms for 1 collections, 1887933144 used; max 
is 8550678528

I use Windows Server 2008 and Cassandra 0.7.10 with min and max heap size set 
to 8 GB.

What can I do to make Cassandra not freeze? Just allocate more memory?

Thanks,

Rene

Re: Pending ReadStage is exploding on only one node

2011-11-23 Thread Roland Gude
Are you using index slice queries?

I described a similar problem a couple of months ago (and mechanisms to
reproduce the behaviour), but unfortunately failed to create an issue for it
(shame on me).
The mail thread is in the archives:
http://www.mail-archive.com/user@cassandra.apache.org/msg16157.html



From: Johann Höchtl [mailto:h.hoec...@ic-drei.de]
Sent: Monday, 21 November 2011 22:17
To: user@cassandra.apache.org
Subject: Re: Pending ReadStage is exploding on only one node

Yes, it's randomly partitioned.

On 21.11.2011 13:47, Jahangir Mohammed wrote:
Hmm... What's the data distribution like on the cluster? Random Partitioner?
On Mon, Nov 21, 2011 at 7:31 AM, Johann Höchtl
h.hoec...@ic-drei.de wrote:
I'm using hector-0.8.0-2.
No custom load balancer.
Hardware is equal on every server.

On 21.11.2011 13:26, Jahangir Mohammed wrote:
I am not so sure from version to version.

1. Which client are you using? Any custom load balancer?
2. Is the hardware on this node any different from other nodes?

Thanks,
Jahangir.
On Mon, Nov 21, 2011 at 5:55 AM, Johann Höchtl
h.hoec...@ic-drei.de wrote:
Hi all,

I'm experiencing strange behaviour of my 6-node Cassandra cluster, and I hope
someone can explain what I'm doing wrong.

The setting:
6-Cassandra Nodes 1.0.3
Random Partitioning
The ColumnFamily in question has a replication factor of 2 and stores products 
of different shops with a secondary index on shop_id.

Twice a day, I do an update of the data with the following mechanism:
Get all keys of a shop.
Read the new CSV.
Insert the rows from the CSV whose keys are not present, and delete the rows
which are no longer present.
Update all prices of the products from the CSV and set an update_date.

I'm measuring a high load on a few nodes during the update process (which is
normal), but one node keeps the high load for a long time after the process.
I checked tpstats and found that on this node there are over 50k
pending ReadStage tasks.
All the other nodes don't show that behaviour.

I already had this problem on cassandra 0.7, but after upgrading to 0.8 it 
disappeared. Now it is back.

Any suggestions?

Thanks,
Hans


--
Mit freundlichen Grüßen,

Johann Höchtl
stellv. IT-Leiter


Adresse
Grafinger Straße 6
81671 München

Kontakt
Web: www.ic3.dehttp://www.ic-drei.de/
E-Mail: h.hoec...@ic-drei.demailto:h.hoec...@ic-drei.de
Tel.: 089 638 666 89 - 0
Fax: 089 638 666 89 - 20










flushwriter all time blocked

2011-08-29 Thread Roland Gude
Hi all,

On a 0.7.8 cluster, in tpstats I can see the flushwriter stage having several 
tasks in state "all time blocked" (immediately after a node restart it's 8, but 
it grows over time to around 300). What does it mean (or how can I find out), 
and what can I do about it?

--
YOOCHOOSE GmbH

Roland Gude
Software Engineer

Im Mediapark 8, 50670 Köln

+49 221 4544151 (Tel)
+49 221 4544159 (Fax)
+49 171 7894057 (Mobil)


Email: roland.g...@yoochoose.com
WWW: www.yoochoose.com

YOOCHOOSE GmbH
Geschäftsführer: Dr. Uwe Alkemper, Michael Friedmann
Handelsregister: Amtsgericht Köln HRB 65275
Ust-Ident-Nr: DE 264 773 520
Sitz der Gesellschaft: Köln



AW: flushwriter all time blocked

2011-08-29 Thread Roland Gude
Hi,

This still leaves me puzzled.

Is it a bad thing?
Why is it happening?

And what does "blocked before being accepted" mean? Does it mean Cassandra did 
not even try to put the task into a queue?


Thanks for enlightening me,

roland


-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: Monday, 29 August 2011 15:10
To: user@cassandra.apache.org
Subject: Re: flushwriter all time blocked

the javadoc for the mbeans explains:

/**
 * Get the number of tasks that had blocked before being accepted (or
 * rejected).
 */
public int getTotalBlockedTasks();

/**
 * Get the number of tasks currently blocked, waiting to be accepted by
 * the executor (because all threads are busy and the backing
queue is full).
 */
public int getCurrentlyBlockedTasks();

On Mon, Aug 29, 2011 at 3:39 AM, Roland Gude roland.g...@yoochoose.com wrote:
 Hi all,



 On a 0.7.8 cluster, in tpstats I can see the flushwriter stage having several
 tasks in state "all time blocked" (immediately after a node restart it's 8 but
 grows over time to around 300). What does it mean (or how can I find out)
 and what can I do about it?





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com




AW: IndexSliceQuery issue - ReadStage piling up (looks like deadlock/infinite loop or similar)

2011-08-11 Thread Roland Gude
Yes, I can reproduce this behavior.

If I issue a query like this (on 0.7.8 with the patch for CASSANDRA-2964 applied)
[default@demo]get users where birth_date = 1968 and state = 'UT';
with an index on birth_date but no index on state,
I do not get results (actually I get '0 rows') even though there are rows which 
satisfy all clauses.
However, if I repeat this several times, several of the nodes start piling up 
pending reads (tpstats shows some 8 reads pending).
And even though the nodes are not able to fulfill (read) requests anymore, they 
are not marked as down by the gossiper. Overall this results in an unusable 
cluster.

If I do the same thing on a 0.7.5 cluster,
Cassandra logs a NullPointerException and the cli returns with null, but the 
cluster stays functional.


From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Wednesday, 10 August 2011 23:48
To: user@cassandra.apache.org
Subject: Re: IndexSliceQuery issue - ReadStage piling up (looks like 
deadlock/infinite loop or similar)

Are you still having a problem? I'm a bit confused about what you're saying.

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 10 Aug 2011, at 03:33, Roland Gude wrote:


Hi,

I experience issues when doing an IndexSliceQuery with multiple expressions if 
one of the expressions is on a non-indexed column.

I did the equivalent of this example (but with my data) from
http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes


Secondary indexes automate this. Let's add some state data:

[default@demo] set users[bsanderson][state] = 'UT';
[default@demo] set users[prothfuss][state] = 'WI';
[default@demo] set users[htayler][state] = 'UT';

Note that even though state is not indexed yet, we can include the new state 
data in a query as long as another column in the query is indexed:

[default@demo] get users where state = 'UT';
No indexed columns present in index clause with operator EQ
[default@demo] get users where state = 'UT' and birth_date > 1970;
No indexed columns present in index clause with operator EQ
[default@demo]get users where birth_date = 1968 and state = 'UT';
---
RowKey: htayler
=> (column=birth_date, value=1968, timestamp=1291334765649000)
=> (column=full_name, value=Howard Tayler, timestamp=129133474916)
=> (column=state, value=5554, timestamp=1291334890708000)

On 0.7.8 (with CASSANDRA-2964 applied),
this example will not return any data, but return '0 rows'. I repeated the 
query multiple times with different variations for the values, which should all 
have returned data, but eventually I ended up with the cluster having 8 
reads pending on some of the nodes.

On 0.7.5 the query will result in a NullPointerException being thrown and 
null returned in the cli:

ERROR [ReadStage:258] 2011-08-09 16:03:27,153 AbstractCassandraDaemon.java 
(line 113) Fatal exception in thread Thread[ReadStage:258,5,main]
java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:51)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
at org.apache.cassandra.db.ColumnFamily.addAll(ColumnFamily.java:131)
at 
org.apache.cassandra.db.ColumnFamilyStore.scan(ColumnFamilyStore.java:1615)
at 
org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:42)
... 4 more
ERROR [ReadStage:258] 2011-08-09 16:03:27,153 AbstractCassandraDaemon.java 
(line 113) Fatal exception in thread Thread[ReadStage:258,5,main]
java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:51)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
at org.apache.cassandra.db.ColumnFamily.addAll(ColumnFamily.java:131)
at 
org.apache.cassandra.db.ColumnFamilyStore.scan(ColumnFamilyStore.java:1615)
at 
org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:42)
... 4 more



Can anybody reproduce this?

Greetings,
Roland


IndexSliceQuery issue - ReadStage piling up (looks like deadlock/infinite loop or similar)

2011-08-09 Thread Roland Gude
Hi,

I experience issues when doing an IndexSliceQuery with multiple expressions if 
one of the expressions is on a non-indexed column.

I did the equivalent of this example (but with my data) from
http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes


Secondary indexes automate this. Let's add some state data:

[default@demo] set users[bsanderson][state] = 'UT';
[default@demo] set users[prothfuss][state] = 'WI';
[default@demo] set users[htayler][state] = 'UT';

Note that even though state is not indexed yet, we can include the new state 
data in a query as long as another column in the query is indexed:

[default@demo] get users where state = 'UT';
No indexed columns present in index clause with operator EQ
[default@demo] get users where state = 'UT' and birth_date > 1970;
No indexed columns present in index clause with operator EQ
[default@demo]get users where birth_date = 1968 and state = 'UT';
---
RowKey: htayler
=> (column=birth_date, value=1968, timestamp=1291334765649000)
=> (column=full_name, value=Howard Tayler, timestamp=129133474916)
=> (column=state, value=5554, timestamp=1291334890708000)

On 0.7.8 (with CASSANDRA-2964 applied),
this example will not return any data, but return '0 rows'. I repeated the 
query multiple times with different variations for the values, which should all 
have returned data, but eventually I ended up with the cluster having 8 
reads pending on some of the nodes.

On 0.7.5 the query will result in a NullPointerException being thrown and 
null returned in the cli:

ERROR [ReadStage:258] 2011-08-09 16:03:27,153 AbstractCassandraDaemon.java 
(line 113) Fatal exception in thread Thread[ReadStage:258,5,main]
java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:51)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
at org.apache.cassandra.db.ColumnFamily.addAll(ColumnFamily.java:131)
at 
org.apache.cassandra.db.ColumnFamilyStore.scan(ColumnFamilyStore.java:1615)
at 
org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:42)
... 4 more
ERROR [ReadStage:258] 2011-08-09 16:03:27,153 AbstractCassandraDaemon.java 
(line 113) Fatal exception in thread Thread[ReadStage:258,5,main]
java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:51)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
at org.apache.cassandra.db.ColumnFamily.addAll(ColumnFamily.java:131)
at 
org.apache.cassandra.db.ColumnFamilyStore.scan(ColumnFamilyStore.java:1615)
at 
org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:42)
... 4 more



Can anybody reproduce this?

Greetings,
Roland




AW: results of index slice query

2011-07-29 Thread Roland Gude
Hi,

I have so far not been able to reproduce this bug on any other cluster than our 
production cluster, which started showing the behavior only after the upgrade 
from 0.7.5 to 0.7.7. I have attached logs to the issue, but I have absolutely no 
clue how to move forward. Any ideas, anybody?

-----Original Message-----
From: Roland Gude [mailto:roland.g...@yoochoose.com] 
Sent: Thursday, 28 July 2011 11:22
To: user@cassandra.apache.org
Subject: AW: results of index slice query

Created https://issues.apache.org/jira/browse/CASSANDRA-2964

-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: Wednesday, 27 July 2011 17:35
To: user@cassandra.apache.org
Subject: Re: results of index slice query

Sounds like a Cassandra bug to me.

On Wed, Jul 27, 2011 at 6:44 AM, Roland Gude roland.g...@yoochoose.com wrote:
 Hi,

 I was just noticing that when I do an IndexSliceQuery with the index
 column not in the slice range, the index column will be returned anyway. Is
 this behavior intended or is it a bug (and if so, is it a Cassandra bug or a
 hector bug)?

 I am using Cassandra 0.7.7 and hector 0.7-26



 Greetings,

 roland













AW: results of index slice query

2011-07-28 Thread Roland Gude
Created https://issues.apache.org/jira/browse/CASSANDRA-2964

-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: Wednesday, 27 July 2011 17:35
To: user@cassandra.apache.org
Subject: Re: results of index slice query

Sounds like a Cassandra bug to me.

On Wed, Jul 27, 2011 at 6:44 AM, Roland Gude roland.g...@yoochoose.com wrote:
 Hi,

 I was just noticing that when I do an IndexSliceQuery with the index
 column not in the slice range, the index column will be returned anyway. Is
 this behavior intended or is it a bug (and if so, is it a Cassandra bug or a
 hector bug)?

 I am using Cassandra 0.7.7 and hector 0.7-26



 Greetings,

 roland












results of index slice query

2011-07-27 Thread Roland Gude
Hi,
I was just noticing that when I do an IndexSliceQuery with the index column 
not in the slice range, the index column will be returned anyway. Is this 
behavior intended or is it a bug (and if so, is it a Cassandra bug or a hector 
bug)?
I am using Cassandra 0.7.7 and hector 0.7-26

Greetings,
roland




AW: Multi-type column values in single CF

2011-07-03 Thread Roland Gude
You could do the serialization for all your supported datatypes yourself (many 
libraries for serialization are available, and a pretty thorough benchmark of 
them can be found here: https://github.com/eishay/jvm-serializers/wiki) and 
prepend the serialized bytes with an identifier for your datatype.
This would not avoid casting, but would still perform better than serializing 
to strings as is done in your example.
Prepending the values with the id seems better to me, because you can be sure 
that a new insertion to some field overwrites the correct column even if it 
changed the type.
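A minimal sketch of this idea (the one-byte type ids and the struct-based encoding below are illustrative assumptions, not a recommendation for a particular serialization library):

```python
import struct

# Hypothetical one-byte type identifiers prepended to each serialized value.
TYPE_INT, TYPE_STR = 0x01, 0x02

def encode_value(value):
    """Serialize a value and prepend a type-id byte."""
    if isinstance(value, int):
        return bytes([TYPE_INT]) + struct.pack(">q", value)
    if isinstance(value, str):
        return bytes([TYPE_STR]) + value.encode("utf-8")
    raise TypeError("unsupported type: %r" % type(value))

def decode_value(data):
    """Read the type-id byte and deserialize the remainder."""
    type_id, payload = data[0], data[1:]
    if type_id == TYPE_INT:
        return struct.unpack(">q", payload)[0]
    if type_id == TYPE_STR:
        return payload.decode("utf-8")
    raise ValueError("unknown type id: %d" % type_id)
```

Because the type id lives in the value rather than the column name, writing a new value of a different type to the same column simply overwrites the old one, as described above.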

-----Original Message-----
From: osishkin osishkin [mailto:osish...@gmail.com] 
Sent: Sunday, 3 July 2011 13:52
To: user@cassandra.apache.org
Subject: Multi-type column values in single CF

Hi all,

I need to store column values that are of various data types in a
single column family, i.e. I have column values that are integers,
others that are strings, and maybe more later. All column names are
strings (no comparator problem for me).
The thing is I need to store unstructured data - I do not have fixed
and known-in-advance column names, so I cannot use a fixed static map
for casting the values back to their original type on retrieval from
cassandra.

My immediate naive thought is to simply prefix every column name with
the type the value needs to be cast back to.
For example, I'll do the following conversion to the columns of some key -
{'attr1': 'val1', 'attr2': 100}  ~ {'str_attr1': 'val1', 'int_attr2': '100'}
and only then send it to cassandra. This way I know what I should
cast it back to.

But all this casting back and forth on the client side seems to me to
be very bad for performance.
Another option is to split the columns into dedicated column families
with matching validation types - a column family for integer values,
one for strings, one for timestamps, etc.
But that does not seem very efficient either (and worse for any
rollback mechanism), since now I have to perform several get calls on
multiple CFs where once I had only one.

I thought perhaps someone has encountered a similar situation in the
past, and can offer some advice on the best course of action.

Thank you,
Osi




AW: Column value type

2011-06-22 Thread Roland Gude
There is a comparator type (for the name) and a validation type (for the value).
If you have set the validation to be UTF8, you can only store data that is valid 
UTF8 there.
The default validation is BytesType, so it should accept everything unless 
otherwise specified.

I cannot tell anything regarding pycassa client side validation though.

-----Original Message-----
From: osishkin osishkin [mailto:osish...@gmail.com] 
Sent: Wednesday, 22 June 2011 13:14
To: user@cassandra.apache.org
Subject: Column value type

Is there a limitation on the data type of a column value (not column
name) in cassandra?
I'm saving data using a pycassa client, for a UTF8 column family, and
I get an error when I try saving integer data values.
Only when I convert the values to strings can I save the data.
Looking at the pycassa code, it seems to prevent me from sending non-string data.
It doesn't make sense to me, since as far as I understood things, the
type should apply only to column names (for comparison etc.).
Am I wrong?

Thank you




Re: range query vs slice range query

2011-05-25 Thread Roland Gude
I cannot display the book page you are referring to, but your general 
understanding is correct. A range refers to several rows; a slice refers to 
several columns. A range slice is a combination of both: from all rows in a 
range, get a specific slice of columns.

On 25.05.2011 at 10:43, david lee iecan...@gmail.com wrote:

hi guys,

i'm reading up on the book Cassandra - Definitive guide
and i don't seem to understand what it says about ranges and slices

my understanding is
a range as in a mathematical range to define a subset from an ordered set of 
elements,
in cassandra typically means a range of rows whereas
a slice means a range of columns.

a range query refers to a query to retrieve a range of rows, whereas
a slice range query refers to a query to retrieve a range of columns within a row.

i may be talking about total nonsense but i really am more confused after 
reading this portion of the book
http://books.google.com/books?id=MKGSbCbEdg0Cpg=PA134lpg=PA134dq=cassandra+%22range+query%22+%22range+slice%22source=blots=XoPB4uA60usig=uDDoQe0FRkQobHnr-vPvvQ3B8TQhl=enei=ub3cTcvGLZLevQOuxs3CDwsa=Xoi=book_resultct=resultresnum=4ved=0CCwQ6AEwAw#v=onepageq=cassandra%20%22range%20query%22%20%22range%20slice%22f=false

many thanx in advance
david



Re: range query vs slice range query

2011-05-25 Thread Roland Gude
That is correct. Random partitioner orders rows according to the MD5 sum.

On 25.05.2011 at 16:11, Robert Jackson robe...@promedicalinc.com wrote:

Also, it is my understanding that if you are not using 
OrderPreservingPartitioner, a get_range_slices may not return what you would 
expect.

With the RandomPartitioner you can iterate over the complete list by using the 
last row key as the start for subsequent requests, but if you are using a 
single query you will be returned all the rows where the returned row key's md5 
is between the md5 of the start row key and stop row key.
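The iteration pattern described above can be sketched in plain Python; the `fetch_page` callable below stands in for a real `get_range_slices` request and is purely hypothetical:

```python
def iterate_all_rows(fetch_page, page_size=100):
    """Iterate over every row by restarting each page at the last seen key.

    fetch_page(start_key, count) must return (key, columns) pairs in ring
    (token) order, beginning at start_key inclusively -- the same contract
    as get_range_slices with the RandomPartitioner.
    """
    start = ""  # an empty start key means "beginning of the ring"
    while True:
        page = fetch_page(start, page_size)
        if not page:
            return
        for key, columns in page:
            # Skip the first row of every page after the first request,
            # because start_key is inclusive and was already yielded.
            if key != start:
                yield key, columns
        last_key = page[-1][0]
        if len(page) < page_size or last_key == start:
            return  # short page: we have reached the end of the ring
        start = last_key
```

The only subtlety is that the start key of each follow-up request is inclusive, so the first row of every page after the first must be dropped to avoid duplicates.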

Reference:
http://wiki.apache.org/cassandra/FAQ - "Why aren't range slices/sequential 
scans giving me the expected results?"

Robert Jackson


From: Jonathan Ellis jbel...@gmail.com
To: user@cassandra.apache.org
Sent: Wednesday, May 25, 2011 8:54:34 AM
Subject: Re: range query vs slice range query

get_range_slices is the api to get a slice (of columns) from each of a
range (of rows)

On Wed, May 25, 2011 at 3:42 AM, david lee iecan...@gmail.com wrote:
 hi guys,
 i'm reading up on the book Cassandra - Definitive guide
 and i don't seem to understand what it says about ranges and slices
 my understanding is
 a range as in a mathematical range to define a subset from an ordered set
 of elements,
 in cassandra typically means a range of rows whereas
 a slice means a range of columns.
 a range query refers to a query to retrieve a range of rows whereas
 a slice range query refers to a query to retrieve a range of columns within a
 row.
 i may be talking about total nonsense but i really am more confused after
 reading this portion of the book
 http://books.google.com/books?id=MKGSbCbEdg0Cpg=PA134lpg=PA134dq=cassandra+%22range+query%22+%22range+slice%22source=blots=XoPB4uA60usig=uDDoQe0FRkQobHnr-vPvvQ3B8TQhl=enei=ub3cTcvGLZLevQOuxs3CDwsa=Xoi=book_resultct=resultresnum=4ved=0CCwQ6AEwAw#v=onepageq=cassandra%20%22range%20query%22%20%22range%20slice%22f=false
 many thanx in advance
 david







AW: Does anyone have Cassandra running on OpenSolaris?

2011-05-09 Thread Roland Gude

Use bash as a shell

#bash bin/cassandra -f


-Ursprüngliche Nachricht-
Von: Jeffrey Kesselman [mailto:jef...@gmail.com] 
Gesendet: Montag, 9. Mai 2011 17:12
An: user@cassandra.apache.org
Betreff: Does anyone have Cassandra running on OpenSolaris?

I get this error:

bin/cassandra: syntax error at line 29: `system_memory_in_mb=$' unexpected

Thanks

JK


-- 
It's always darkest just before you are eaten by a grue.




Re: low performance inserting

2011-05-03 Thread Roland Gude
Hi,
Not sure this is the case for your bad performance, but you are measuring data 
creation and insertion together. Your data creation involves lots of class 
casts, which are probably quite slow.
Try timing only the b.send part and see how long that takes.

Roland

On 03.05.2011 at 12:30, charles THIBAULT charl.thiba...@gmail.com wrote:

 Hello everybody, 
 
 first: sorry for my english in advance!!
 
 I'm getting started with Cassandra on a 5 nodes cluster inserting data
 with the pycassa API.
 
 I've read everywhere on the internet that cassandra's performance is better than 
 MySQL's
 because writes only append to commit log files.
 
 When I try to insert 100 000 rows with 10 columns per row with a batch 
 insert, I get this result: 27 seconds.
 But with MySQL (load data infile) this takes only 2 seconds (using indexes)
 
 Here my configuration
 
 cassandra version: 0.7.5
 nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213, 
 192.168.1.214
 seed: 192.168.1.210
 
 My script
 *
 #!/usr/bin/env python
 
 import pycassa
 import time
 import random
 from cassandra import ttypes
 
 pool = pycassa.connect('test', ['192.168.1.210:9160'])
 cf = pycassa.ColumnFamily(pool, 'test')
 b = cf.batch(queue_size=50, 
 write_consistency_level=ttypes.ConsistencyLevel.ANY)
 
 tps1 = time.time()
 for i in range(10):
 columns = dict()
 for j in range(10):
 columns[str(j)] = str(random.randint(0,100))
 b.insert(str(i), columns)
 b.send()
 tps2 = time.time()
 
 
 print("execution time: " + str(tps2 - tps1) + " seconds")
 *
 
 What am I doing wrong?


AW: AW: Two versions of schema

2011-04-19 Thread Roland Gude
Yeah, it happens from time to time, even if everything seems to be fine, that 
schema changes don't work correctly. But it's always repairable with the 
described procedure. Therefore, having the operator available is a must-have, I 
think.

Drain is a nodetool command. The node flushes data and stops accepting new 
writes. This just speeds up bringing the node back up again in this case. 
Probably a flush is equally acceptable.

-----Original Message-----
From: mcasandra [mailto:mohitanch...@gmail.com] 
Sent: Monday, 18 April 2011 18:27
To: cassandra-u...@incubator.apache.org
Subject: Re: AW: Two versions of schema

In my case all hosts were reachable and I ran nodetool ring before running
the schema update. I don't think it was because of a node being down. I think
for some reason it just took over 10 secs because I was reducing key_cache
from 1M to 1000. I think it might take long to trim the keys, hence the 10
sec default may not be the right choice.

What is drain?

--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Two-versions-of-schema-tp6277365p6284276.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.




Re: Site Not Surviving a Single Cassandra Node Crash

2011-04-10 Thread Roland Gude
Not sure about that Hector version, but there was a Hector bug where Hector did 
not stop using a dead node as proxy and did not do proper load balancing of the 
requests. If you enable trace logs for Hector, you can see which nodes it uses 
for requests. If there is a newer 0.6 Hector, you should give it a try.
Furthermore, I suggest bringing down one node and requesting data with the cli. 
If that works, it is probably the Hector bug.

On 10.04.2011 at 06:57, Patricio Echagüe patric...@gmail.com wrote:

What is the consistency level you are using ?

And as Ed said, if you can provide the stacktrace that would help too.

On Sat, Apr 9, 2011 at 7:02 PM, aaron morton aa...@thelastpickle.com wrote:
btw, the nodes are a tad out of balance was that deliberate ?

http://wiki.apache.org/cassandra/Operations#Token_selection
http://wiki.apache.org/cassandra/Operations#Load_balancing


Aaron

On 10 Apr 2011, at 08:44, Ed Anuff wrote:

Sounds like the problem might be on the hector side.  Lots of hector
users on this list, but usually not a bad idea to ask on
hector-us...@googlegroups.com (cc'd).

"The jetty servers stop responding" is a bit vague; somewhere in
your logs is an error message that should shed some light on where
things are going awry.  If you can find the exception that's being
thrown in hector and post that, it'd make it much easier to help you
out.

Ed

On Sat, Apr 9, 2011 at 12:11 PM, Vram Kouramajian
vram.kouramaj...@gmail.com wrote:
The hector clients are used as part of our jetty servers. And, the
jetty servers stop responding when one of the Cassandra nodes go down.

Vram

On Sat, Apr 9, 2011 at 11:54 AM, Joe Stump j...@joestump.net wrote:
Did the Cassandra cluster go down or did you start getting failures from the 
client when it routed queries to the downed node? The key in the client is to 
keep working around the ring if the initial node is down.

--Joe

On Apr 9, 2011, at 12:52 PM, Vram Kouramajian wrote:

We have a 5 Cassandra nodes with the following configuration:

Casandra Version: 0.6.11
Number of Nodes: 5
Replication Factor: 3
Client: Hector 0.6.0-14
Write Consistency Level: Quorum
Read Consistency Level: Quorum
Ring Topology:
                Status  Load     Owns    Range                                    Ring

                                         132756707369141912386052673276321963528
192.168.89.153  Up      4.15 GB  33.87%  20237398133070283622632741498697119875   |--|
192.168.89.155  Up      5.17 GB  18.29%  51358066040236348437506517944084891398   |   ^
192.168.89.154  Up      7.41 GB  33.97%  109158969152851862753910401160326064203  v   |
192.168.89.152  Up      5.07 GB  6.34%   119944993359936402983569623214763193674  |   ^
192.168.89.151  Up      4.22 GB  7.53%   132756707369141912386052673276321963528  |--|

We believe that our setup should survive the crash of one of the
Cassandra nodes. But we had a few crashes, and the system stopped
functioning until we brought the Cassandra nodes back.

Any clues?

Vram







Re: Atomicity Strategies

2011-04-10 Thread Roland Gude

A strategy that should cover at least some use cases is roughly like this:

Given that CF A and CF B should be in sync:
On a write 'a' to CF A, add another column 'synchronisation_token' and write a
TimeUUID 'T' (or a timestamp or some other value that allows time-based
ordering) as its value.
On the related write to CF B, write the token as well.
When reading, check client-side whether the tokens match, and re-read the data
with the lower token until they do.
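As a rough in-memory illustration of that scheme (plain maps stand in for the two column families; `SYNC_TOKEN` and the method names are made up for the sketch, not any Cassandra or Hector API):

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the synchronisation-token scheme described above.
public class SyncTokenSketch {
    static final String SYNC_TOKEN = "synchronisation_token";

    final Map<String, Map<String, Long>> cfA = new HashMap<>();
    final Map<String, Map<String, Long>> cfB = new HashMap<>();

    // Write the payload to both CFs together with the shared ordering token
    // (a TimeUUID or timestamp in the real scheme).
    void write(String row, long payload, long token) {
        put(cfA, row, "a", payload, token);
        put(cfB, row, "b", payload, token);
    }

    private static void put(Map<String, Map<String, Long>> cf, String row,
                            String col, long payload, long token) {
        Map<String, Long> r = cf.computeIfAbsent(row, k -> new HashMap<>());
        r.put(col, payload);
        r.put(SYNC_TOKEN, token);
    }

    // Client-side check: the pair of rows is consistent only when both tokens
    // exist and match; a reader that sees a mismatch re-reads the side with
    // the lower token until they agree.
    boolean tokensMatch(String row) {
        Map<String, Long> ra = cfA.get(row);
        Map<String, Long> rb = cfB.get(row);
        Long ta = ra == null ? null : ra.get(SYNC_TOKEN);
        Long tb = rb == null ? null : rb.get(SYNC_TOKEN);
        return ta != null && ta.equals(tb);
    }

    public static void main(String[] args) {
        SyncTokenSketch s = new SyncTokenSketch();
        s.write("row1", 42L, 1L);
        System.out.println(s.tokensMatch("row1"));  // true
        // Simulate a partial update: only CF A has the newer token yet.
        s.cfA.get("row1").put(SYNC_TOKEN, 2L);
        System.out.println(s.tokensMatch("row1"));  // false
    }
}
```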


Roland


On 10.04.2011, at 03:53, aaron morton aa...@thelastpickle.com wrote:

My understanding of what they did with locking (based on the examples) was to
achieve a level of transaction isolation
(http://en.wikipedia.org/wiki/Isolation_(database_systems)).

I think the issue here is more about atomicity
(http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic): we cannot guarantee
that all or none of the mutations in your batch are completed. There is some
work in this area though: https://issues.apache.org/jira/browse/CASSANDRA-1684

AFAIK the best approach for now is to work at Quorum and write your code to
handle missing relations. Also, Cassandra does a lot of work upfront before the
write starts to ensure it will succeed; failures during a write will probably
be due to a SW/HW failure or overload on a node that gossip has not picked up.

Retrying is the recommended approach when a request fails.
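A minimal sketch of that retry advice (the attempt limit and linear backoff are arbitrary example values, not Cassandra or Hector defaults; the operation is assumed idempotent, e.g. a write replayed with the same client timestamp):

```java
import java.util.concurrent.Callable;

// Generic bounded retry with linear backoff, as a sketch of "retry when a
// request fails". Assumes maxAttempts >= 1 and an idempotent operation.
public class RetrySketch {
    static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();                   // attempt the operation
            } catch (Exception e) {
                last = e;                           // remember the failure
                Thread.sleep(backoffMs * attempt);  // simple linear backoff
            }
        }
        throw last;                                 // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        }, 5, 1L);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "ok after 3 attempts"
    }
}
```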

Hope that helps.
Aaron

On 9 Apr 2011, at 15:58, Dan Washusen wrote:

Here's a good writeup on how fightmymonster.com does it:

http://ria101.wordpress.com/category/nosql-databases/locking/

--
Dan Washusen
Make big files fly
visit digitalpigeon.com

On Saturday, 9 April 2011 at 11:53 AM, Alex Araujo wrote:

On 4/8/11 5:46 PM, Drew Kutcharian wrote:
I'm interested in this too, but I don't think this can be done with Cassandra 
alone. Cassandra doesn't support transactions. I think hector can retry 
operations, but I'm not sure about the atomicity of the whole thing.



On Apr 8, 2011, at 1:26 PM, Alex Araujo wrote:

Hi, I was wondering if there are any patterns/best practices for creating 
atomic units of work when dealing with several column families and their 
inverted indices.

For example, if I have Users and Groups column families and did something like:

Users.insert( user_id, columns )
UserGroupTimeline.insert( group_id, { timeuuid() : user_id } )
UserGroupStatus.insert( group_id + ":" + user_id, { "Active" : True } )
UserEvents.insert( timeuuid(), { "user_id" : user_id, "group_id" : group_id,
"event_type" : "join" } )

Would I want the client to retry all subsequent operations that failed against 
other nodes after n succeeded, maintain an undo queue of operations to run, 
batch the mutations and choose a strong consistency level, some combination of 
these/others, etc?

Thanks,
Alex
Thanks Drew. I'm familiar with the lack of transactions and have read about
people using ZK (possibly Cages as well?) to accomplish this, but since
it seems that inverted indices are commonplace, I'm interested in how
anyone is mitigating the lack of atomicity to any extent without the use of
such tools. It appears that Hector and Pelops have retrying built in to
their APIs, and I'm fairly confident that proper use of those
capabilities may help. Just trying to cover all bases. Hopefully
someone can share their approaches and/or experiences. Cheers, Alex.




Re: Secondary Index keeping track of column names

2011-04-07 Thread Roland Gude
You could simulate it though. Just add a meta column with a boolean value
indicating whether the referred column is in the row or not, then add an index
on that meta column and query for it.
E.g. row a: (c=1234), (has_c=Yes)
Query: list cf where has_c=Yes
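A sketch of the meta-column idea, with a plain map standing in for the column family: every write of column "c" also writes the flag "has_c", and the "index query" is just an equality match on that flag (which is what a KEYS index supports). All names here are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch: keep an indexable meta column in sync with column "c".
public class MetaColumnSketch {
    final Map<String, Map<String, String>> cf = new HashMap<>();

    // Write "c" and the meta column together, so the secondary index on
    // "has_c" always reflects the presence of "c".
    void writeC(String rowKey, String value) {
        Map<String, String> row = cf.computeIfAbsent(rowKey, k -> new HashMap<>());
        row.put("c", value);
        row.put("has_c", "Yes");
    }

    // Stand-in for "list cf where has_c=Yes": equality match on the flag.
    List<String> rowsWithC() {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : cf.entrySet()) {
            if ("Yes".equals(e.getValue().get("has_c"))) keys.add(e.getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        MetaColumnSketch m = new MetaColumnSketch();
        m.writeC("a", "1234");
        m.cf.computeIfAbsent("b", k -> new HashMap<>()).put("x", "1");
        System.out.println(m.rowsWithC());  // [a]
    }
}
```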

On 06.04.2011, at 18:52, Jonathan Ellis jbel...@gmail.com wrote:

 No, 0.7 indexes handle equality queries; you're basically asking for an
 IS NOT NULL query.
 
 On Wed, Apr 6, 2011 at 11:23 AM, Jeremiah Jordan
 jeremiah.jor...@morningstar.com wrote:
In 0.7.X is there a way to have an automatic secondary index
 which keeps track of what keys contain a certain column?  Right now we
 are keeping track of this manually, so we can quickly get all of the
 rows which contain a given column, it would be nice if it was automatic.
 
 -Jeremiah
 
 
 Jeremiah Jordan
 Application Developer
 Morningstar, Inc.
 
 Morningstar. Illuminating investing worldwide.
 
 +1 312 696-6128 voice
 jeremiah.jor...@morningstar.com
 
 www.morningstar.com
 
 This e-mail contains privileged and confidential information and is
 intended only for the use of the person(s) named above. Any
 dissemination, distribution, or duplication of this communication
 without prior written consent from Morningstar is strictly prohibited.
 If you have received this message in error, please contact the sender
 immediately and delete the materials from any computer.
 
 
 
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com
 


AW: Strange nodetool repair behaviour

2011-04-04 Thread Roland Gude
I am experiencing the same behavior but had it on previous versions of 0.7 as 
well.

 
-----Original Message-----
From: Jonas Borgström [mailto:jonas.borgst...@trioptima.com]
Sent: Monday, 4 April 2011 12:26
To: user@cassandra.apache.org
Subject: Strange nodetool repair behaviour

Hi,

I have a 6 node 0.7.4 cluster with replication_factor=3 where nodetool
repair keyspace behaves really strange.

The keyspace contains three column families and about 60GB data in total
(i.e 30GB on each node).

Even though no data has been added or deleted since the last repair, a
repair takes hours and the repairing node seems to receive 100+GB worth
of sstable data from its neighbourhood nodes, i.e several times the
actual data size.

The log says things like:

Performing streaming repair of 27 ranges

And a bunch of:

Compacted to filename 22,208,983,964 to 4,816,514,033 (~21% of original)

In the end the repair finishes without any error after a few hours but
even then the active sstables seems to contain lots of redundant data
since the disk usage can be sliced in half by triggering a major compaction.

All this leads me to believe that something stops the AES from correctly
figuring out what data is already on the repairing node and what needs
to be streamed from the neighbours.

The only thing I can think of right now is that one of the column
families contains a lot of large rows that are larger than
memtable_throughput and that's perhaps what's confusing the merkle tree.

Anyway, is this a known problem or perhaps expected behaviour?
Otherwise I'll try to create a more reproducible test case.

Regards,
Jonas




AW: too many open files - maybe a fd leak in indexslicequeries

2011-04-02 Thread Roland Gude

Hi,

The open file limit is 1024.
SSTable count is somewhere around 20 or so; thread count is in the same order
of magnitude, I guess.
But lsof shows that deleted sstables still have open file handles. This seems
to be the issue, as this number keeps growing.
Any ideas?

Roland.

-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Friday, 1 April 2011 06:07
To: user@cassandra.apache.org
Cc: Roland Gude; Juergen Link; Johannes Hoerle
Subject: Re: too many open files - maybe a fd leak in indexslicequeries

Index queries (ColumnFamilyStore.scan) don't do any low-level i/o
themselves, they go through CFS.getColumnFamily, which is what normal
row fetches also go through.  So if there is a leak there it's
unlikely to be specific to indexes.

What is your open-file limit (remember that sockets count towards
this), thread count, sstable count?

On Thu, Mar 31, 2011 at 4:15 PM, Roland Gude roland.g...@yoochoose.com wrote:
 I experience something that looks exactly like
 https://issues.apache.org/jira/browse/CASSANDRA-1178

 On Cassandra 0.7.3, when using index slice queries (lots of them), this is
 crashing multiple nodes and rendering the cluster useless. But I have no
 clue where to look to see if index queries still leak fds.



 Does anybody know about it?

 Where could I look?



 Greetings,

 roland



 --

 YOOCHOOSE GmbH



 Roland Gude

 Software Engineer



 Im Mediapark 8, 50670 Köln



 +49 221 4544151 (Tel)

 +49 221 4544159 (Fax)

 +49 171 7894057 (Mobil)





 Email: roland.g...@yoochoose.com

 WWW: www.yoochoose.com



 YOOCHOOSE GmbH

 Geschäftsführer: Dr. Uwe Alkemper, Michael Friedmann

 Handelsregister: Amtsgericht Köln HRB 65275

 Ust-Ident-Nr: DE 264 773 520

 Sitz der Gesellschaft: Köln





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com




too many open files - maybe a fd leak in indexslicequeries

2011-03-31 Thread Roland Gude
I experience something that looks exactly like 
https://issues.apache.org/jira/browse/CASSANDRA-1178
On Cassandra 0.7.3, when using index slice queries (lots of them), this is
crashing multiple nodes and rendering the cluster useless. But I have no clue
where to look to see if index queries still leak fds.

Does anybody know about it?
Where could I look?

Greetings,
roland

--
YOOCHOOSE GmbH

Roland Gude
Software Engineer

Im Mediapark 8, 50670 Köln

+49 221 4544151 (Tel)
+49 221 4544159 (Fax)
+49 171 7894057 (Mobil)


Email: roland.g...@yoochoose.com
WWW: www.yoochoose.com

YOOCHOOSE GmbH
Geschäftsführer: Dr. Uwe Alkemper, Michael Friedmann
Handelsregister: Amtsgericht Köln HRB 65275
Ust-Ident-Nr: DE 264 773 520
Sitz der Gesellschaft: Köln



AW: problems while TimeUUIDType-index-querying with two expressions

2011-03-15 Thread Roland Gude
Actually it's not the column values that should be UUIDs in our case, but the
column keys. The CF uses TimeUUID ordering and the values are just some byte
arrays. Even after changing the code to use UUIDSerializer instead of
serializing the UUIDs manually, the issue still exists.

As far as I can see, there is nothing wrong with the IndexExpression:
using two index expressions with key=TimeUUID and value=anything does not work;
using one index expression (either of the other two) alone works fine.

I refactored Johannes' code into a JUnit test case. It needs the cluster
configured as described in Johannes' mail.
There are three cases: two with one of the index expressions and one with both
index expressions. The case with both index expressions will never finish and
you will see the exception in the Cassandra logs.

Bye,
roland

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Tuesday, 15 March 2011 07:54
To: user@cassandra.apache.org
Cc: Juergen Link; Roland Gude; her...@datastax.com
Subject: Re: problems while TimeUUIDType-index-querying with two expressions

Perfectly reasonable, created 
https://issues.apache.org/jira/browse/CASSANDRA-2328

Aaron
On 15 Mar 2011, at 16:52, Jonathan Ellis wrote:


Sounds like we should send an InvalidRequestException then.

On Mon, Mar 14, 2011 at 8:06 PM, aaron morton
aa...@thelastpickle.com wrote:

It's failing when comparing two TimeUUID values because one of them is not
properly formatted. In this case it's comparing a stored value with the
value passed in the get_indexed_slice() query expression.
I'm going to assume it's the value passed for the expression.
When you create the IndexedSlicesQuery this is incorrect:
IndexedSlicesQuery<String, byte[], byte[]> indexQuery = HFactory
.createIndexedSlicesQuery(keyspace,
stringSerializer, bytesSerializer, bytesSerializer);
Use a UUIDSerializer for the last param and then pass the UUID you want to
build the expression with, rather than the string/byte thing you are passing.
Hope that helps.
Aaron
On 15 Mar 2011, at 04:17, Johannes Hoerle wrote:

Hi all,

in order to improve our queries, we started to use IndexedSliceQueries from
the hector project (https://github.com/zznate/hector-examples). I followed
the instructions for creating IndexedSlicesQuery with
GetIndexedSlices.java.
I created the corresponding CF in a keyspace called Keyspace1
(create keyspace Keyspace1;) with:
create column family Indexed1 with column_type='Standard' and
comparator='UTF8Type' and keys_cached=20 and read_repair_chance=1.0 and
rows_cached=2 and column_metadata=[{column_name: birthdate,
validation_class: LongType, index_name: dateIndex, index_type:
KEYS},{column_name: birthmonth, validation_class: LongType, index_name:
monthIndex, index_type: KEYS}];
and the example GetIndexedSlices.java worked fine.

Output of CF Indexed1:
---
[default@Keyspace1] list Indexed1;
Using default limit of 100
---
RowKey: fake_key_12
= (column=birthdate, value=1974, timestamp=1300110485826059)
= (column=birthmonth, value=0, timestamp=1300110485826060)
= (column=fake_column_0, value=66616b655f76616c75655f305f3132,
timestamp=1300110485826056)
= (column=fake_column_1, value=66616b655f76616c75655f315f3132,
timestamp=1300110485826057)
= (column=fake_column_2, value=66616b655f76616c75655f325f3132,
timestamp=1300110485826058)
---
RowKey: fake_key_8
= (column=birthdate, value=1974, timestamp=1300110485826039)
= (column=birthmonth, value=8, timestamp=1300110485826040)
= (column=fake_column_0, value=66616b655f76616c75655f305f38,
timestamp=1300110485826036)
= (column=fake_column_1, value=66616b655f76616c75655f315f38,
timestamp=1300110485826037)
= (column=fake_column_2, value=66616b655f76616c75655f325f38,
timestamp=1300110485826038)
---



Now to the problem:
As we have another column format in our cluster (using TimeUUIDType as
comparator in CF definition) I adapted the application to our schema on a
cassandra-0.7.3 cluster.
We use a manually defined UUID for a mandator id index
(00000000-0000-1000-0000-000000000000) and another one for a userid index
(00000001-0000-1000-0000-000000000000). It can be created with:
create column family ByUser with column_type='Standard' and
comparator='TimeUUIDType' and keys_cached=20 and read_repair_chance=1.0
and rows_cached=2 and column_metadata=[{column_name:
00000000-0000-1000-0000-000000000000, validation_class: BytesType,
index_name: mandatorIndex, index_type: KEYS}, {column_name:
00000001-0000-1000-0000-000000000000, validation_class: BytesType,
index_name: useridIndex, index_type: KEYS}];


which looks in the cluster using cassandra-cli like this:

[default@Keyspace1] describe keyspace;
Keyspace: Keyspace1:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
Replication Factor: 1
  Column Families:
ColumnFamily: ByUser
  Columns sorted

Re: cant seem to figure out secondary index definition

2011-02-21 Thread Roland Gude
Yes,
It has such a clause. I am very certain that this is not my code, because the
very same program works against a cluster if the index is created with the CLI,
and it does not when the index is configured with cassandra.yaml.

My assumption is that the index creation via the configuration file is flawed
(it does not seem to use the same code, as configuration parameters are named
differently); I suspect it creates the index for the wrong column.

Greetings,
Roland.


On 17.02.2011, at 21:46, Nate McCall n...@datastax.com wrote:

 How are you constructing the IndexSlicesQuery? Does it have an equals
 clause with that UUID as the column name?
 
 On Thu, Feb 17, 2011 at 11:32 AM, Roland Gude roland.g...@yoochoose.com 
 wrote:
 Hi again,
 
 
 
 I am still having trouble with this.
 
 If I define the index using cli with these commands:
 
 create column family A with column_type='Standard' and
 comparator='TimeUUIDType' and keys_cached=20 and read_repair_chance=1.0
 and rows_cached=0.0 and column_metadata=[{column_name:
 00000000-0000-1000-0000-000000000000, validation_class: UTF8Type,
 index_name: MyIndex, index_type: KEYS}];
 
 create column family B with column_type='Standard' and
 comparator='TimeUUIDType' and keys_cached=20 and read_repair_chance=1.0
 and rows_cached=0.0 and column_metadata=[{column_name:
 00000000-0000-1000-0000-000000000000, validation_class: UTF8Type,
 index_name: MyIndex, index_type: KEYS}];
 
 
 
 I can do IndexedSliceQueries as expected
 
 
 
 In my unit tests where I use an embedded Cassandra instance configured via
 yaml like this:
 
   - column_metadata: [{name: 00000000-0000-1000-0000-000000000000,
 validator_class: UTF8Type, index_name: MyIndex, index_type: KEYS}]
 
 compare_with: TimeUUIDType
 
 gc_grace_seconds: 864000
 
 keys_cached: 0.0
 
 max_compaction_threshold: 32
 
 min_compaction_threshold: 4
 
 name: A
 
 read_repair_chance: 1.0
 
 rows_cached: 0.0
 
   - column_metadata: [{name: 00000000-0000-1000-0000-000000000000,
 validator_class: UTF8Type, index_name: MyIndex, index_type: KEYS}]
 
 compare_with: TimeUUIDType
 
 gc_grace_seconds: 864000
 
 keys_cached: 0.0
 
 max_compaction_threshold: 32
 
 min_compaction_threshold: 4
 
 name: B
 
 read_repair_chance: 1.0
 
 rows_cached: 0.0
 
 
 
 I get these Exceptions:
 
 
 
 18:23:55.973 [CassandraDataFetcher-queries] ERROR
 c.y.s.c.i.event.CassandraDataFetcher - Query
 me.prettyprint.cassandra.model.IndexedSlicesQuery@1bbd3e2 failed, stop
 query.
 
 me.prettyprint.hector.api.exceptions.HInvalidRequestException:
 InvalidRequestException(why:No indexed columns present in index clause with
 operator EQ)
 
   at
 me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 me.prettyprint.cassandra.service.KeyspaceServiceImpl$12.execute(KeyspaceServiceImpl.java:513)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 me.prettyprint.cassandra.service.KeyspaceServiceImpl$12.execute(KeyspaceServiceImpl.java:495)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:161)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:129)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 me.prettyprint.cassandra.service.KeyspaceServiceImpl.getIndexedSlices(KeyspaceServiceImpl.java:517)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 me.prettyprint.cassandra.model.IndexedSlicesQuery$1.doInKeyspace(IndexedSlicesQuery.java:140)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 me.prettyprint.cassandra.model.IndexedSlicesQuery$1.doInKeyspace(IndexedSlicesQuery.java:131)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 me.prettyprint.cassandra.model.IndexedSlicesQuery.execute(IndexedSlicesQuery.java:130)
 ~[hector-core-0.7.0-26.jar:na]
 
   at
 com.yoochoose.services.cassandra.internal.event.CassandraDataFetcher$1.onMessage(CassandraDataFetcher.java:60)
 [classes/:na]
 
   at
 com.yoochoose.services.cassandra.internal.event.CassandraDataFetcher$1.onMessage(CassandraDataFetcher.java:47)
 [classes/:na]
 
   at
 org.jetlang.channels.ChannelSubscription$1.run(ChannelSubscription.java:31)
 [jetlang-0.2.1.jar:na]
 
   at
 org.jetlang.core.BatchExecutorImpl.execute(BatchExecutorImpl.java:11)
 [jetlang-0.2.1.jar:na]
 
   at
 org.jetlang.core.RunnableExecutorImpl.run(RunnableExecutorImpl.java:34)
 [jetlang-0.2.1

AW: rename index

2011-02-17 Thread Roland Gude
Thanks,
up to now I could not see any problems with the index names.
For now I will not touch it. If I encounter something I'll let you know.

From: Aaron Morton [mailto:aa...@thelastpickle.com]
Sent: Wednesday, 16 February 2011 21:00
To: user@cassandra.apache.org
Subject: Re: rename index

There is no rename, but update column family though the cli or api with just 
the renamed index should work.

The code says it will remove old and add new indexes based on their name.

I'm not sure if the name is used for anything other than identifying the index 
inside the CF. Are the duplicate names causing a problem?

Aaron

On 17/02/2011, at 6:15 AM, Roland Gude roland.g...@yoochoose.com wrote:
Hi,
unfortunately I made a copy-paste error and created two indexes called
“myindex” on different column families.
What can I do to fix this?

Below the output from describe keyspace

ColumnFamily: A
  Columns sorted by: org.apache.cassandra.db.marshal.TimeUUIDType
  Row cache size / save period: 0.0/0
  Key cache size / save period: 20.0/14400
 Memtable thresholds: 1.1203125/239/60
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Column Metadata:
Column Name: 00000000-0000-1000-0000-000000000000
  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
  Index Name: MyIndex
  Index Type: KEYS
ColumnFamily: B
  Columns sorted by: org.apache.cassandra.db.marshal.TimeUUIDType
  Row cache size / save period: 0.0/0
  Key cache size / save period: 20.0/14400
  Memtable thresholds: 1.1203125/239/60
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Column Metadata:
Column Name: 00000000-0000-1000-0000-000000000000
  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
  Index Name: MyIndex
  Index Type: KEYS

--
YOOCHOOSE GmbH

Roland Gude
Software Engineer

Im Mediapark 8, 50670 Köln

+49 221 4544151 (Tel)
+49 221 4544159 (Fax)
+49 171 7894057 (Mobil)


Email: roland.g...@yoochoose.com
WWW: www.yoochoose.com

YOOCHOOSE GmbH
Geschäftsführer: Dr. Uwe Alkemper, Michael Friedmann
Handelsregister: Amtsgericht Köln HRB 65275
Ust-Ident-Nr: DE 264 773 520
Sitz der Gesellschaft: Köln



AW: cant seem to figure out secondary index definition

2011-02-17 Thread Roland Gude
.execute(KeyspaceServiceImpl.java:501)
 ~[hector-core-0.7.0-26.jar:na]
  ... 19 common frames omitted



With the very same code and data.
I assume that the column name I give in cassandra.yaml is somehow not
interpreted as a TimeUUID or something.

Any help would be greatly appreciated.

Greetings,
roland


From: Michal Augustýn [mailto:augustyn.mic...@gmail.com]
Sent: Tuesday, 15 February 2011 16:22
To: user@cassandra.apache.org
Subject: Re: cant seem to figure out secondary index definition

Ah, ok. I checked that in the source and the problem is that you wrote
validation_class but it should be validator_class.

Augi
2011/2/15 Roland Gude roland.g...@yoochoose.com:
Yeah, I know about that, but the definition I have is for a cluster that is
started/stopped from a unit test with Hector's EmbeddedServerHelper, which
takes its definitions from the yaml.
So I'd still like to define the index in the yaml file (it should very well be
possible, I guess).


From: Michal Augustýn [mailto:augustyn.mic...@gmail.com]
Sent: Tuesday, 15 February 2011 15:53
To: user@cassandra.apache.org
Subject: Re: cant seem to figure out secondary index definition

Hi,

if you download Cassandra and look into conf/cassandra.yaml then you can see 
this:

this keyspace definition is for demonstration purposes only. Cassandra will 
not load these definitions during startup. See 
http://wiki.apache.org/cassandra/FAQ#no_keyspaces for an explanation.

So you should make all schema-related operation via Thrift/AVRO API, or you can 
use Cassandra CLI.

Augi

2011/2/15 Roland Gude roland.g...@yoochoose.com:
Hi,

I am a little puzzled by the creation of secondary indexes, and the docs in
that area are still very sparse.
What I am trying to do is: in a column family with TimeUUID comparator, I want
the special TimeUUID 00000000-0000-1000-0000-000000000000 to be indexed, the
value being some UTF8 string on which I want to perform equality checks.

What do I need to put in my cassandra.yaml file?
Something like this?

  - column_metadata: [{name: 00000000-0000-1000-0000-000000000000,
validation_class: UTF8Type, index_name: MyIndex, index_type: KEYS}]

This gives me that error:

15:05:12.492 [pool-1-thread-1] ERROR o.a.c.config.DatabaseDescriptor - Fatal 
error: null; Can't construct a java object for 
tag:yaml.org,2002:org.apache.cassandra.config.Config;
exception=Cannot create property=keyspaces for 
JavaBean=org.apache.cassandra.config.Config@7eb6e2; Cannot create 
property=column_families for 
JavaBean=org.apache.cassandra.config.RawKeyspace@987a33; Cannot create 
property=column_metadata for 
JavaBean=org.apache.cassandra.config.RawColumnFamily@716cb7; Cannot create 
property=validation_class for 
JavaBean=org.apache.cassandra.config.RawColumnDefinition@e29820; Unable to find 
property 'validation_class' on class: 
org.apache.cassandra.config.RawColumnDefinition
Bad configuration; unable to start server


I am furthermore uncertain whether the column name will be correctly used if
given like this. Should I put the byte representation of the UUID there?

Greetings,
roland
--
YOOCHOOSE GmbH

Roland Gude
Software Engineer

Im Mediapark 8, 50670 Köln

+49 221 4544151 (Tel)
+49 221 4544159 (Fax)
+49 171 7894057 (Mobil)


Email: roland.g...@yoochoose.com
WWW: www.yoochoose.com

YOOCHOOSE GmbH
Geschäftsführer: Dr. Uwe Alkemper, Michael Friedmann
Handelsregister: Amtsgericht Köln HRB 65275
Ust-Ident-Nr: DE 264 773 520
Sitz der Gesellschaft: Köln





rename index

2011-02-16 Thread Roland Gude
Hi,
unfortunately I made a copy-paste error and created two indexes called
myindex on different column families.
What can I do to fix this?

Below the output from describe keyspace

ColumnFamily: A
  Columns sorted by: org.apache.cassandra.db.marshal.TimeUUIDType
  Row cache size / save period: 0.0/0
  Key cache size / save period: 20.0/14400
 Memtable thresholds: 1.1203125/239/60
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Column Metadata:
Column Name: 00000000-0000-1000-0000-000000000000
  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
  Index Name: MyIndex
  Index Type: KEYS
ColumnFamily: B
  Columns sorted by: org.apache.cassandra.db.marshal.TimeUUIDType
  Row cache size / save period: 0.0/0
  Key cache size / save period: 20.0/14400
  Memtable thresholds: 1.1203125/239/60
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Column Metadata:
Column Name: 00000000-0000-1000-0000-000000000000
  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
  Index Name: MyIndex
  Index Type: KEYS

--
YOOCHOOSE GmbH

Roland Gude
Software Engineer

Im Mediapark 8, 50670 Köln

+49 221 4544151 (Tel)
+49 221 4544159 (Fax)
+49 171 7894057 (Mobil)


Email: roland.g...@yoochoose.com
WWW: www.yoochoose.com

YOOCHOOSE GmbH
Geschäftsführer: Dr. Uwe Alkemper, Michael Friedmann
Handelsregister: Amtsgericht Köln HRB 65275
Ust-Ident-Nr: DE 264 773 520
Sitz der Gesellschaft: Köln



cant seem to figure out secondary index definition

2011-02-15 Thread Roland Gude
Hi,

I am a little puzzled by the creation of secondary indexes, and the docs in
that area are still very sparse.
What I am trying to do is: in a column family with TimeUUID comparator, I want
the special TimeUUID 00000000-0000-1000-0000-000000000000 to be indexed, the
value being some UTF8 string on which I want to perform equality checks.

What do I need to put in my cassandra.yaml file?
Something like this?

  - column_metadata: [{name: 00000000-0000-1000-0000-000000000000,
validation_class: UTF8Type, index_name: MyIndex, index_type: KEYS}]

This gives me that error:

15:05:12.492 [pool-1-thread-1] ERROR o.a.c.config.DatabaseDescriptor - Fatal 
error: null; Can't construct a java object for 
tag:yaml.org,2002:org.apache.cassandra.config.Config; exception=Cannot create 
property=keyspaces for JavaBean=org.apache.cassandra.config.Config@7eb6e2; 
Cannot create property=column_families for 
JavaBean=org.apache.cassandra.config.RawKeyspace@987a33; Cannot create 
property=column_metadata for 
JavaBean=org.apache.cassandra.config.RawColumnFamily@716cb7; Cannot create 
property=validation_class for 
JavaBean=org.apache.cassandra.config.RawColumnDefinition@e29820; Unable to find 
property 'validation_class' on class: 
org.apache.cassandra.config.RawColumnDefinition
Bad configuration; unable to start server


I am furthermore uncertain whether the column name will be correctly used if
given like this. Should I put the byte representation of the UUID there?
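On the byte-representation question: a UUID's standard serialized form is 16 bytes, most significant long first, which is the layout a TimeUUIDType comparator works with. A plain-Java sketch with no Cassandra dependency (the example UUID assumes the special TimeUUID in this thread is 00000000-0000-1000-0000-000000000000):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Convert a java.util.UUID to its canonical 16-byte big-endian form.
public class UuidBytes {
    static byte[] toBytes(UUID uuid) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(uuid.getMostSignificantBits());   // bytes 0..7
        buf.putLong(uuid.getLeastSignificantBits());  // bytes 8..15
        return buf.array();
    }

    public static void main(String[] args) {
        UUID u = UUID.fromString("00000000-0000-1000-0000-000000000000");
        System.out.println(toBytes(u).length);  // 16
    }
}
```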

Greetings,
roland
--
YOOCHOOSE GmbH

Roland Gude
Software Engineer

Im Mediapark 8, 50670 Köln

+49 221 4544151 (Tel)
+49 221 4544159 (Fax)
+49 171 7894057 (Mobil)


Email: roland.g...@yoochoose.com
WWW: www.yoochoose.com

YOOCHOOSE GmbH
Geschäftsführer: Dr. Uwe Alkemper, Michael Friedmann
Handelsregister: Amtsgericht Köln HRB 65275
Ust-Ident-Nr: DE 264 773 520
Sitz der Gesellschaft: Köln



AW: cant seem to figure out secondary index definition

2011-02-15 Thread Roland Gude
Yeah, I know about that, but the definition I have is for a cluster that is
started/stopped from a unit test with Hector's EmbeddedServerHelper, which
takes its definitions from the yaml.
So I'd still like to define the index in the yaml file (it should very well be
possible, I guess).


From: Michal Augustýn [mailto:augustyn.mic...@gmail.com]
Sent: Tuesday, 15 February 2011 15:53
To: user@cassandra.apache.org
Subject: Re: cant seem to figure out secondary index definition

Hi,

if you download Cassandra and look into conf/cassandra.yaml then you can see 
this:

this keyspace definition is for demonstration purposes only. Cassandra will 
not load these definitions during startup. See 
http://wiki.apache.org/cassandra/FAQ#no_keyspaces for an explanation.

So you should make all schema-related operation via Thrift/AVRO API, or you can 
use Cassandra CLI.

Augi

2011/2/15 Roland Gude roland.g...@yoochoose.com:
Hi,

I am a little puzzled by the creation of secondary indexes, and the docs in
that area are still very sparse.
What I am trying to do is: in a column family with TimeUUID comparator, I want
the special TimeUUID 00000000-0000-1000-0000-000000000000 to be indexed, the
value being some UTF8 string on which I want to perform equality checks.

What do I need to put in my cassandra.yaml file?
Something like this?

  - column_metadata: [{name: 00000000-0000-1000-0000-000000000000,
validation_class: UTF8Type, index_name: MyIndex, index_type: KEYS}]

This gives me that error:

15:05:12.492 [pool-1-thread-1] ERROR o.a.c.config.DatabaseDescriptor - Fatal 
error: null; Can't construct a java object for 
tag:yaml.org,2002:org.apache.cassandra.config.Config;
exception=Cannot create property=keyspaces for 
JavaBean=org.apache.cassandra.config.Config@7eb6e2; Cannot create 
property=column_families for 
JavaBean=org.apache.cassandra.config.RawKeyspace@987a33; Cannot create 
property=column_metadata for 
JavaBean=org.apache.cassandra.config.RawColumnFamily@716cb7; Cannot create 
property=validation_class for 
JavaBean=org.apache.cassandra.config.RawColumnDefinition@e29820; Unable to find 
property 'validation_class' on class: 
org.apache.cassandra.config.RawColumnDefinition
Bad configuration; unable to start server


I am furthermore uncertain whether the column name will be correctly used if
given like this. Should I put the byte representation of the UUID there?

Greetings,
roland
--
YOOCHOOSE GmbH

Roland Gude
Software Engineer

Im Mediapark 8, 50670 Köln

+49 221 4544151 (Tel)
+49 221 4544159 (Fax)
+49 171 7894057 (Mobil)


Email: roland.g...@yoochoose.com
WWW: www.yoochoose.com

YOOCHOOSE GmbH
Geschäftsführer: Dr. Uwe Alkemper, Michael Friedmann
Handelsregister: Amtsgericht Köln HRB 65275
Ust-Ident-Nr: DE 264 773 520
Sitz der Gesellschaft: Köln




AW: cant seem to figure out secondary index definition

2011-02-15 Thread Roland Gude
Thanks, it works.

roland

From: Michal Augustýn [mailto:augustyn.mic...@gmail.com]
Sent: Tuesday, 15 February 2011 16:22
To: user@cassandra.apache.org
Subject: Re: cant seem to figure out secondary index definition

Ah, ok. I checked the source, and the problem is that you wrote 
validation_class where it should be validator_class.
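For reference, a sketch of the corrected entry (the truncated column name and MyIndex are exactly as in the original post; only the misspelled key changes):

```yaml
# Corrected column_metadata entry: the key is validator_class,
# not validation_class (everything else unchanged from the post).
- column_metadata: [{name: --1000--,
    validator_class: UTF8Type, index_name: MyIndex, index_type: KEYS}]
```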

Augi
2011/2/15 Roland Gude 
roland.g...@yoochoose.com
Yeah, I know about that, but the definition I have is for a cluster that is 
started/stopped from a unit test with Hector's EmbeddedServerHelper, which takes 
its definitions from the yaml.
So I'd still like to define the index in the yaml file (it should very well be 
possible, I guess).


From: Michal Augustýn [mailto:augustyn.mic...@gmail.com]
Sent: Tuesday, 15 February 2011 15:53
To: user@cassandra.apache.org
Subject: Re: can't seem to figure out secondary index definition

Hi,

if you download Cassandra and look into conf/cassandra.yaml then you can see 
this:

"this keyspace definition is for demonstration purposes only. Cassandra will 
not load these definitions during startup. See 
http://wiki.apache.org/cassandra/FAQ#no_keyspaces for an explanation."

So you should make all schema-related operations via the Thrift/AVRO API, or you 
can use the Cassandra CLI.

Augi

2011/2/15 Roland Gude 
roland.g...@yoochoose.com
Hi,

I am a little puzzled about the creation of secondary indexes, and the docs in 
that area are still very sparse.
What I am trying to do is - in a columnfamily with TimeUUID comparator, I want 
the special timeuuid --1000-- to be indexed. The 
value being some UTF8 string on which I want to perform equality checks.

What do I need to put in my cassandra.yaml file?
Something like this?

  - column_metadata: [{name: --1000--, 
validation_class: UTF8Type, index_name: MyIndex, index_type: KEYS}]

This gives me that error:

15:05:12.492 [pool-1-thread-1] ERROR o.a.c.config.DatabaseDescriptor - Fatal 
error: null; Can't construct a java object for 
tag:yaml.org,2002:org.apache.cassandra.config.Config; 
exception=Cannot create property=keyspaces for 
JavaBean=org.apache.cassandra.config.Config@7eb6e2; Cannot create 
property=column_families for 
JavaBean=org.apache.cassandra.config.RawKeyspace@987a33; Cannot create 
property=column_metadata for 
JavaBean=org.apache.cassandra.config.RawColumnFamily@716cb7; Cannot create 
property=validation_class for 
JavaBean=org.apache.cassandra.config.RawColumnDefinition@e29820; Unable to find 
property 'validation_class' on class: 
org.apache.cassandra.config.RawColumnDefinition
Bad configuration; unable to start server


I am furthermore uncertain whether the column name will be used correctly if 
given like this. Should I put the byte representation of the UUID there?

Greetings,
roland





RE: Data ends up in wrong Columnfamily

2011-02-11 Thread Roland Gude
Hi,

machine A has absolutely no knowledge about the other application; not even the 
columnfamily name.
I was digging into this further:

Since the data I find in the wrong space has a timestamp in its row key It was 
quite easy to find out that the data was relatively old. Unfortunately from a 
time where I do not have batch mutation logs from the server side.
I think this might be related to the “deleted columns reappear” thread, as I 
saw the following happen:


· I truncated the columnfamily that contained wrong data using the Cassandra CLI.

· I regenerated the correct data for that columnfamily.

· I ran repair on a node in the cluster.

· The data reappeared.

I tried this multiple times, and even tried to truncate the columnfamily using 
clustertool, on the slight chance that it does something different from the CLI 
when truncating. But up to now I have not been successful in removing the data 
from the cluster.
Another strange thing about the issue is that repair seems to blow up the data 
indefinitely.
The columnfamily that contains wrong data holds around 200 KB of correct data 
before I repair. The complete cluster contains around 6 GB of data (3 nodes, 
3 GB each, replication factor 2). After repair on one node, that node contains 
about 14 GB of data. If I then trigger a repair on the second node, it gets to 
around 24 GB of data before it runs out of memory.
Getting to 24 GB of data seems impossible to me given the amount of data I have 
written to the cluster. I can only imagine that it is data that was once 
deleted but keeps reappearing, and while doing so, it reappears in the wrong 
place.
Note that the columnfamily that contains the wrong data did not even exist when 
the data was first written (it was created with the CLI only a couple of days 
ago, while the oldest row I could find that was not supposed to exist was from 
January 7th).

We did fail to run repair regularly on that cluster in the meantime.

If I find a BatchMutation log that indicates an incorrect write received by the 
server, I will post it.

Greetings,

roland
From: Aaron Morton [mailto:aa...@thelastpickle.com]
Sent: Thursday, 10 February 2011 21:37
To: user@cassandra.apache.org
Subject: Re: Data ends up in wrong Columnfamily

Not heard of that before, chances are it's a problem in your code. Does machine 
A even know the other CF name? Can you log the batch mutations you are sending? 
When it appears in the other CF is the data complete?

There is also a Hector list, perhaps they can help.

Aaron

On 10/02/2011, at 11:58 PM, Roland Gude 
roland.g...@yoochoose.com wrote:
Hi,

I am experiencing a strange issue. I have two applications writing to Cassandra 
(in different column families in the same keyspace). The applications reside on 
different machines and know nothing about the existence of each other.
They both produce data and write it to Cassandra with batch mutations using 
Hector.
So far so good, but it regularly happens that data from one application ends 
up in columnfamilies reserved for the other application, as well as in the 
intended columnfamily.

Machine A writes to column family CF_A
Machine B writes to column families CF_B to CF_N

Regularly, data that was written (according to my application logs) from Machine 
A to CF_A ends up both in CF_A and in one of the other columnfamilies.

Any ideas why this could be happening?

I am using Cassandra 0.7.0 and hector 0.7.0-23

Greetings,
Roland




RE: Why is it when I removed a row the RowKey is still there?

2011-02-11 Thread Roland Gude
It has to do with the way data is deleted in Cassandra. You are not doing 
anything wrong.
See http://wiki.apache.org/cassandra/FAQ#range_ghosts
or http://wiki.apache.org/cassandra/DistributedDeletes
for some more detail.
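As a plain-Java sketch (no Cassandra dependency; the data is made up to mirror the CLI output in the quoted question), client code typically handles such range ghosts by skipping rows that come back with zero live columns:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a range/slice query may return "ghost" rows whose key
// is still present but whose columns have all been deleted (tombstoned).
// Until the tombstones expire and compaction removes them, clients simply
// skip rows that carry no live columns.
public class RangeGhosts {
    static Map<String, List<String>> dropGhosts(Map<String, List<String>> rows) {
        Map<String, List<String>> live = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : rows.entrySet()) {
            if (!e.getValue().isEmpty()) {  // ghost rows have zero live columns
                live.put(e.getKey(), e.getValue());
            }
        }
        return live;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("3", new ArrayList<>());          // deleted row: key only
        rows.put("2", List.of("name=Bill"));
        rows.put("1", List.of("name=Joe"));
        System.out.println(dropGhosts(rows).keySet()); // prints [2, 1]
    }
}
```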


-Original Message-
From: Joshua Partogi [mailto:joshua.j...@gmail.com]
Sent: Friday, 11 February 2011 11:34
To: user@cassandra.apache.org
Subject: Why is it when I removed a row the RowKey is still there?

Hi,

I am very puzzled by this. I removed a row from the client, but
when I query the data from the CLI, the row key is still there:
RowKey: 3
---
RowKey: 2
= (column=6e616d65, value=42696c6c, timestamp=1297338131027004)
---
RowKey: 1
= (column=6e616d65, value=4a6f65, timestamp=1297420269035522)


Did I do something wrong? What do I need to do in order to completely
remove the entire row along with its key?

Thank you for the assistance.

Kind regards,
Joshua

-- 
http://twitter.com/jpartogi




RE: Data ends up in wrong Columnfamily

2011-02-11 Thread Roland Gude
Yes, this could very well be the issue.
As I see, it's already fixed for 0.7.1. Hopefully it will pass a vote soon.

Thanks,

Roland

-Original Message-
From: sc...@scode.org [mailto:sc...@scode.org] On behalf of Peter Schuller
Sent: Friday, 11 February 2011 09:11
To: user@cassandra.apache.org
Subject: Re: Data ends up in wrong Columnfamily

 So far so good, but it regularly happens, that data from one application
 ends up in columnfamilies reserved for the other application as well as the
 intended columnfamily.

Maybe https://issues.apache.org/jira/browse/CASSANDRA-1992

-- 
/ Peter Schuller



RE: cassandra solaris x64 support

2011-02-11 Thread Roland Gude
This is a problem with the start scripts, not with Cassandra itself (or any of 
its configuration).
The shell you are using cannot run the cassandra start script.

Try 
#bash bin/cassandra -f

As far as I know, it should work fine. Actually it should work with sh as 
well...


-Original Message-
From: Xiaobo Gu [mailto:guxiaobo1...@gmail.com]
Sent: Friday, 11 February 2011 16:12
To: user@cassandra.apache.org
Subject: Re: cassandra solaris x64 support

On Fri, Feb 11, 2011 at 10:51 PM, Jonathan Ellis jbel...@gmail.com wrote:
 The vast majority run on Linux, but there are a few people running
 Cassandra on Solaris, FreeBSD, and Windows.
But I failed to start the one node test cluster,
# sh bin/cassandra -f
bin/cassandra: syntax error at line 22: `MAX_HEAP_SIZE=$' unexpected

My environment is as follows:
# more /etc/release
   Solaris 10 10/09 s10x_u8wos_08a X86
   Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
Use is subject to license terms.
   Assembled 16 September 2009

# java -fullversion
java full version 1.6.0_23-b05
# java -version
java version 1.6.0_23
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) Client VM (build 19.0-b09, mixed mode, sharing)

I changed initial_token:0


 On Fri, Feb 11, 2011 at 4:40 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote:
 Hi,
 Because I can't access the archives of the mailing list, my
 apologies if someone has asked this before.

 Has anyone successfully run Cassandra on Solaris 10 x64 clusters?

 Regards,

 Xiaobo Gu




 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com





Data ends up in wrong Columnfamily

2011-02-10 Thread Roland Gude
Hi,

I am experiencing a strange issue. I have two applications writing to Cassandra 
(in different column families in the same keyspace). The applications reside on 
different machines and know nothing about the existence of each other.
They both produce data and write it to Cassandra with batch mutations using 
Hector.
So far so good, but it regularly happens that data from one application ends 
up in columnfamilies reserved for the other application, as well as in the 
intended columnfamily.

Machine A writes to column family CF_A
Machine B writes to column families CF_B to CF_N

Regularly, data that was written (according to my application logs) from Machine 
A to CF_A ends up both in CF_A and in one of the other columnfamilies.

Any ideas why this could be happening?

I am using Cassandra 0.7.0 and hector 0.7.0-23

Greetings,
Roland




strange issue with timeUUID columns

2010-12-22 Thread Roland Gude
Hi,

I am experiencing a strange issue when using TimeUUID column keys.
I am storing a number of events with a TimeUUID key in a row. Later I try to 
query for a slice of that row with a given lower-bound TimeUUID and 
upper-bound TimeUUID (constructed as described in the wiki).
If I inserted the events in ascending order, everything goes well.
If for some reason I insert the events in random order (which may very well 
happen in a concurrent scenario) and later query for the data (even with much 
more tolerant bounds), I get no data back.
Furthermore, if I wait for some time (about 15 minutes seems to be sufficient), 
I can query the data again.
The Cassandra I use is a single-node 0.7.0-rc2.
I am querying with Hector.
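For background, here is a minimal plain-Java sketch (using only java.util.UUID; the clock-sequence and node bits are zeroed placeholders, not a real generator) of how a version-1 TimeUUID carries its 60-bit timestamp, which is the value the TimeUUID comparator orders by first and the value slice bounds are normally built from:

```java
import java.util.UUID;

// Sketch: pack a 60-bit timestamp into the version-1 UUID layout
// (time_low | time_mid | version+time_hi) and read it back. Slice-bound
// UUIDs for a TimeUUID column range are typically built like this from
// the endpoints of the desired time interval. The clock sequence and
// node fields are zeroed placeholders, not a real UUID generator.
public class TimeUuidSketch {
    static UUID fromTimestamp(long ts60) {
        long timeLow = ts60 & 0xFFFFFFFFL;
        long timeMid = (ts60 >>> 32) & 0xFFFFL;
        long timeHi  = (ts60 >>> 48) & 0x0FFFL;
        long msb = (timeLow << 32) | (timeMid << 16) | 0x1000L | timeHi;
        long lsb = 0x8000000000000000L;  // IETF variant bits, zero clock/node
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID lower = fromTimestamp(1000L);
        UUID upper = fromTimestamp(2000L);
        System.out.println(lower.version());                        // prints 1
        System.out.println(lower.timestamp() < upper.timestamp());  // prints true
    }
}
```

If the bounds come from a wall clock, skew between client and server clocks can make a freshly built upper bound fall below recently written keys, which would match the "works again after a while" symptom described above.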

Has anyone else experienced such issues?
Can someone think of an explanation for this?

Kind regards,
roland
