Re: [RELEASE] Apache Cassandra 1.0.5 released

2011-12-01 Thread Evgeniy Ryabitskiy
+1
After upgrading to 1.0.5, I also get a timeout exception on secondary index
search (the get_indexed_slices API).


Re: 1.0.3 CLI oddities

2011-11-28 Thread Evgeniy Ryabitskiy
Hi,

I just migrated to 1.0.3 and got the same error.

I did the following:
1) Created a CF with compression.
2) Updated the CF metadata on the newly created CF.

The update failed with the same exception:

Caused by: java.util.concurrent.ExecutionException: java.io.IOException:
org.apache.cassandra.config.ConfigurationException: Invalid negative or
null chunk_length_kb


After removing the compression settings, the update succeeded.
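As a possible workaround (a sketch only; I have not verified this against 1.0.3), explicitly setting chunk_length_kb in the compression options might avoid the null value the exception complains about. The CF name and the 64 KB chunk size below are placeholders:

```
update column family MyCF
  with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};
```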


Evgeny.


Re: Setting java heap size for Cassandra process

2011-09-21 Thread Evgeniy Ryabitskiy
Looks like I have the same problem as here:
https://issues.apache.org/jira/browse/CASSANDRA-2868

But it's been fixed in 0.8.5, and I'm using 0.8.5...


Evgeny.


Setting java heap size for Cassandra process

2011-09-20 Thread Evgeniy Ryabitskiy
Hi,
I am running Cassandra on Linux VMs; each VM has 2 GB RAM and a 4-core CPU.
I'm using the RPM distribution and have set -Xmx to 512M in cassandra-env.sh.

After a day of running, I see that the Cassandra process is using over 80% of
memory, which is three times more than 512M.
As a result, after two days of running, the Cassandra process is killed by the
OS (without an OutOfMemoryError).

Here is the ps aux output. Could you help me understand this behavior?


USER  PID %CPU %MEM     VSZ     RSS TTY STAT START  TIME COMMAND
103  1067  2.5 80.7 6417932 1693508 ?   SLl  01:06 24:30
/usr/java/jre1.6.0_26/bin/java -ea
-javaagent:/usr/share/cassandra//lib/jamm-0.2.2.jar -XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42 -Xms512M -Xmx512M -Xmn128M
-XX:+HeapDumpOnOutOfMemoryError -Xss128k -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true
-Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dlog4j.configuration=log4j-server.properties
-Dlog4j.defaultInitOverride=true
-Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp
/etc/cassandra/conf:/usr/share/cassandra/lib/antlr-3.2.jar:/usr/share/cassandra/lib/apache-cassandra-0.8.5.jar:/usr/share/cassandra/lib/apache-cassandra-thrift-0.8.5.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-collections-3.2.1.jar:/usr/share/cassandra/lib/commons-lang-2.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.1.jar:/usr/share/cassandra/lib/guava-r08.jar:/usr/share/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/cassandra/lib/jackson-core-asl-1.4.0.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.4.0.jar:/usr/share/cassandra/lib/jamm-0.2.2.jar:/usr/share/cassandra/lib/jline-0.9.94.jar:/usr/share/cassandra/lib/jna.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.6.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.6.1.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.6.1.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar
org.apache.cassandra.thrift.CassandraDaemon


Evgeny.


Re: Setting java heap size for Cassandra process

2011-09-20 Thread Evgeniy Ryabitskiy
Thanks for the reply; now it's much clearer.

Top shows this:

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
   67 cassandr 18  0 6267m 1.6g 805m S  0.3 79.0 24:35.80 java

The huge VIRT is fine, since it's a 64-bit architecture.
But RES keeps growing.

And I still have questions:

1) If I don't have swap on the node, does every mmap add to my real RES
memory, which is limited to 2 GB?
2) Is there any way to limit the RES memory used by mmaps?
3) Is there a common solution for the case where memory runs out because of
mmaps and the Cassandra process gets killed by the OS?

I guess switching disk_access_mode to standard would solve this problem, but
it decreases performance.
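For reference, the relevant knob lives in cassandra.yaml; a sketch (value names as I remember them from the 0.8 config, worth double-checking against the shipped yaml):

```yaml
# auto (the default) mmaps both data and index files on 64-bit JVMs;
# mmap_index_only maps only index files; standard disables mmap entirely.
disk_access_mode: standard
```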

Evgeny.


Re: Index search in provided list of rows (list of rowKeys).

2011-09-14 Thread Evgeniy Ryabitskiy
Why is it radical?

It would be the same get_indexed_slices search, but within a specified set of
rows. So mostly it would be one more search expression, over row keys rather
than only column values. Usually, the more restrictions you can specify in a
search query, the faster the search can be (not slower, at least).

About moving to another engine:

Sphinx has its advantages (it's quite fast) and disadvantages (painful
integration, lots of limitations). My company is currently using it in
production, so moving to another search engine is a big step, but it will be
considered.


What I want to discuss is the common task of searching in Cassandra. Maybe I
am missing some already well-known solution for it (a silver bullet)?
I see only 2 solutions:

1) Use an external search engine that indexes all stored fields.

advantages:
supports full-text search
some engines have nice search features, like sorting by relevance

disadvantages:
to support range scans it stores column values, which means a huge part of the
Cassandra data is also stored in the search engine's metadata
engines usually have a set of limitations

2) Use Cassandra's built-in secondary index search.

advantages:
no need to index all the columns that are used for filtering
filtering is performed in storage, close to the data

disadvantages:
no full-text search support
requires creating and maintaining secondary indexes

The two solutions are exclusive: you can choose only one, and there is no way
to use a combination of the two (except intersection on the client side, which
is not a solution).

So the API that was discussed would open up the possibility of using that
combination.
To me it looks like a third solution. Could it really change the way we search
in Cassandra?


Evgeny.


Index search in provided list of rows (list of rowKeys).

2011-09-12 Thread Evgeniy Ryabitskiy
Hi,

We have a search problem in Cassandra, and we are using Sphinx for indexing.
Because of Sphinx's architecture, we can't use range queries over all the
fields we need.
So we have to run a Sphinx query first to get a list of row keys, and then
perform additional range filtering over column values.

The first simple solution is to do it on the client side. That increases
network traffic and memory usage on the client.

Now I'm wondering whether it is possible to perform such filtering on the
Cassandra side.
I would like to use an IndexExpression for range filtering within a list of
records (the list of row keys returned from the external search engine).

Looking at get_indexed_slices, I found that there is no way to set a list of
row keys in IndexClause (as there is for multiget_slice), only a start_key.
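For context, IndexClause in the 0.8-era Thrift interface looks roughly like this (reproduced from memory, so field ids and defaults should be checked against cassandra.thrift); the proposal amounts to adding a list-of-keys field alongside start_key:

```thrift
struct IndexClause {
  1: required list<IndexExpression> expressions,
  2: required binary start_key,
  3: required i32 count = 100,
}
```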

So 2 questions:

1) Am I missing something, and is my idea possible via some other API?
2) If it is not possible, can I file a JIRA for this feature?

Evgeny.


Re: Index search in provided list of rows (list of rowKeys).

2011-09-12 Thread Evgeniy Ryabitskiy
Something like this.

Actually, I think it's better to extend the get_indexed_slices() API instead
of creating a new Thrift method.
I wish to have something like this:

// here we run a query against the external search engine
List<byte[]> keys = performSphinxQuery(someFullTextSearchQuery);
IndexClause indexClause = new IndexClause();

// proposed API to set the list of keys
indexClause.setKeys(keys);
indexClause.setExpressions(someFilteringExpressions);
List<KeySlice> finalResult = get_indexed_slices(colParent, indexClause,
colPredicate, cLevel);



I can't solve my issue with a single get_indexed_slices() call.
Here is the issue in more detail:
1) We have ~6 million records; in the future there could be many more.
2) We have about 10k different properties (stored as column values in
Cassandra); in the future there could be many more.
3) Properties are text descriptions, int/float values, and string values.
4) We need to implement search over all properties: full-text search for text
descriptions, range search for int/float properties.
5) A search query could use any combination of property restrictions, such as
a full-text search over a description plus a range expression on an int/float
field.
6) We have an external search engine (Sphinx) that indexes all string and text
properties.
7) We still need to perform range search on int and float fields.

So now I split my query expressions into 2 groups:
1) expressions that can be handled by the search engine
2) the others (additional filters)

For example, I run the first query against Sphinx and get a list of row keys
of length 100k (call it RESULT1).
Now I need to filter it by the second group of expressions. For example, I
have a simple expression: age > 25.
So imagine I ran get_indexed_slices() with this query; I could possibly get
half of my records in the result (call it RESULT2).
Then I would need to compute the intersection of RESULT1 and RESULT2 on the
client side, which could take a lot of time and memory.
That is why I can't use a single get_indexed_slices() here.

For me it is better to iterate over RESULT1 (with its 100k records) on the
client side, filter by age, and get 10-50k records as the final result. The
disadvantage is that I have to fetch all 100k records.
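To make the client-side cost concrete, here is a minimal sketch of the intersection step described above (plain Python with fake result sets; performSphinxQuery and get_indexed_slices are only hypothetical sources here, not real calls):

```python
def intersect_results(sphinx_keys, index_keys):
    """Intersect full-text matches with secondary-index matches.

    Both arguments are collections of row keys; in the real setup they
    would come from the Sphinx query and from get_indexed_slices.
    """
    return set(sphinx_keys) & set(index_keys)

result1 = {"row1", "row2", "row3", "row4"}  # RESULT1: Sphinx matches
result2 = {"row2", "row4", "row5"}          # RESULT2: rows with age > 25
print(sorted(intersect_results(result1, result2)))  # ['row2', 'row4']
```

With 100k keys on one side and potentially millions on the other, both sets have to be materialized on the client, which is exactly the cost a server-side setKeys() API would avoid.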

Evgeny.


UnavailableException while storing with EACH_QUORUM and RF=3

2011-09-05 Thread Evgeniy Ryabitskiy
Hi,

I'm trying to store a record with EACH_QUORUM consistency and RF=3, while the
same thing with RF=2 works.
Could someone tell me why EACH_QUORUM works with RF=2 but not with RF=3?

I have 7 nodes cluster. All nodes are UP.

Here is simple CLI script:

create keyspace kspace3
with placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy'
and strategy_options = [{datacenter1:3}];

use kspace3;
create column family User with comparator = UTF8Type;

consistencylevel as LOCAL_QUORUM;
set User[1]['name'] = 'Smith';
consistencylevel as EACH_QUORUM;
set User[2]['name'] = 'Smith';
list User;


The result contains only one record; the EACH_QUORUM write failed:

[default@unknown] create keyspace kspace3
...with placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy'
...and strategy_options = [{datacenter1:3}];
dd350870-d7ce-11e0--5025568f27ff
Waiting for schema agreement...
... schemas agree across the cluster
[default@unknown]
[default@unknown] use kspace3;
Authenticated to keyspace: kspace3
[default@kspace3] create column family User with comparator = UTF8Type;
dd45f860-d7ce-11e0--5025568f27ff
Waiting for schema agreement...
... schemas agree across the cluster
[default@kspace3]
[default@kspace3] consistencylevel as LOCAL_QUORUM;
Consistency level is set to 'LOCAL_QUORUM'.
[default@kspace3] set User[1]['name'] = 'Smith';
Value inserted.
[default@kspace3] consistencylevel as EACH_QUORUM;
Consistency level is set to 'EACH_QUORUM'.
[default@kspace3] set User[2]['name'] = 'Smith';
null
[default@kspace3] list User;
Using default limit of 100
---
RowKey: 01
= (column=name, value=536d697468, timestamp=1315234443834000)

1 Row Returned.


While the same thing with RF=2 works:

[default@kspace3] create keyspace kspace2
...with placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy'
...and strategy_options = [{datacenter1:2}];
Keyspace already exists.
[default@kspace3]
[default@kspace3] use kspace2;
Authenticated to keyspace: kspace2
[default@kspace2] create column family User with comparator = UTF8Type;
User already exists in keyspace kspace2
[default@kspace2]
[default@kspace2] set User[1]['name'] = 'Smith';
Value inserted.
[default@kspace2] consistencylevel as EACH_QUORUM;
Consistency level is set to 'EACH_QUORUM'.
[default@kspace2] set User[2]['name'] = 'Smith';
Value inserted.
[default@kspace2] list User;
Using default limit of 100
---
RowKey: 01
= (column=name, value=536d697468, timestamp=1315234997189000)
---
RowKey: 02
= (column=name, value=536d697468, timestamp=1315234997198000)

2 Rows Returned.
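For reference, both quorum consistency levels use the same per-datacenter majority arithmetic; a small sketch of the standard floor(RF/2)+1 formula (not code from Cassandra itself):

```python
def quorum(rf):
    # A quorum is a strict majority of the replicas: floor(rf / 2) + 1.
    return rf // 2 + 1

# EACH_QUORUM needs this many acks in *every* datacenter,
# LOCAL_QUORUM only in the coordinator's datacenter.
for rf in (1, 2, 3):
    print("RF=%d -> quorum=%d" % (rf, quorum(rf)))
```

Note that quorum(3) is 2, so RF=3 should tolerate one unavailable replica per DC while RF=2 tolerates none; by this arithmetic alone the RF=3 failure above is surprising, which suggests the cause lies elsewhere (replica placement or DC naming, for example).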


Re: UnavailableException while storing with EACH_QUORUM and RF=3

2011-09-05 Thread Evgeniy Ryabitskiy
One more thing, Cassandra version is 0.8.4.
And if I try the same thing from Pelops (Thrift), I get an UnavailableException.


Re: UnavailableException while storing with EACH_QUORUM and RF=3

2011-09-05 Thread Evgeniy Ryabitskiy
great thanks!

Evgeny.


Re: Trying to understand QUORUM and Strategies

2011-09-02 Thread Evgeniy Ryabitskiy
So:
you have created your keyspace with SimpleStrategy.
If you want to use LOCAL_QUORUM, you should create the keyspace (or change the
existing one) with NetworkTopologyStrategy.

I have provided CLI examples of how to do it. If you are creating the keyspace
from Hector, you have to do the same via the Java API.
Evgeny.


Re: Trying to understand QUORUM and Strategies

2011-08-31 Thread Evgeniy Ryabitskiy
Hi,
actually you can use LOCAL_QUORUM and EACH_QUORUM everywhere: on DEV, QA, and
Prod.
It would even be better for integration tests to use the same consistency
level as in production.

For production with multiple DCs, you usually need to choose between 2 common
setups: geographical distribution or disaster recovery.
See: http://www.datastax.com/docs/0.8/operations/datacenter
LOCAL_QUORUM and EACH_QUORUM for DEV/QA/Prod, by example:

create keyspace KeyspaceDEV
with placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy'
and strategy_options=[{datacenter1:1}];

create keyspace KeyspaceQA
with placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy'
and strategy_options=[{datacenter1:2}];

create keyspace KeyspaceProd
with placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy'
and strategy_options=[{datacenter1:3, datacenter2:3}];


Be careful(!!!): usually the default name of the DC in a new cluster is
datacenter1, but cassandra-cli uses the default name DC1 (a small
mismatch/bug, maybe).

Evgeny.


NPE while get_range_slices in 0.8.1

2011-08-26 Thread Evgeniy Ryabitskiy
Hi,

we have a 4-node Cassandra cluster (version 0.8.1) with 2 CFs. While the first
CF works properly (read/store), a get_range_slices query on the second CF
returns an NPE.
Any idea why this happens? Maybe it is a known bug that was fixed by 0.8.3?



ERROR [pool-2-thread-51] 2011-08-25 15:02:04,360 Cassandra.java (line 3210)
Internal error processing get_range_slices
java.lang.NullPointerException
at org.apache.cassandra.db.ColumnFamily.diff(ColumnFamily.java:298)
at org.apache.cassandra.db.ColumnFamily.diff(ColumnFamily.java:406)
at
org.apache.cassandra.service.RowRepairResolver.maybeScheduleRepairs(RowRepairResolver.java:103)
at
org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:120)
at
org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:85)
at
org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:74)
at
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at
org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:715)
at
org.apache.cassandra.thrift.CassandraServer.get_range_slices(CassandraServer.java:617)
at
org.apache.cassandra.thrift.Cassandra$Processor$get_range_slices.process(Cassandra.java:3202)
at
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
at
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)
at java.lang.Thread.run(Unknown Source)


Re: Is Cassandra suitable for this use case?

2011-08-25 Thread Evgeniy Ryabitskiy
Hi,

If you want to store files with partitioning/replication, you could use a
Distributed File System (DFS),
like http://hadoop.apache.org/hdfs/
or any other:
http://en.wikipedia.org/wiki/Distributed_file_system

You could still use Cassandra to store any metadata, including the file's path
in the DFS.

So: Cassandra + HDFS would be my solution.

Evgeny.


Re: 5 node cluster - Recommended seed configuration.

2011-08-09 Thread Evgeniy Ryabitskiy
> Rule of thumb: you should identify two servers in the cluster to be your
> seed nodes.

Is this rule the same for an N-node cluster? Is there any common practice or
formula for the number of seeds?
We are going to use about 10 nodes and extend the cluster in the future.



-- 
Evgeniy Ryabitskiy


Re: 5 node cluster - Recommended seed configuration.

2011-08-09 Thread Evgeniy Ryabitskiy
Thanks a lot!

Maybe this should be added to the Cassandra FAQ;
http://wiki.apache.org/cassandra/FAQ#seed has less information.


-- 
Evgeniy Ryabitskiy